Nucleic acid chaperons: a theory of an RNA-assisted protein folding
Theoretical Biology and Medical Modelling volume 2, Article number: 35 (2005)
Proteins are assumed to contain all the information necessary for unambiguous folding (Anfinsen's principle). However, ab initio structure prediction is often not successful because the amino acid sequence itself is not sufficient to guide between endless folding possibilities. It seems logical to try to find the "missing" information in nucleic acids, in the redundant codon base.
mRNA energy dot plots and protein residue contact maps were found to be rather similar. The structure of mRNA is also conserved if the protein structure is conserved, even if the sequence similarity is low. These observations led me to suppose that some similarity might exist between nucleic acid and protein folding. I found that amino acid pairs, which are co-located in the protein structure, are preferentially coded by complementary codons. This codon complementarity is not perfect; it is suboptimal where the 1st and 3rd codon residues are complementary to each other in reverse orientation, while the 2nd codon letters may be, but are not necessarily, complementary.
Partial complementary coding of co-locating amino acids in protein structures suggests that mRNA assists in protein folding and functions not only as a template but even as a chaperon during translation. This function explains the role of wobble bases and answers the mystery of why we have a redundant codon base.
The protein folding problem has been one of the grand challenges in computational molecular biology. The problem is to predict the native three-dimensional structure of a protein from its amino acid sequence. It is widely believed that the amino acid sequence contains all the necessary information for the correct three-dimensional structure, since protein folding is apparently thermodynamically determined; i.e., given a proper environment, a protein will fold spontaneously to the correct conformation. This is called Anfinsen's thermodynamic principle.
The thermodynamic principle has been confirmed many times on many different kinds of proteins in vitro. Critics say that the in vivo chemical conditions are different from those in vitro, that correct protein folding is determined by interactions with other molecules (chaperons, hormones, substrates, etc.), and that it is much more complex than the renaturation of denatured poly amino acids. The fact that many naturally-occurring proteins fold reliably and quickly to their native states, despite the astronomical number of possible configurations, has come to be known as Levinthal's paradox.
Anfinsen's principle was formulated in the 1960s using purely chemical experiments and a lot of intuition. Today, many sequences and structures are available to establish a logical and understandable link between sequence, structure and function. But it is still not possible to predict the structure (or a range of possible structures) correctly from the sequence alone, ab initio and in silico.
There are two potential, external sources of additional and specific protein folding information: (a) the chaperons (other proteins that assist in the folding of proteins and nucleic acids); and (b) the protein-coding nucleic acid sequences themselves (which are templates for protein synthesis, but are not defined as chaperons). Protein chaperons are not necessarily similar to their clients; they can be complementary templates, too, as is well known from nucleic acid interactions. However, chaperons necessarily contain spatial information (in some form) that guides another protein to fold correctly. Chaperoning requires subtle interactions with the immaturely folded intermediate so that its structure is loosened and it is then released for successive rounds of folding attempts. (Some aspects of this situation might be compared to enzyme-substrate interactions and kinetics.)
The possibility that the nucleotide sequence itself could modulate translation and hence affect co-translational folding and assembly of proteins has been investigated in a number of studies [5–7]. Studies on the relationships between synonymous codon usage and protein secondary structural units are especially popular [8–10]. The genetic code is redundant (61 codons encode 20 amino acids) and as many as six synonymous codons can encode the same amino acid (Arg, Leu, Ser). The "wobble" base has no effect on the meaning of most codons, but codon usage (wobble usage) is nevertheless not randomly defined [11, 12] and there are well-known, stable species-specific differences in codon usage. It seems reasonable to search for the meaning (biological purpose) of the wobble bases in association with protein folding.
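The redundancy figures quoted above can be checked directly against the standard genetic code. The following is a minimal sketch; the names `CODON_TABLE` and `degeneracy` are chosen here for illustration and do not come from the original study.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...; '*' = stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(bases): aa
    for bases, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of synonymous codons per amino acid (stop codons excluded).
degeneracy = Counter(aa for aa in CODON_TABLE.values() if aa != "*")
sense_codons = sum(degeneracy.values())
six_fold = sorted(aa for aa, n in degeneracy.items() if n == 6)

print(sense_codons, len(degeneracy), six_fold)  # 61 20 ['L', 'R', 'S']
```

The three six-fold degenerate amino acids are Leu (L), Arg (R) and Ser (S), in agreement with the text.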
Materials and methods
We have developed a tool, SeqX, which is specially designed to provide 2D projections of protein structures (residue contact maps) and analyze residue co-locations statistically in these structures. We have collected residue co-location statistics (residue contact statistics) from 80 different structures from the Protein Data Bank (PDB). This non-redundant SeqX data set listed ~35,000 amino acid co-locations (i.e. residues located within a 6 Å radius of the alpha carbon atoms; neighbor residues on the same strand were excluded).
The mfold tool was used to obtain RNA structure data, and the energy dot-plots provided by this program were used to estimate the site and size of the most probable RNA folding.
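The co-location criterion can be illustrated with a short sketch. This is not the SeqX implementation; it assumes that alpha carbon coordinates are already available as (x, y, z) tuples in Å, and it interprets the exclusion of "neighbor residues on the same strand" as skipping residues that are adjacent in the sequence (the hypothetical `min_separation` parameter).

```python
import math

def residue_contacts(ca_coords, cutoff=6.0, min_separation=2):
    """Return index pairs (i, j) whose alpha carbons lie within `cutoff` Å.

    Pairs closer than `min_separation` positions in the sequence are skipped,
    which is an assumed reading of the neighbor-exclusion rule.
    """
    contacts = []
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                contacts.append((i, j))
    return contacts

# Toy hairpin: two antiparallel three-residue strands, 5 Å apart (made-up data).
toy_hairpin = [
    (0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
    (7.6, 5.0, 0.0), (3.8, 5.0, 0.0), (0.0, 5.0, 0.0),
]
print(residue_contacts(toy_hairpin))  # [(0, 5), (1, 4)]
```

Residues 2 and 3 are also within 6 Å of each other in this toy example, but they are sequence neighbors and are therefore not counted.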
Student's t-tests were used for statistical evaluation.
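The text does not specify which quantities were compared, so the following is only a generic sketch of a two-sample Student's t statistic with pooled variance. The function name `student_t` and the sample values are invented for illustration.

```python
import math
from statistics import mean, variance

def student_t(sample_a, sample_b):
    """Two-sample Student's t statistic with pooled variance; returns (t, df)."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_err, df

# Made-up frequencies of a candidate pattern versus a control pattern.
candidate = [3.0, 4.0, 5.0]
control = [1.0, 2.0, 3.0]
t_value, dof = student_t(candidate, control)
print(round(t_value, 4), dof)  # 2.4495 4
```

The resulting t value would then be compared with the critical value of the t distribution for the given degrees of freedom.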
Results and Discussion
The very first idea of protein folding on a nucleic acid template was the model of direct protein synthesis on the surface of dsDNA. It was suggested by George Gamow [16, 17] before the discovery of the genetic code and mRNA. Gamow noticed that the distances between base pairs in DNA and the distances between amino acids in proteins are the same (~4 Å), and he suggested that complementary base pairs formed 20 different "cavities" on the surface of the DNA (one for each amino acid) where the amino acid residues aligned and became ligated. The correct translation turned out to be mRNA-mediated and stereochemical fitting between DNA and protein residues was rejected.
However, this question arose again in a different form. Specific DNA-protein interactions do exist (such as those between DNA and transcription factors or between restriction enzymes and recognition sequences), and it is difficult to explain the extreme specificity of these interactions without assuming that there is a "small scale", "residue level" interaction between nucleic acids and proteins. I found Woese's idea of stereochemical fitting very attractive, i.e. affinity between codons and coded amino acids, in contrast to Crick's statement of a "frozen accident". I succeeded in constructing A Common Periodic Table of Codons and Amino Acids and in showing a large number of codon-amino acid co-locations in restriction enzyme-recognition sequence structures. Consequently, I support the view that the unit of specific nucleic acid-protein interactions is the codon and its amino acid.
Nucleic acids are structure-forming molecules. Perfect complementarity between Watson and Crick (WC) base pairs forms the perfect helical structure, dsDNA. However, partial or suboptimal WC complementarity in and between strands provides a large number of DNA/RNA structure variations. The structural variation of a given RNA might be large; some structures are energetically more favored, some are less. The importance of one RNA secondary structure over another is usually not a subject of debate, because RNA structure often has no known physiological significance (there are exceptions, e.g. tRNAs).
Proteins are also structure-forming molecules. However, in contrast to nucleic acids, there is no known specific amino acid complementarity, and the known physicochemical rules (charge, hydrophobicity, size compatibility) are often insufficient to define only one obvious protein folding and structure (Biro, 2005, unpublished). The limitation of Anfinsen's theorem is described by the Levinthal paradox, which is confirmed by the often frustrating outcome of ab initio protein prediction. However, we know that there is very little biological tolerance for variation in protein structure; usually only one main functioning structure is assigned to a protein sequence (and sometimes a few allosteric variants). The exact structure of a protein is critical, as is evident from our knowledge of prions. However, the primary sequence is usually insufficient to establish this exact structure and chaperons are required. The problem is not that there is a large choice of different protein folding pathways with different end-points, only one of which is physiologically normal. Rather, the problem is the risk of deviation from the (physiological) folding pathway to form any one of a number of misfolded molecules. Chaperons are needed because the sequence is insufficient to define the most effective folding pathway leading to the thermodynamically most stable structure.
Chaperons are defined as proteins whose function is to assist the folding of other proteins. However, the most obvious chaperons for me are nucleic acids; specifically, those coding the protein in question (Fig. 1). Immediate RNA-assisted protein folding prevents any protein misfolding at the site of protein synthesis itself. The insufficiency of folding information in protein sequences is more than compensated by the excess of information (codon base redundancy) in nucleic acids.
I compared the structures of mRNAs with those of the translated proteins to test the assumption that protein folding information is present in mRNA. The energy dot-plots provided by mfold and the 2D protein structures provided by SeqX indeed suggest similarity in most of the randomly selected structures (Fig. 2).
Another similar, but still not quantitative, comparison of protein and coding structures was performed on four proteins that are known to have very similar 3D structures although their primary structures (sequences) are less than 30% similar, and on the sequences of their mRNAs. These four proteins exemplify the fact that protein tertiary structure is much more conserved than the amino acid sequence. I asked whether this is also true for RNA structures and sequences. I found that there are signs of conservation even of RNA secondary structure (as indicated by the energy dot plots) and there are similarities between the protein and nucleic acid structures (Figure 3).
These structural comparisons are suggestive, but not quantitative, and a more convincing statistical evaluation is necessary to assess the significance of the suggested similarity between nucleic acid and corresponding protein structures. (Quantitative comparisons of 2D protein representations and RNA energy dot plots are possible and are in progress in our laboratory.) Similarity between two macromolecules (RNA-RNA, protein-protein, RNA-protein), or even between two macromolecular families, does not automatically mean that they are functionally related to each other (or that one is a chaperon), but it is a widely accepted sign of a biologically significant relationship.
The molecular basis of mRNA structure formation is the known WC base pair complementarity. Therefore I asked whether it is possible to find some kind of complementarity between the codons of co-locating (specifically interacting) amino acids.
To search for a pattern in the codons of co-locating amino acids, I analyzed the frequency of the eight possible complementarity patterns among the 64 nucleic acid triplets. The codons were complementary to each other either in all three (-123-) or in at least two codon base positions (-12X-, -1X3-, -X23-). In the latter cases the codon complementarity was partial, because complementarity was not required at one position (X). The complementary codons were translated in the same (5' > 3' and 3' > 5', only complementary, C) or the reversed and complementary (5' > 3' and 5' > 3', RC) directions. One (and only one) of the eight possible codon complementarity patterns turned out to be significantly overrepresented among the codons of co-locating amino acids: D-1X3/RC-3X1. The other seven patterns served as negative controls. This pattern means that the 1st and 3rd codon residues are complementary in reverse orientation, but the 2nd residue may be, but is not necessarily, complementary (X) (Fig. 4). The possible amino acid pairs determined by the D-1X3/RC-3X1 formula are indicated in Table I.
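The eight patterns can be expressed as a small test on two codons. The sketch below is an interpretation rather than the code used in the study: it assumes strict Watson-Crick pairing (no G-U wobble pairs), reads "C" as position-by-position complementarity and "RC" as complementarity against the reversed second codon, and uses names (`matches`, `POSITION_SETS`, `count_pairs`) introduced here for illustration.

```python
from itertools import product

# Strict Watson-Crick pairing is assumed; G-U wobble pairs are not counted.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

# 0-based positions of the first codon that must be complementary.
POSITION_SETS = {"123": (0, 1, 2), "12X": (0, 1), "1X3": (0, 2), "X23": (1, 2)}

def matches(codon_a, codon_b, positions, reverse):
    """True if codon_a pairs with codon_b at the given positions.

    With reverse=True (RC) position 1 of codon_a faces position 3 of codon_b;
    with reverse=False (C) the codons are compared position by position.
    """
    partner = codon_b[::-1] if reverse else codon_b
    return all(COMPLEMENT[codon_a[i]] == partner[i] for i in positions)

CODONS = ["".join(c) for c in product("UCAG", repeat=3)]

def count_pairs(positions, reverse):
    """Number of ordered codon pairs (out of 64 x 64) satisfying a pattern."""
    return sum(matches(a, b, positions, reverse) for a in CODONS for b in CODONS)

one_x_three = POSITION_SETS["1X3"]
print(matches("GCU", "AGC", one_x_three, reverse=True))  # True
print(matches("GCU", "AAC", one_x_three, reverse=True))  # True
print(matches("GCU", "GCU", one_x_three, reverse=True))  # False
print(count_pairs(one_x_three, reverse=True))            # 256
```

In the first example AGC is the full reverse complement of GCU; in the second only the 1st and 3rd bases pair, which is still sufficient for the D-1X3/RC-3X1 pattern. Because the middle base is free, each codon has four possible partners, giving 64 × 4 = 256 ordered pairs.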
This partial, suboptimal complementarity again suggests that mRNA folding may assist protein folding, but does not necessarily prove it. An alternative explanation is that it is only a sign of the biochemical origin of specifically interacting amino acid pairs (they are encoded in partially complementary codons) but does not mean that complementary structures in amino acids will form interacting protein strands.
The historical concept of specific nucleic acid – protein interactions and the subsequent possibility of RNA-assisted protein folding is illustrated in Figure 1. I wish to suggest a further development of these ideas. The distance between codons is about three times larger than the distance between amino acids and therefore complete one-to-one RNA-protein alignment is not possible. Furthermore, a long continuous alignment would create problems in dissociating the nucleoprotein complexes. Therefore I suggest that only some basic (positively charged) amino acids remain attached to their codons (or become re-attached after removal of tRNA). If this attachment point is followed by a loop in the mRNA, a corresponding loop will be formed in the nascent protein (Figure 5). The interaction between the positively charged amino acid and the negatively charged codon will be successively weakened by the growing protein loop and finally interrupted, for example, by the translation of a negatively charged amino acid. It is known that interactions between nucleic acids and proteins often involve only a few amino acids and that these "patchy" interaction sites often contain an arginine. Complex protein structures might be folded in this way (Figure 6).
The observed partial complementary coding of co-locating amino acids (the D-1X3/RC-3X1 formula) raises a series of interesting questions. The 20 amino acid – triplet codon model obviously entails the need for a third codon base (two nucleotides are simply not enough). However, based on the assumption of RNA chaperons, two proteins with identical primary structures (for example human and chimpanzee Hb) may fold differently if there are differences in the redundant codon base positions. Similarly, a number of SNPs (Single Nucleotide Polymorphisms) that do not change the coded amino acids may result in protein structure variations.
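How a silent change in the third codon base could alter the set of permitted partner codons under the D-1X3/RC-3X1 formula can be sketched as follows. This illustrates only what the formula allows and is not a prediction of folding; strict Watson-Crick pairing is assumed and the helper `partner_template` is a hypothetical name.

```python
# Strict Watson-Crick pairing is assumed; G-U wobble pairs are not counted.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def partner_template(codon):
    """Partner codons allowed by the 1X3/RC rule, written 5'->3'.

    The first base of the partner pairs with the 3rd base of `codon`, the last
    base pairs with the 1st base, and N marks the unconstrained middle base.
    """
    return COMPLEMENT[codon[2]] + "N" + COMPLEMENT[codon[0]]

# GGU and GGC both encode glycine and differ only in the wobble base.
print(partner_template("GGU"))  # ANC
print(partner_template("GGC"))  # GNC
```

In the standard genetic code the ANC codons encode Ile, Thr, Asn and Ser, whereas the GNC codons encode Val, Ala, Asp and Gly, so the two synonymous glycine codons would be compatible with different partner amino acids.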
The medical genetics literature (for example OMIM) is full of annotations concerning wobble base mutations and it is usually inferred that these "translationally silent" mutations are unlikely to cause disease. A famous exception is the prion diseases (mad cow disease, Creutzfeldt-Jakob disease). This large group of diseases is characterized by the presence of an abnormally folded protein (PrPSc) instead of the normally folded one (PrPC). The physiological and abnormal proteins have the same primary structures; only the secondary structures are different. In most cases the disease is acquired by infection, but there are many inherited forms. At least 42 known point mutations, 24 causative and 18 translationally silent, are described in the literature. The wobble base mutations demand serious attention, especially since it is known that selection pressure exists for the wobble bases in some codon positions.
The RNA chaperon theory does not mean that every wobble-base point-mutation (or SNP) influences secondary structure. Usually, many codons and amino acids are involved in the formation of a simple secondary structure element (helix, sheet, turn) and probably most mutations have no structural consequences. Also, many mutations are accompanied by a second, compensatory mutation that corrects the structural consequences of the first. In evolution, sequence changes more rapidly than structure; however, many sequence changes are compensatory and preserve local physicochemical characteristics. For example, if an amino acid side chain is particularly bulky with respect to the average at a given position in a given sequence, this might have been compensated in evolution by a particularly small side chain in a neighbouring position, preserving the general structural motif.
An additional question raised by the RNA-chaperon hypothesis concerns the GC versus AT contents of various genomes, which range from 78 / 22 to 22 / 76. This causes marked differences, especially in the compositions of the third codon nucleotides. It is reasonable to suppose that redundant codon bases are susceptible to much more variation if there is no amino acid replacement, and that if such changes affect protein folding, this would have restrained such nucleotide replacements significantly. However, this is not necessarily true. The partial complementary coding of co-locating amino acids (the D-1X3/RC-3X1 rule) suggests that the number of possible amino acid co-locations is less than 200 (20 × 20/2), and the possible co-locations involve pairings of physicochemically compatible amino acids (Biro, 2005, unpublished). Many non-silent mutations in one codon are coupled to a second (silent or non-silent) mutation in a second codon. This coupled and coordinated model of mutations actually permits a very large number of variations in the primary nucleic acid and protein sequences with no consequences for nucleic acid or protein secondary structures. And as indicated above, 3D structures are generally much more conserved than sequences.
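The set of amino acid pairs permitted by the D-1X3/RC-3X1 rule can be enumerated from the standard genetic code. The sketch below assumes strict Watson-Crick pairing and ignores stop codons; it is an independent illustration, not the procedure used to build Table I, and the function name `allowed_pairs` is introduced here.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...; '*' = stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(bases): aa
    for bases, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# Strict Watson-Crick pairing is assumed; G-U wobble pairs are not counted.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def allowed_pairs():
    """Unordered amino acid pairs having at least one 1X3/RC codon pair."""
    pairs = set()
    for a, b in product(CODON_TABLE, repeat=2):
        aa_a, aa_b = CODON_TABLE[a], CODON_TABLE[b]
        if "*" in (aa_a, aa_b):
            continue  # stop codons do not encode a residue
        if COMPLEMENT[a[0]] == b[2] and COMPLEMENT[a[2]] == b[0]:
            pairs.add(tuple(sorted((aa_a, aa_b))))
    return pairs

pairs = allowed_pairs()
met_partners = sorted({x if y == "M" else y for x, y in pairs if "M" in (x, y)})
print(len(pairs))    # number of permitted unordered pairs (self-pairs included)
print(met_partners)  # ['H', 'L', 'P', 'R']
```

Methionine, with its single codon AUG, can only be paired with codons of the form CNU (Leu, Pro, His, Arg), so a pair such as Met-Trp is excluded by the rule.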
Complementary coding of co-locating amino acids, and the consequent possibility of nucleic acid assisted protein folding (nucleic acid chaperon), might give new insights into the dilemma of why we have a redundant codon base and might explain the role of the wobble base in the codon. Experimental, in vitro support is necessary to confirm this in silico suggestion of nucleic acid chaperons.
Anfinsen CB, Redfield RR, Choate WI, Page J, Carroll WR: Studies on the gross structure, cross-linkages, and terminal sequences in ribonuclease. J Biol Chem. 1954, 207: 201-210.
Levinthal C: How to fold graciously in Mossbauer spectroscopy in biological systems. Proceedings of a Meeting held at Allerton House, Monticello, IL. Edited by: Debrunner P, Tsibris JCM, Munck E. 1969, Urbana, IL: University of Illinois Press, 22-24.
Klepeis JL, Floudas CA: ASTRO-FOLD: a combinatorial and global optimization framework for ab initio prediction of three-dimensional structures of proteins from the amino acid sequence. Biophys J. 2003, 85: 2119-2146.
Walter S, Buchner J: Molecular chaperones – cellular machines for protein folding. Angew Chem Int Ed Engl. 2002, 41: 1098-1113. 10.1002/1521-3773(20020402)41:7<1098::AID-ANIE1098>3.0.CO;2-9.
Komar AA, Kommer A, Krasheninnikov IA, Spirin AS: Cotranslational folding of globin. J Biol Chem. 1997, 272: 10646-10651. 10.1074/jbc.272.16.10646.
Thanaraj TA, Argos P: Protein secondary structural types are differentially coded on messenger RNA. Protein Sci. 1996, 5: 1973-1983.
Brunak S, Engelbrecht J: Protein structure and the sequential structure of mRNA: alpha-helix and beta-sheet signals at the nucleotide level. Proteins. 1996, 25: 237-252. 10.1002/(SICI)1097-0134(199606)25:2<237::AID-PROT9>3.3.CO;2-Y.
Gupta SK, Majumdar S, Bhattacharya TK, Ghosh TC: Studies on the relationships between the synonymous codon usage and protein secondary structural units. Biochem Biophys Res Commun. 2000, 269: 692-696. 10.1006/bbrc.2000.2351.
Chiusano ML, Alvarez-Valin F, Di Giulio M, D'Onofrio G, Ammirato G, Colonna G, Bernardi G: Second codon positions of genes and the secondary structures of proteins. Relationships and implications for the origin of the genetic code. Gene. 2000, 261: 63-69. 10.1016/S0378-1119(00)00521-7.
Gu W, Zhou T, Ma J, Sun X, Lu Z: The relationship between synonymous codon usage and protein structure in Escherichia coli and Homo sapiens. Biosystems. 2004, 73: 89-97. 10.1016/j.biosystems.2003.10.001.
Ermolaeva O: Synonymous codon usage in bacteria. Curr Issues Mol Biol. 2001, 3: 91-97.
Biro JC, Biro JM, Biro AM: Hidden massages in hidden sub-sequences: a study on collagens. 30th FEBS Congress – 9th IUBMB Conference, Budapest, Hungary, 2–7 July 2005. 2005, abstract.
Biro JC, Fördös G: SeqX: a tool to detect, analyze and visualize residue co-locations in protein and nucleic acid structures. BMC Bioinformatics. 2005, [SeqX: http://www.janbiro.com/download].
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The protein data bank. Nucleic Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235. [PDB: http://www.pdb.org/].
Zuker M: Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 2003, 31: 3406-3415. 10.1093/nar/gkg595.
Gamow G: Possible relation between deoxyribonucleic acid and protein structures. Nature. 1954, 173: 318-
Gamow G: Possible mathematical relation between deoxyribonucleic acid and proteins. Biol Med. 1954, 22: 1-13.
Crick FHC: The origin of the genetic code. J Mol Biol. 1968, 38: 367-379. 10.1016/0022-2836(68)90392-6.
Woese CR: The Genetic Code: The Molecular Basis for Gene Expression. 1967, New York: Harper & Row, 156-160.
Biro JC, Benyo B, Sansom C, Szlavecz A, Fordos G, Micsik T, Benyo Z: A common periodic table of codons and amino acids. Biochem Biophys Res Commun. 2003, 306: 408-415. 10.1016/S0006-291X(03)00974-4.
Biro JC, Biro JM: Frequent occurrence of recognition site-like sequences in the restriction endonucleases. BMC Bioinformatics. 2004, 5: 30. 10.1186/1471-2105-5-30.
Prusiner SB: Prions. PNAS. 1998, 95: 13363-13383. 10.1073/pnas.95.23.13363.
The Official Mad Cow Disease Home Page – Known point variations in human prion gene coding region – Last updated 16 Nov 2000. http://www.mad-cow.org/prion_point_mutations.html
Luck R, Steger G, Riesner D: Secondary structure of prion mRNA. J Mol Biol. 1996, 258: 813-26. 10.1006/jmbi.1996.0289.
Neher E: How frequent are correlated changes in families of protein sequences?. Proc Natl Acad Sci USA. 1994, 91: 98-102.
Declaration of competing interests
The author(s) declare that they have no competing interests.