Theoretical Biology and Medical Modelling, Open Access
Nucleic Acid Chaperons: a Theory of an RNA-assisted Protein Folding

Background: Proteins are assumed to contain all the information necessary for unambiguous folding (Anfinsen's principle). However, ab initio structure prediction is often not successful because the amino acid sequence itself is not sufficient to guide the choice between endless folding possibilities. It seems logical to look for the "missing" information in nucleic acids, in the redundant codon base.


Introduction
The protein folding problem has been one of the grand challenges in computational molecular biology. The problem is to predict the native three-dimensional structure of a protein from its amino acid sequence. It is widely believed that the amino acid sequence contains all the necessary information for the correct three-dimensional structure, since protein folding is apparently thermodynamically determined; i.e., given a proper environment, a protein will fold spontaneously to the correct conformation. This is called Anfinsen's thermodynamic principle [1].
The thermodynamic principle has been confirmed many times in vitro on many different kinds of proteins. Critics say that in vivo chemical conditions differ from those in vitro, that correct protein folding is determined by interactions with other molecules (chaperons, hormones, substrates, etc.), and that it is much more complex than renaturation of denatured polyamino acids. The fact that many naturally occurring proteins fold reliably and quickly to their native states, despite the astronomical number of possible configurations, has come to be known as Levinthal's paradox [2].
Anfinsen's principle was formulated in the 1960s using purely chemical experiments and a lot of intuition. Today, many sequences and structures are available to establish a logical and understandable link between sequence, structure and function. But it is still not possible to predict the structure (or a range of possible structures) correctly from the sequence alone, ab initio and in silico [3].
There are two potential external sources of additional and specific protein folding information: (a) the chaperons (other proteins that assist in the folding of proteins and nucleic acids) [4]; and (b) the protein-coding nucleic acid sequences themselves (which are templates for protein synthesis, but are not defined as chaperons). Protein chaperons are not necessarily similar to their clients; they can also be complementary templates, as is well known from nucleic acid interactions. However, chaperons necessarily contain spatial information (in some form) that guides another protein to fold correctly. Chaperoning requires subtle interactions with the immaturely folded intermediate so that its structure is loosened and it is then released for successive rounds of folding attempts. (Some aspects of this situation might be compared to enzyme-substrate interactions and kinetics.) The possibility that the nucleotide sequence itself could modulate translation and hence affect co-translational folding and assembly of proteins has been investigated in a number of studies [5][6][7]. Studies on the relationships between synonymous codon usage and protein secondary structural units are especially popular [8][9][10]. The genetic code is redundant (61 codons encode 20 amino acids) and as many as six synonymous codons can encode the same amino acid (Arg, Leu, Ser). The "wobble" base has no effect on the meaning of most codons, but codon usage (wobble usage) is nevertheless not randomly defined [11,12] and there are well-known, stable species-specific differences in codon usage. It seems reasonable to search for the meaning (biological purpose) of the wobble bases in association with protein folding.
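The redundancy figures quoted above (61 sense codons, 20 amino acids, six synonymous codons for Arg, Leu and Ser) can be checked with a short sketch. It assumes the standard genetic code written in the DNA alphabet; the function name `codon_degeneracy` is illustrative and does not come from the paper.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(triplet): aa for triplet, aa in zip(product(BASES, repeat=3), AMINO)
}


def codon_degeneracy():
    """Count how many sense codons encode each amino acid (stop codons excluded)."""
    return Counter(aa for aa in CODON_TABLE.values() if aa != "*")


counts = codon_degeneracy()
six_fold = sorted(aa for aa, n in counts.items() if n == 6)
print(sum(counts.values()), len(counts), six_fold)
```

The six-fold degenerate amino acids are reported in one-letter code (L, R, S), corresponding to Leu, Arg and Ser.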

Materials and methods
We have developed a tool, SeqX [13], which is specially designed to provide 2D projections of protein structures (residue contact maps) and analyze residue co-locations statistically in these structures. We have collected residue co-location statistics (residue contact statistics) from 80 different structures from the Protein Data Bank (PDB) [14]. This non-redundant SeqX data set listed ~35,000 amino acid co-locations (i.e. residues located within a 6 Å radius of the alpha carbon atoms; neighbor residues on the same strand were excluded).
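The co-location criterion described above can be sketched as follows. This is not the SeqX implementation; it is a minimal illustration that assumes one alpha-carbon coordinate per residue, a strict 6 Å cutoff, and that "neighbor residues" means residues adjacent in sequence. The function names and the toy coordinates are hypothetical.

```python
import math
from collections import Counter


def contact_pairs(ca_coords, cutoff=6.0, min_separation=2):
    """Return index pairs (i, j), i < j, whose alpha carbons are closer than `cutoff` (in Å).

    min_separation=2 skips residues adjacent in sequence (|i - j| == 1);
    treating only direct neighbours as excluded is an assumption.
    """
    pairs = []
    for i in range(len(ca_coords)):
        for j in range(i + min_separation, len(ca_coords)):
            if math.dist(ca_coords[i], ca_coords[j]) < cutoff:
                pairs.append((i, j))
    return pairs


def colocation_counts(sequence, ca_coords, cutoff=6.0):
    """Tally co-locations by unordered amino acid pair (one-letter codes)."""
    tally = Counter()
    for i, j in contact_pairs(ca_coords, cutoff):
        tally[tuple(sorted((sequence[i], sequence[j])))] += 1
    return tally


# Toy example with made-up coordinates (Å); not taken from any PDB entry.
toy_sequence = "KAGE"
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 3.8, 0.0)]
print(contact_pairs(toy_coords))
print(colocation_counts(toy_sequence, toy_coords))
```

Summing such tallies over a set of structures would give a 20 × 20 table of co-location frequencies of the kind analyzed in the Results.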
The mfold tool was used to obtain RNA structure data [15] and the energy dot-plots provided by this program were used to estimate the site and size of the most probable RNA folding.
Student's t-tests were used for statistical evaluation.
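For reference, the paired form of the t statistic and the mean ± SEM summary can be computed as sketched below. This is a generic illustration using only the Python standard library, not the software used in the study; the function names are hypothetical, and converting t to a p-value (which requires the t distribution) is left out.

```python
import math
from statistics import mean, stdev


def paired_t(xs, ys):
    """Return (t, degrees of freedom) for a paired Student's t-test on xs vs ys."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(xs, ys)]
    sem = stdev(diffs) / math.sqrt(len(diffs))  # standard error of the mean difference
    if sem == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / sem, len(diffs) - 1


def mean_sem(values):
    """Return (mean, standard error of the mean) of a sample."""
    return mean(values), stdev(values) / math.sqrt(len(values))


# Made-up numbers, for illustration only.
t, df = paired_t([2.0, 4.0, 6.0, 8.0], [1.0, 2.0, 3.0, 4.0])
print(round(t, 3), df)
```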

Results and Discussion
The very first idea of protein folding on a nucleic acid template was the model of direct protein synthesis on the surface of dsDNA. It was suggested by George Gamow [16,17] before the discovery of the genetic code and mRNA. Gamow noticed that the distances between base pairs in DNA and the distances between amino acids in proteins are the same (~4 Å), and he suggested that complementary base pairs formed 20 different "cavities" on the surface of the DNA (one for each amino acid) where the amino acid residues aligned and became ligated. The correct translation turned out to be mRNA-mediated and stereochemical fitting between DNA and protein residues was rejected [18].
However, this question arose again in a different form. Specific DNA-protein interactions do exist (such as those between DNA and transcription factors or between restriction enzymes and recognition sequences), and it is difficult to explain the extreme specificity of these interactions without assuming that there is a "small scale", "residue level" interaction between nucleic acids and proteins. I found Woese's idea [19] of stereochemical fitting very attractive, i.e. affinity between codons and coded amino acids, in contrast to Crick's statement of a "frozen accident" [18]. I succeeded in constructing A Common Periodic Table of Codons and Amino Acids [20] and in showing a large number of codon-amino acid co-locations in restriction enzyme-recognition sequence structures [21]. Consequently, I support the view that the unit of specific nucleic acid-protein interactions is the codon and its amino acid.
Nucleic acids are structure-forming molecules. Perfect complementarity between Watson and Crick (WC) base pairs forms the perfect helical structure, dsDNA. However, partial or suboptimal WC complementarity in and between strands provides a large number of DNA/RNA structure variations. The structural variation of a given RNA might be large; some structures are energetically more favored, some are less. The importance of one RNA secondary structure over another is usually not a subject of debate, because RNA structure often has no known physiological significance (there are exceptions, e.g. tRNAs).
Figure 1. Nucleo-protein structures and protein foldings. The distance between bases in nucleic acids (horizontal blue lines) and amino acids in proteins (red dots) is almost the same (A). This suggests the possibility of residue-level interactions between these molecules (B). Partial, sub-optimal complementarity between DNA strands ("honeycomb structure") and fitting of amino acids into DNA cavities was suggested by Gamow [16,17] (C, D). A further development of this model is that partial complementarity between mRNA subsequences (E, F) determines the orientation of amino acid residues in ribonucleoprotein complexes and consequently the RNA loops serve as templates (RNA chaperons) to main secondary protein structures such as alpha helices (G) and beta-sheets (H). Codon boundaries are not indicated in these models. The Figure illustrates the historical development of the concept of direct and specific nucleic acid-protein interactions and its possible consequences for protein folding.
Proteins are also structure-forming molecules. However, in contrast to nucleic acids, there is no known specific amino acid complementarity, and the known physicochemical rules (charge, hydrophobicity, size compatibility) are often insufficient to define only one obvious protein folding and structure (Biro, 2005, unpublished). The limitation of Anfinsen's theorem [1] is described by the Levinthal paradox [2], which is confirmed by the often frustrating outcome of ab initio protein prediction. However, we know that there is very little biological tolerance for variation in protein structure; usually only one main functioning structure is assigned to a protein sequence (and sometimes a few allosteric variants). The exact structure of a protein is critical, as is evident from our knowledge of prions. However, the primary sequence is usually insufficient to establish this exact structure and chaperons are required. The problem is not that there is a large choice of different protein folding pathways with different endpoints, only one of which is physiologically normal. Rather, the problem is the risk of deviation from the (physiological) folding pathway to form any one of a number of misfolded molecules. Chaperons are needed because the sequence is insufficient to define the most effective folding pathway leading to the thermodynamically most stable structure.
Chaperons are defined as proteins whose function is to assist the folding of other proteins. However, the most obvious chaperons for me are nucleic acids; specifically, those coding the protein in question (Fig. 1). Immediate RNA-assisted protein folding would prevent protein misfolding at the site of protein synthesis itself. The insufficiency of folding information in protein sequences is more than compensated by the excess of information (codon base redundancy) in nucleic acids.
I compared the structures of mRNAs with those of the translated proteins to test the assumption that protein folding information is present in mRNA. The energy dot plots provided by mfold and the 2D protein structures provided by SeqX indeed suggest similarity in most of the randomly selected structures (Fig. 2). Another similar, but still not quantitative, comparison of protein and coding structures was performed on four proteins that are known to have very similar 3D structures although their primary structures (sequences) are less than 30% similar, and on the sequences of their mRNAs. These four proteins exemplify the fact that protein tertiary structure is much more conserved than the amino acid sequence. I asked whether this is also true for RNA structures and sequences. I found that there are signs of conservation even of RNA secondary structure (as indicated by the energy dot plots) and there are similarities between the protein and nucleic acid structures (Figure 3). These structural comparisons are suggestive, but not quantitative, and more convincing statistical evaluation is necessary to establish the significance of the suggested similarity between nucleic acid and corresponding protein structures. (Quantitative comparisons of 2D protein representations and RNA energy dot plots are possible and are in progress in our laboratory.) Similarities between two macromolecules (RNA-RNA, protein-protein, RNA-protein), or even between two macromolecular families, do not automatically mean that they are functionally related to each other (or that one is a chaperon), but they are a widely accepted sign of a biologically significant relationship.
The molecular basis of mRNA structure formation is the known WC base pair complementarity. Therefore I asked whether it is possible to find some kind of complementarity between the codons of co-locating (specifically interacting) amino acids.
Searching for a pattern in the codons of co-locating amino acids, I analyzed the frequency of the eight possible patterns in the 64 nucleic acid triplets. The codons were complementary to each other either in all three (-123-) or in at least two codon base positions (-12X-, -1X3-, -X23-). In the latter cases the codon complementarity was partial, because complementarity was not required for one position (X). The complementary codons were translated in the same (5' > 3' and 3' > 5', only complementary, C) or the reversed and complementary (5' > 3' and 5' > 3', RC) directions. One (and only one) of the eight possible codon complementarity patterns turned out to be significantly overrepresented among the codons of co-locating amino acids: D-1X3/RC-3X1. The other seven possible codon patterns served as negative controls. This pattern means that the 1st and 3rd codon residues are complementary in reverse orientation, while the 2nd residue may be, but is not necessarily, complementary (X) (Fig. 4). The possible amino acid pairs determined by the D-1X3/RC-3X1 formula are indicated in Table I. This partial, suboptimal complementarity again suggests that mRNA folding may assist protein folding, but does not necessarily prove it. An alternative explanation is that it is only a sign of the biochemical origin of specifically interacting amino acid pairs (they are encoded in partially complementary codons) but does not mean that complementary structures in amino acids will form interacting protein strands.
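The D-1X3/RC-3X1 pattern can be made concrete with a small sketch. It rests on one reading of the rule, which is an assumption: with both codons written 5' > 3', the 1st base of one codon must be WC-complementary to the 3rd base of the other and vice versa, while the 2nd bases are unconstrained. The standard genetic code in the DNA alphabet is used, and all function names are illustrative rather than taken from the paper or from SeqX.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(triplet): aa for triplet, aa in zip(product(BASES, repeat=3), AMINO)
}
WC_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def matches_rc_1x3(codon_a, codon_b):
    """True if positions 1 and 3 are reverse-complementary; position 2 is ignored (X)."""
    return (WC_PAIR[codon_a[0]] == codon_b[2]
            and WC_PAIR[codon_a[2]] == codon_b[0])


def amino_acid_pairs_rc_1x3():
    """Unordered amino acid pairs having at least one codon pair that fits the pattern."""
    pairs = set()
    for codon_a, codon_b in product(CODON_TABLE, repeat=2):
        aa_a, aa_b = CODON_TABLE[codon_a], CODON_TABLE[codon_b]
        if "*" in (aa_a, aa_b):  # skip stop codons
            continue
        if matches_rc_1x3(codon_a, codon_b):
            pairs.add(tuple(sorted((aa_a, aa_b))))
    return pairs


print(matches_rc_1x3("ATG", "CGT"))   # middle base differs, outer bases pair
print(("F", "K") in amino_acid_pairs_rc_1x3())
```

Under this reading, a co-location would be scored as positive when the two residues form one of the returned pairs and as negative otherwise.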
The historical concept of specific nucleic acid-protein interactions and the subsequent possibility of RNA-assisted protein folding was illustrated in Figure 1. I wish to suggest a further development of these ideas. The distance between codons is about three times larger than the distance between amino acids and therefore complete one-to-one RNA-protein alignment is not possible. Furthermore, a long continuous alignment would create problems in dissociating the nucleoprotein complexes. Therefore I suggest that only some basic (positively charged) amino acids remain attached to their codons (or become re-attached after removal of tRNA). If this attachment point is followed by a loop in the mRNA, a corresponding loop will be formed in the nascent protein (Figure 5). The interaction between the positively charged amino acid and the negatively charged codon will be successively weakened by the growing protein loop and finally interrupted, for example, by the translation of a negatively charged amino acid. It is known that interactions between nucleic acids and proteins often involve only a few amino acids and that these "patchy" interaction sites often contain an arginine [21]. Complex protein structures might be folded in this way (Figure 6).
[Figure: Comparison of 12 randomly selected protein and corresponding mRNA structures]
The observed partial complementary coding of co-locating amino acids (the D_1X3/RC_3X1 formula) raises a series of interesting questions. The 20 amino acid-triplet codon model obviously entails the need for a third codon base (two nucleotides give only 4 × 4 = 16 combinations, which is simply not enough). However, based on the assumption of RNA chaperons, two proteins with identical primary structures (for example human and chimpanzee Hb) may fold differently if there are differences in the redundant codon base positions. Similarly, a number of SNPs (Single Nucleotide Polymorphisms) that do not change the coded amino acids may result in protein structure variations.

The medical genetics literature (for example OMIM) is full of annotations concerning wobble base mutations and it is usually inferred that these "translationally silent" mutations are unlikely to cause disease. A famous exception is the prion diseases (mad cow disease, Creutzfeldt-Jakob disease [22]). This large group of diseases is characterized by the presence of an abnormally folded protein (PrPSc) instead of the normally folded one (PrPC). The physiological and abnormal proteins have the same primary structures; only the secondary structures are different. In most cases the disease is acquired by infection, but there are many inherited forms. At least 42 known point mutations, 24 causative and 18 translationally silent, are described in the literature [23]. The wobble base mutations demand serious attention, especially since it is known that selection pressure exists for the wobble bases in some codon positions [24].
Figure 4. Complementary codes vs. amino acid co-locations. (A) The propensity of the 400 possible amino acid pairs was monitored in 80 different protein structures with the SeqX tool. The tool detected co-locations when two amino acids were closer than 6 Å to each other (neighbors on the same strand were excluded). The total number of co-locations was 34,630. Eight different complementary codes were constructed for the codons (two optimal and six suboptimal). In the two optimal codes all three codon residues (123) were complementary (C) or reverse-complementary (RC) to each other. In the suboptimal codes only two of three codon residues were C or RC to each other (12, 13, 23), while the third was not necessarily complementary (X). (For example, complementary code RC_1X3 means that the first and third codon letters are always complementary, but not the second, and the possible codons are read in reverse orientation.) The 400 co-locations were divided into 20 subgroups corresponding to 20 amino acids (one of the co-locating pairs), each group containing the 20 amino acids (corresponding to the other amino acid in the co-locating pair). If the codons of the amino acid pairs followed the predefined complementary code, the co-location was regarded as positive (P); if not, the co-location was regarded as negative (N). Each symbol represents the mean frequency of P or N co-locations corresponding to the indicated amino acid. Paired Student's t-test, n = 20 (see Fig. 2 for explanation). (B) The ratio of positive (P) and negative (N) co-locations was calculated on data from (A). Each bar represents the mean ± SEM, n = 20.
The RNA chaperon theory does not mean that every wobble-base point-mutation (or SNP) influences secondary structure. Usually, many codons and amino acids are involved in the formation of a simple secondary structure element (helix, sheet, turn) and probably most mutations have no structural consequences. Also, many mutations are accompanied by a second, compensatory mutation that corrects the structural consequences of the first. In evolution, sequence changes more rapidly than structure; however, many sequence changes are compensatory and preserve local physicochemical characteristics. For example, if an amino acid side chain is particularly bulky with respect to the average at a given position in a given sequence, this might have been compensated in evolution by a particularly small side chain in a neighbouring position, preserving the general structural motif [25].
An additional question raised by the RNA-chaperon hypothesis concerns the GC versus AT contents of various genomes, which range from 78 / 22 to 22 / 76. This causes marked differences, especially in the compositions of the third codon nucleotides. It is reasonable to suppose that redundant codon bases are susceptible to much more variation if there is no amino acid replacement, and that if such changes affect protein folding, this would have restrained such nucleotide replacements significantly. However, this is not necessarily true. The partial complementary coding of co-locating amino acids (the D_1X3/RC_3X1 rule) suggests that the number of possible amino acid co-locations is less than 200 (20 × 20/2), and the possible co-locations involve pairings of physicochemically compatible amino acids (Biro, 2005, unpublished). Many non-silent mutations in one codon are coupled to a second (silent or non-silent) mutation in a second codon. This coupled and coordinated model of mutations actually permits a very large number of variations in the primary nucleic acid and protein sequences with no consequences for nucleic acid or protein secondary structures. And as indicated above, 3D structures are generally much more conserved than sequences.
When translation proceeds to an amino acid with especially high affinity to the mRNA-attached arginine, for example a negatively charged Glu or Asp [(-) and blue dot], the charge attraction removes the Arg from its mRNA binding site and the entire protein is released from the mRNA and completes a protein loop (C). The protein continues to grow toward the direction of its carboxy terminal (COOH).