A more complex isoleucine aptamer with a cognate triplet.

The simplest RNA that can meet a column affinity selection for isoleucine was previously defined using selection amplification with decreasing numbers of randomized nucleotides. This simplest UAUU motif was a small asymmetric internal loop. Conserved positions of the loop include isoleucine codon and anticodon triplets (Lozupone, C., Changayil, S., Majerfeld, I., and Yarus, M. (2003) RNA (N. Y.) 9, 1315-1322). Using new primer sequences, we now select a somewhat more complex isoleucine-binding RNA, requiring 4.7 more bits of information to describe. The newly selected structure is a terminal or hairpin loop of 20 nucleotides, 15 being invariant. An information profile shows that the new binding site contains five short functional loop regions joined by less significant single nucleotide positions. Among the important nucleotides is a conserved isoleucine anticodon, supporting the escaped triplet theory, which posits a stereochemical genetic code originating in RNA amino acid binding sites.

Randomized RNA sequences readily fold into a variety of shapes with different functions. These functions include catalysis (ribozymes) and ligand binding (aptamers). Such activities among randomized RNAs supplement naturally existing ribozymes (1), the ribosome (2), and natural aptamers, e.g. riboswitches (3,4), to support the RNA World hypothesis (5), which would require RNA to take on functions that are currently unusual but essential to support the life of a ribocyte.
Among important functions is amino acid binding, which was arguably a crucial activity during the era when encoded polypeptide synthesis (translation) first appeared. Varied aliphatic, aromatic, polar, and basic amino acids are bound by RNA. For example, using in vitro selection or SELEX (6-8), many artificial amino acid binding RNAs have been recovered, including those for tryptophan (9), arginine (10-14), valine (15), isoleucine (16), tyrosine (17), phenylalanine (18), histidine,¹ tryptophan,² and leucine.³ In these binding sites, cognate codons and anticodons appear improbably concentrated, with a probability of $5.4 \times 10^{-11}$ on the assumption that affinity and coding sequences are unrelated (19-21). Accordingly, amino acid binding sites and cognate amino acid triplets are not unrelated but instead strongly co-related. Therefore, we suppose that coding triplets were abstracted from larger ancestral RNAs that contained amino acid binding sites (21).
Aptamer selections can be constructed to favor the simplest molecular solutions that answer a particular selection regime. Such a "simplest" isoleucine-binding RNA has been selected (16,22) using two different sets of primers. Here we report a novel isoleucine-binding RNA selected using different conditions. It is more complex than the simplest motif, 37.9 compared with 33.2 bits of information. Accordingly, it is our next simplest isoleucine binding structure. Notably, like the UAUU motif, which contains an isoleucine anticodon and codon, the new next simplest motif also has an isoleucine anticodon triplet among the functional nucleotides of the amino acid binding site.

MATERIALS AND METHODS
Information Content-Frequencies ($F_i$) of bases obtained from sequence alignment after doped selection of R431 were used as estimated nucleotide probabilities ($P_i$). The final frequencies were normalized and adjusted to the skewed initial nucleotide probability at each position (0.7, 0.1, 0.1, and 0.1). To normalize output frequencies, each $F_i$ after the selection was divided by the $F_i$ before selection. Subsequently, for further calculations, each normalized $F_i$ was divided by the sum of the normalized $F_i$ values. This method has been adopted from Carothers et al. (23).
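The normalization described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `normalize_frequencies` and the example post-selection frequencies are hypothetical, and only the initial doping probabilities (0.7, 0.1, 0.1, 0.1) come from the text.

```python
def normalize_frequencies(selected, initial):
    """Divide each post-selection frequency by its pre-selection
    frequency, then rescale the ratios so that they sum to 1."""
    ratios = {base: selected[base] / initial[base] for base in selected}
    total = sum(ratios.values())
    return {base: ratio / total for base, ratio in ratios.items()}

# Initial doping at a position whose original base is A (from the text).
initial = {"A": 0.7, "G": 0.1, "C": 0.1, "U": 0.1}
# Hypothetical frequencies observed after selection (illustration only).
selected = {"A": 0.9, "G": 0.1, "C": 0.0, "U": 0.0}

probs = normalize_frequencies(selected, initial)
print(probs)  # A: 0.5625, G: 0.4375, C: 0.0, U: 0.0
```

The example shows why the adjustment matters: a base that was heavily favored in the doped pool contributes less evidence of conservation than its raw frequency suggests.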
The observed Shannon uncertainty, $H_{obs}$, was calculated for each nucleotide position of each motif using the equation: $H_{obs} = -\sum_i P_i \log_2 P_i$ ($i$ = A, G, C, or U). The calculated $H_{obs}$ was subsequently subtracted from the maximum uncertainty possible, $H_{max}$ (24). For random RNA, $H_{max}$ is 2 bits/nucleotide position. Sampling error corrections were made for the information content calculations of both motifs.
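A short sketch of the per-position calculation, before any sampling correction, may help. The function names are illustrative, and the three probability vectors are made-up limiting cases rather than data from the alignment.

```python
import math

H_MAX = 2.0  # bits per position for four equiprobable RNA bases

def shannon_uncertainty(probs):
    """H_obs = -sum(P_i * log2(P_i)); terms with P_i = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_content(probs):
    """Information in bits at one position: H_max - H_obs."""
    return H_MAX - shannon_uncertainty(probs)

invariant = information_content([1.0, 0.0, 0.0, 0.0])       # 2 bits
two_bases = information_content([0.5, 0.5, 0.0, 0.0])       # 1 bit
unselected = information_content([0.25, 0.25, 0.25, 0.25])  # 0 bits
```

An invariant position therefore carries the full 2 bits, a position that tolerates two bases equally carries 1 bit, and a position with no preference carries none.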
For the UAUU motif we used an approximate method to calculate the expectation of sampled uncertainty, $AE(H_{nb}) = H_{max} - (s - 1)/(2n \ln 2)$, where $s$ is the number of symbols (4 for RNA), and $n$ is the number of analyzed sequences.
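As a worked instance of the approximate correction, the sketch below evaluates the formula for n = 72, the number of UAUU sequences reported in the Results. The function name is hypothetical.

```python
import math

def approx_expected_uncertainty(n, symbols=4):
    """AE(H_nb) = H_max - (s - 1) / (2 * n * ln 2)."""
    h_max = math.log2(symbols)
    return h_max - (symbols - 1) / (2 * n * math.log(2))

ae_72 = approx_expected_uncertainty(72)
print(round(ae_72, 4))  # about 1.9699 bits, i.e. a correction of ~0.03 bit
```

With this many sequences the correction is small, roughly 0.03 bit per position.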
For the newly selected motif we applied an exact method of calculation of the expectation of sampled uncertainty, $E(H_{nb}) = \sum P_{nb} H_{nb}$, where $P_{nb} = \frac{n!}{n_a!\, n_g!\, n_c!\, n_u!} P_a^{n_a} P_g^{n_g} P_c^{n_c} P_u^{n_u}$ and $H_{nb} = -\sum_b (n_b/n) \log_2 (n_b/n)$. Here $n$ is the total number of bases in a given position in the sequence alignment ($n = n_a + n_g + n_c + n_u$); analogously, $n_b$ is the number of a given base, and $n_a$, $n_g$, $n_c$, and $n_u$ are the numbers of A, G, C, and U, respectively (24).
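The exact expectation can be computed by enumerating every possible base composition of n sequences. The sketch below is one way to do this under stated assumptions: the background probabilities default to 0.25 each (the text does not say which values were used), all probabilities must be nonzero, and the function names are illustrative. Log-gamma is used only to keep the multinomial weights numerically stable.

```python
import math

def sample_uncertainty(counts):
    """H_nb = -sum((n_b/n) * log2(n_b/n)) for one composition."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def expected_uncertainty(n, probs=(0.25, 0.25, 0.25, 0.25)):
    """E(H_nb): sum over all compositions (na, ng, nc, nu) of the
    multinomial probability P_nb times the sample uncertainty H_nb."""
    log_p = [math.log(p) for p in probs]  # assumes every p > 0
    log_n_fact = math.lgamma(n + 1)
    total = 0.0
    for na in range(n + 1):
        for ng in range(n - na + 1):
            for nc in range(n - na - ng + 1):
                counts = (na, ng, nc, n - na - ng - nc)
                log_weight = (
                    log_n_fact
                    - sum(math.lgamma(c + 1) for c in counts)
                    + sum(c * lp for c, lp in zip(counts, log_p))
                )
                total += math.exp(log_weight) * sample_uncertainty(counts)
    return total

e_61 = expected_uncertainty(61)  # 61 sequences, as in the doped reselection
```

For two sequences the expectation is 0.75 bit (a 3/4 chance that the two bases differ, giving 1 bit), which is a convenient check on the enumeration.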
For parts of stems that conserve only the ability to pair, shown in Fig. 6 as a black N, information content was computed as follows. There are 16 possible 2-nucleotide combinations at base-paired positions, so the maximum uncertainty is 4 bits/Watson-Crick base pair. For standard base pairings, uncertainty is 2 bits, so the information content is 2 bits (4 bits - 2 bits). We take G-U base pairs to also be acceptable, so 6 of 16 possibilities are "pairs." The information content therefore is taken to be 1.5 bits per required base pair. For base pairs that showed preferences in nucleotide combination (colored pictograms in Fig. 6), information content was calculated based on observed frequencies among the 16 possibilities, as for dinucleotide probability.
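The base-pair figures can be reproduced with a small calculation that assumes every acceptable pair is equally likely. That uniform assumption and the function name are ours. Under it, six acceptable pairs give about 1.4 bits, close to the 1.5 bits adopted in the text.

```python
import math

def pair_information(allowed_pairs, possible_pairs=16):
    """Bits of information when only `allowed_pairs` of the
    `possible_pairs` dinucleotides are accepted, each equally likely."""
    h_max = math.log2(possible_pairs)  # 4 bits for 16 combinations
    h_obs = math.log2(allowed_pairs)
    return h_max - h_obs

watson_crick = pair_information(4)  # A-U, U-A, G-C, C-G -> 2.0 bits
with_wobble = pair_information(6)   # plus G-U and U-G -> about 1.415 bits
```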
Isoleucine Selection-Isoleucine affinity chromatography on isoleucine-EAH-Sepharose 4B (containing 10 mM L-isoleucine) has been used previously (16,22) and is also described in detail elsewhere (25). The selection buffer contained 50 mM HEPES (pH 7.0), 300 mM NaCl, 7.5 mM MgCl$_2$, and 0.1 mM ZnCl$_2$. The elution buffer contained in addition 10 mM L-isoleucine. The column was washed after selection with 15 volumes of wash buffer, 1 M NaCl and 2 mM EDTA. For counterselection, Sepharose blocked with acetyl groups instead of isoleucine was used. RNA was folded before applying to the column as follows: 65°C for 3 min in water, adjusted to selection buffer, heating continued at 65°C for the next 45 s, and subsequently incubated at room temperature for 10 min. Weakly bound RNA was eluted with 5.5-7 column volumes of column buffer. RNA retained on the column was eluted with 5 volumes of elution buffer (column buffer containing 10 mM L-isoleucine). Five fractions (2.5 column volumes) were collected and carried forward to amplification. New primers were used: 5′-TAATACGACTCACTATAGGCAAGTAGGTAGCGTCGAA(28N)AAAAGAGCAAGTCGGTTAGA-3′, where the T7 promoter sequence is underlined. 1.66 pmol ($10^{12}$ molecules) of single-stranded DNA was used for the first PCR for isoleucine selection.

* This work was supported by Grant GM 48080 from the National Institutes of Health research and NASA via the Center for Astrobiology NCC2-1052. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18

Isoleucine Elution Profiles-Internally labeled ([α-$^{32}$P]GTP) RNA (~50 pmol) was folded as described for isoleucine selection. RNA was applied to the isoleucine-Sepharose column. The same buffers were used.
Determination of $K_D$-For $K_D$ determination the same buffers as described for isoleucine selection were used, but Zn$^{2+}$ was increased from 0.1 to 0.4 mM, to compare with previous data (22). Isocratic competitive affinity chromatography has been described previously (11,26).
Interference and Chemical Modification-RNA was modified after folding. DMS or CMCT modifications were carried out based on (27). DMS⁴ buffer was as follows: 80 mM HEPES (pH 7.0), 100 mM NaCl, 7.5 mM MgCl$_2$, 0.1 mM ZnCl$_2$. CMCT buffer was the same but contained borate (pH 8.0) instead of HEPES. Two concentrations of isoleucine (2 or 10 mM) were used, with isoleucine-free buffer as a control.

FIG. 1. Selected sequences and aptamer activity.
A, alignment of initially randomized nucleotides with isoleucine motif. B, sequence variation within the hairpin isoleucine binding motif (based on 35 sequences). Completely conserved nucleotide positions are boldface italic. C, test of five isolates including the most frequent R431 for binding and elution on an isoleucine-EAH-Sepharose column.
CMCT as described above. RNA after modification was purified on a P30 column (Bio-Rad). RNA was subsequently applied to isoleucine-EAH-Sepharose column. 1.5 or 2.5 volumes were collected from the void or elution peak, respectively; the same buffers as described for isoleucine selection were used. The primer extension method was used to identify modifications.
Boundary Experiment-5′-End-labeled ([γ-$^{32}$P]ATP) R431 was partially hydrolyzed using 50 mM NaOH for 30 s at 85°C (28). The reaction was quenched by addition of sodium acetate to a final concentration of 320 mM. RNA was precipitated with EtOH, dissolved in water, folded as described under "Materials and Methods" for isoleucine selection, and applied to an isoleucine-EAH-Sepharose column. Flow-through and eluted fractions were collected and analyzed on a denaturing polyacrylamide gel.

RESULTS
An RNA with 28 adjacent, initially randomized positions was subjected to isoleucine affinity selection using the previous affinity chromatography protocol (22).
A step-by-step description of this selection for isoleucine-binding aptamers is available (25). An initial RNA pool containing $10^{12}$ unique sequences present in ~2400 copies each, on average, was applied to a carboxyl-linked isoleucine-EAH-Sepharose column. The column was washed with 5.5-7 column volumes of buffer. Bound RNA was eluted with 2.5 volumes of the same buffer with 10 mM L-isoleucine. RNA was reverse-transcribed, amplified by PCR, and T7-transcribed to obtain the next generation of RNA. Negative selection was included after the first cycle by taking the first 60% emerging from an acetylated control column for use in the subsequent affinity selection.
After the fifth cycle an initial elution peak appeared (11.8% of applied RNA), and elution increased gradually to the seventh cycle (18.5% of applied RNA). The isoleucine-eluted population after seven cycles was cloned, sequenced, and aligned with ClustalW (29). A total of 69 sequences was compared. A new isoleucine motif dominated the selected sequences. The frequency of this aptamer was 50.8% of the sequenced pool. Positional variation suggests at least 10 different parental sequences containing the motif were present in the initial pool (Fig. 1A). The percentage composition for each nucleotide position within the new motif is summarized in Fig. 1B. Binding and elution profiles for five RNAs are shown in Fig. 1C.
For the most frequently appearing sequence (R431), $K_D$ is 0.9 mM. This is ~2-fold lower than that of the UAUU motif selected previously (22). In contrast with every other selection for isoleucine-binding RNA (16,22), there was only one sequence (1.4%) that fitted the UAUU motif: UACG(N8)CUAUUGGG (conserved nucleotide positions are italic, boldfaced, and underlined). This sequence, however, does not significantly bind the affinity column. Mfold (30) predicts that this sequence does not stably fold the asymmetrical loop of the UAUU motif (not shown). It may therefore be present in active form only a small fraction of the time. We also recovered nonspecific column-binding sequences isolated previously. These shared the sequence GUACGCCU (VG2 of Lozupone et al. (22)), which was recovered 7 times (not shown).
New Isoleucine Motif-To define the minimal active length of the newly selected RNA we carried out a boundary experiment, randomly truncating 5′-labeled RNA with sodium hydroxide (Fig. 2). This procedure delimited a minimal functional RNA ending at nucleotide 61. Based on the observation that the truncated molecule does not require the last 7 nucleotides from the 3′-end, it appears that the corresponding part of the initial stem is dispensable; it can be shortened to 9 base pairs with a one-nucleotide bulge (compare Fig. 3C with 5). Accordingly, a predicted smaller active structure was 39 nucleotides (G23-C61; the 3′-boundary is marked on the thermodynamically favored structure in Fig. 5). This surmise was confirmed by construction of a functional site spanning the terminal loop, 41-R431, with a stem lengthened by one base pair from the assumed minimal structure. The original pair C22-G62 was converted to G-C (Fig. 3C) to aid T7 RNA polymerase transcription.
Shortening R431 left a functional aptamer. We compared column affinity for minimal structures having both motifs. Profiles of binding of 41-R431 and a shortened version of the UAUU motif (36-UAUU) are similar (Fig. 3A). Secondary structures of both RNAs calculated by mfold (30) are compared in Fig. 3, B and C.
The binding site was more closely defined by chemical evidence. A modification-interference assay with CMCT (U and G) or DMS (A and C) indicates that the tract 34 GGUAUUG 40 and positions A42, A43 and 45 GA 46 are crucial for binding (Fig. 4B). These positions are also marked on the structure shown in Fig. 5.
In addition, protection/enhancement of the accessibility of site nucleotides to DMS and CMCT in the presence of isoleucine ligand was surveyed. CMCT chemical probing shows enhancements at positions U36, U39, G45, G47, and U48 in the presence of 10 mM L-isoleucine (data not shown). Positions showing enhancements are also likely to be close to the site of contact with isoleucine ligand, and are marked in Fig. 5. For the case of DMS probing no significant differences in reactivity were detected at 0, 2, and 10 mM isoleucine. Viewed on the hairpin structure in Fig. 5, the chemically implicated nucleotides are all conserved members of the hairpin loop.
Doped reselection of R431, which has the most frequently recovered sequence, was undertaken to confirm these essential positions. 39 nucleotides spanning the loop were mutagenized (positions G23-C61 in Fig. 5). Starting from $10^{15}$ molecules of single-stranded DNA, 3 cycles of selection were applied. The initial amount of DNA should have been enough to cover all possible sequences that are ≤8 mutational changes different from the original one. For comparison, the total number of sequences that are ≤12 mutational changes different from the original one is $2.4 \times 10^{15}$ (31). After the third cycle, isoleucine elution reached 41% of RNA; this pool was cloned and sequenced. Comparison of 61 sequences showed that A50, A51, and A52 are essential elements, because they are perfectly conserved. These three adenosines were supplied as a part of primer complement sequences in the initial selection (Fig. 5, lowercase letters). The rest of the conserved nucleotide positions were as in the initial selection (compare Fig. 1B with 6A). In summary, 15 of 20 positions in the terminal loop are invariant; the consensus sequence of the loop is 33 CGGUAAUGNRAUGANUHAAA 52 (R = G or A; H = A, C, or U). From these data, we calculated information for each mutagenized nucleotide position (24) as Information Content = $H_{max} - H_{obs}$ (in bits/nucleotide position), where $H_{max}$ is the maximum Shannon uncertainty possible (2 bits for RNA), and $H_{obs}$ is the observed Shannon uncertainty calculated based on nucleotide frequencies in the alignment of sequences of binding aptamers. Maps of the information content for the newly selected hairpin (Fig. 6A) and the UAUU internal loop motif (Fig. 6B) are shown. The secondary structure used for the new motif was the most probable structure calculated by BayesFold (32) based on the alignment of 54 sequences of equal length obtained from doped selection.
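The size of the mutational neighborhood quoted above can be checked by direct counting: a sequence with exactly k changes is specified by choosing k of the 39 mutagenized positions and one of 3 alternative bases at each. The sketch below uses this standard count. The function name is illustrative, and the calculation is not necessarily the one used in reference 31.

```python
from math import comb

def sequences_within(length, max_changes, alternatives=3):
    """Number of distinct sequences differing from a reference of the
    given length at no more than `max_changes` positions."""
    return sum(
        comb(length, k) * alternatives**k for k in range(max_changes + 1)
    )

within_12 = sequences_within(39, 12)
print(f"{within_12:.2e}")  # about 2.4e15, matching the figure in the text
```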
The calculation of information content was done based on 61 sequences with initial nucleotide probabilities of 0.7, 0.1, 0.1, and 0.1. The differences in the height of characters within a given nucleotide position were computed by the web script RNA Structure Logo (33) to reflect the calculated information content.
There is evidently high information content in the loop and the first base pair of the stem, where a C-G pair is preferred over other pairs (including G-C). According to our calculations, the rest of the stem is probably required only for loop stabilization. Finally, the loop has five informational regions, 33 CGGUAAUG 40, 42 RA 43, 45 GA 46, 48U, and 50 AAA 52 (possibly "site nucleotides"; Fig. 6A), which are separated by single, much less informationally significant positions (possibly "joints"). The secondary structure map of the UAUU motif was also computed from reselection data (22). The information content for the UAUU motif was based on 72 sequences with initial nucleotide probabilities of 0.25, 0.25, 0.25, and 0.25 (Fig. 6B). The approximate information content for the newly selected binding site including necessary stems is 37.9 bits versus 33.2 bits for the original UAUU internal loop motif.
As does the UAUU motif, the hairpin requires low levels (0.1-0.4 mM) of zinc ions for amino acid affinity (not shown). Finally, with regard to specificity, the hairpin motif does not distinguish isoleucine and leucine (Fig. 7). Even for glycine, some elution is seen, although most of the RNA sticks to the column until the high salt wash (Fig. 7). Leucine and glycine affinity suggest that the site is not highly focused on the details of the isoleucine side chain; certainly it is less side chain-specific than the original UAUU motif (16,22).

DISCUSSION
We have selected a new aptamer, a hairpin with a 20-nucleotide loop (Figs. 5 and 6A) within which are all the essential nucleotides for an isoleucine binding site. Thus the new motif likely poises loop nucleotides to create a site for the ligand. This site binds L-isoleucine, although with affinity for similar amino acids (Fig. 7). However, considering that this is the second smallest isoleucine site we have selected, we note that the site embraces a conserved isoleucine anticodon sequence, AAU. Therefore, this is another example of a simple RNA aptamer that contains coding triplets for a bound amino acid, consistent with the escaped triplet theory of the origin of the genetic code (21).
The force of this observation is less than usual, however. We have usually required side chain-specific affinity for cognate amino acids, reasoning that specificity would be essential to a site used for coding; e.g. see Illangasekare and Yarus (18). Side chain specificity usually appears as an unselected by-product of selection for a particular amino acid, but this new hairpin is not as selective for isoleucine (Fig. 7) as was the UAUU motif (22). Thus, the new motif might be offered as experimental support for ambiguous primordial coding among chemically similar amino acids (Fig. 7). This is a possibility often advanced on purely theoretical grounds but also is implicit in a stereochemical origin for the code like the escaped triplet rationale.
The newly selected aptamer does not take functional advantage of its somewhat larger size; its $K_D$ for isoleucine is only ~2-fold lower with 4.7 bits more information, a lesser dependence on size than suggested previously (23). However, individual aptamer structures may be quite idiosyncratic, with the information/function relation visible only in the mean of many examples. The relaxed side chain specificity of the hairpin (Fig. 7) does suggest that a single index like $K_D$ may be too narrow a criterion for comparison with molecular information content.

(22) The map secondary structure uses permutation 1 of UAUU motif and was the thermodynamically probable structure for tested aptamers as calculated by mfold (30).
Finally, we consider why the simplest motif was not the most frequent outcome. Smaller RNA motifs are expected to initially be most frequent, because abundance in an unfractionated randomized pool is expected to increase exponentially as the number of required nucleotides decreases.
On the other hand, more active or more easily amplified RNA structures can sometimes be selected instead of simpler ones (25) in prolonged selections. For example, it has been suggested (for GTP binding) that 10-fold tighter binding can result from ~10 more bits of information (for example, in five completely specified nucleotides), equivalent to a ~1000-fold decrease in abundance (23). However, this line of reasoning is very unpromising in our case. First, the selection stringency was not set higher than previously (16,22). Selection succeeded at the fifth cycle, and was halted after the seventh cycle, so simple motifs should survive. Previous selections dominated by the UAUU motif were stopped at the sixth or eighth cycles for a 26- or 22-nucleotide random region, respectively (22). Even more convincing, similar previous selections failed to detect the hairpin, which has only about a 2-fold $K_D$ advantage. Thus, we must look for other explanations.
Here we have calculated information content to characterize size. In these terms, why is a more complex structure containing ~37.9 bits of information (the new hairpin) 35-fold more prevalent than the previously defined simplest 33.2-bit structure (the UAUU internal loop)? The hairpin should be 26-fold ($2^{37.9}$ divided by $2^{33.2}$) less frequent on the basis of complexity, about a 910-fold (= 35 × 26) disparity from our actual result.
In rationalizing this selection outcome, we suggest two important influences: the accidental inclusion of an AAA sequence in the 3Ј-primer site, later proven to be essential for hairpin activity (Fig. 6A), and use of a short 28-nucleotide random tract.
The presence of the first condition can be considered as a supply of six bits of information, equivalent to a ~64-fold increase in abundance favoring the newly selected hairpin. With six bits of information subtracted, only 31.9 additional bits are required for the newly selected motif. This inverts the abundances, making the hairpin loop ~2.5-fold ($2^{33.2}$ divided by $2^{31.9}$) more frequent than the UAUU motif.
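The abundance estimates in this part of the Discussion all follow from one rule: in a random pool, each additional bit of required information halves the expected frequency of a motif. The sketch below reproduces the quoted fold changes from the bit values given in the text. The function and variable names are illustrative.

```python
def fold_change(bits):
    """Expected abundance ratio corresponding to a difference in bits."""
    return 2.0 ** bits

# Hairpin (37.9 bits) versus UAUU motif (33.2 bits): expected rarity.
complexity_penalty = fold_change(37.9 - 33.2)            # about 26-fold
# Observed 35-fold excess of the hairpin, combined with that penalty.
disparity = 35 * round(complexity_penalty)               # 910-fold

# Three primer-supplied adenosines at 2 bits each = 6 bits.
primer_bonus = fold_change(6)                            # 64-fold
remaining_bits = 37.9 - 6                                # 31.9 bits
hairpin_advantage = fold_change(33.2 - remaining_bits)   # about 2.5-fold
```

The same rule gives the ~1000-fold figure cited for GTP aptamers, since 10 bits correspond to a factor of 1024.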
The second condition will limit or eliminate the advantage of modularity for the UAUU motif, removing a potential statistical advantage of the smaller motif; the UAUU motif has two modules (34, 35) and the hairpin one. Consequently the UAUU motif will not benefit from its usual statistical advantage; in a larger randomized region the UAUU motif might be the only observed outcome.
Finally, there is a subtle condition whose significance is difficult to judge. Here we used new primer sites. These sequences are not complementary to each other, so they do not supply an initial formed stem that can be captured in selected structures. Because the UAUU motif can benefit in two ways from the initiation of helices, it might be disproportionately penalized when no helix is supplied.
Thus, several experimental features (possibly combined with unseen stochastic factors) produced a new outcome. In summary, although primer design is a little-explored aspect of selection planning, it clearly can be decisive and deserves attention.

FIG. 7. Specificity test of R431. Internally labeled RNA was applied to the isoleucine-EAH-Sepharose column and eluted with 10 mM amino acid.