Identification and Solution Structure of a Highly Conserved C-terminal Domain within ORF1p Required for Retrotransposition of Long Interspersed Nuclear Element-1*

Long interspersed nuclear element-1 (LINE-1 or L1) retrotransposons comprise a large fraction of the human and mouse genomes. The mobility of these successful elements requires the protein encoded by open reading frame-1 (ORF1p), which binds single-stranded RNA with high affinity and functions as a nucleic acid chaperone. In this report, we have used limited proteolysis, filter binding, and NMR spectroscopy to characterize the global structure of ORF1p and the three-dimensional structure of a highly conserved RNA binding domain. ORF1p contains three structured regions: a coiled-coil domain, a middle domain of unknown function, and a C-terminal domain (CTD). We show that high affinity RNA binding by ORF1p requires the CTD and residues within an ~40-amino acid protease-sensitive segment that joins the CTD to the middle domain. Insights into the mechanism of RNA binding were obtained by determining the solution structure of the CTD, which is shown to adopt a novel fold consisting of a three-stranded β sheet that is packed against three α-helices. An RNA binding surface on the CTD has been localized using chemical shift perturbation experiments and is proximal to residues previously shown to be essential for retrotransposition, RNA binding, and chaperone activity. A similar structure and mechanism of RNA binding are expected for all vertebrate L1 elements, since residues encoding the middle domain, protease-sensitive segment, and CTD are highly conserved.

L1s belong to a larger group of elements known as non-long terminal repeat retrotransposons that replicate via an RNA intermediate using a target site-primed reverse transcription mechanism (2-5). In mammals, L1s are 6-7 kilobases in length and harbor a 5′-untranslated region containing an internal promoter, two nonoverlapping open reading frames (ORF1 and ORF2), and a short 3′-untranslated region that ends in a poly(A) tail. Both of the element-encoded proteins are essential for retrotransposition (14-16). The protein encoded by ORF1 (ORF1p) binds single-stranded RNA and DNA and has nucleic acid chaperone activity (15,17). The protein encoded by ORF2 (ORF2p) provides endonuclease and reverse transcriptase functions (14,18,19). The role of ORF2p in retrotransposition is best understood (19,20). It is believed to initiate the target site-primed reverse transcription by introducing a single nick into the target DNA using a conserved endonuclease domain (21). This enables the 3′ poly(A)-rich tail to anneal to the chromosome and exposes a 3′-hydroxyl on the DNA that serves as a primer for ORF2p-mediated reverse transcription using L1 RNA as the template. A double-stranded DNA copy of L1 is then synthesized and joined to the target site through a poorly understood process that may require DNA repair machinery supplied by the host (22).
ORF1p is essential for retrotransposition and has at least two functions (21). First, it binds to L1 RNA during the cytoplasmic phase of the replication cycle and may thereby prevent its degradation. Evidence for this comes from immunochemical staining of embryonal carcinoma cells, which has revealed punctate cytoplasmic structures when probed with anti-ORF1p antibodies (23). These structures presumably contain several ORF1 proteins bound to full-length L1 RNA, since large ribonucleoprotein particles containing both of these molecules can be fractionated from intact F9 (mouse) (24) and HeLa (human) cells (16). The ribonucleoprotein particle probably protects L1 RNA from degradation by cellular nucleases and via the RNA interference pathway (25). It may also facilitate RNA maturation into a retrotransposition-competent structure and/or nuclear import (21). ORF1p may also directly participate in the target site-primed reverse transcription reaction by facilitating nucleic acid strand transfers involved in reverse transcription. This is supported by in vitro studies, which have shown that it is a nucleic acid chaperone; the isolated protein accelerates strand annealing, promotes melting of mismatched duplexes, and in displacement assays facilitates strand exchange (17). Interestingly, the chaperone activity appears to be required for retrotransposition, since single amino acid ORF1p mutants have been identified that disrupt this function and retrotransposition (15) but are otherwise capable of forming ribonucleoprotein particles (16).
The structure of ORF1p from the retrotransposition-competent mouse Tf-type L1 element has been characterized in the greatest detail (42.9 kDa, 371 amino acids) (15). Sedimentation and atomic force microscopy (AFM) studies indicate that it is a dumbbell-shaped homotrimer that is presumably held together by a trimeric coiled-coil (26). Residues at the C-terminal end of ORF1p show the greatest degree of sequence conservation and bind and chaperone nucleic acids (27,28). In this report, limited proteolysis has been used to identify three distinct domains: the coiled-coil, middle, and C-terminal domains. The latter two domains are connected by an ~40-amino acid protease-sensitive linker that, in combination with the C-terminal domain (CTD), mediates high affinity RNA binding. NMR has been used to determine the solution structure of the CTD and has revealed a positively charged surface that interacts with RNA.

Proteolysis and Mass Spectrometry Analysis of Spa-type ORF1p-Full-length ORF1 protein from the mouse spa-type L1 element was purified from baculovirus-infected insect cells as described previously (15). Partial proteolysis was performed by adding sequencing grade trypsin (Sigma) to 10 μl of protein (~2 mg/ml) either on ice or at room temperature (buffer: 50 mM phosphate, pH 7.6, 250 mM NaCl, and 0.15 mM EDTA). The reaction was quenched by the addition of either an equal volume of SDS-PAGE loading dye or 2% trifluoroacetic acid. Samples containing loading dye were loaded onto a NuPAGE 4-12% Bis-Tris gel (Invitrogen). After separation, the gel was stained with Coomassie Brilliant Blue staining buffer and fixed (29). Bands were excised, and the resulting gel cuts were washed multiple times with 100 mM ammonium bicarbonate and with 50% acetonitrile, 50 mM ammonium bicarbonate. Trypsin (10 μl of a 20 ng/μl solution) was then added to each of the gel cuts for 4 h at 37°C. Each gel cut was then washed multiple times with 50% acetonitrile, 1% trifluoroacetic acid. The supernatant was collected and dried using a SpeedVac (Thermo). The dried sample was then redissolved in 0.1% formic acid. Liquid chromatography (LC)-MS/MS of peptide mixtures was conducted on an LC Packings nano-LC system with a nanoelectrospray interface (Protana) and quadrupole TOF mass spectrometer (QSTAR XL; Applied Biosystems), as described in detail previously (30). Detected ions were then analyzed using the MASCOT data base search engine (Matrix Science) and the Pro ID program (Applied Biosystems) to determine the identity of the peptides. Positive protein identification was based on standard MASCOT criteria for statistical analysis of LC-MS/MS data. For matrix-assisted laser desorption/ionization (MALDI)-MS measurements, peptides from samples that were quenched with 2% trifluoroacetic acid were desalted using a C4 ZipTip (Millipore). MALDI-TOF (Voyager DE-STR; Applied Biosystems) was then employed to determine the masses of the proteolysis fragments (fragments 1-8) after co-crystallization with matrix (sinapinic acid).
Sample Preparation of Proteins Used in NMR and Proteolysis Studies-Expression plasmids encoding fragments of ORF1p from the Md-A-type L1 element were generated by PCR using L1 Md-A101 as a template (a gift from H. Kazazian; GenBank™ number AY053456). Nucleotide sequences encoding residues 230-357 and 261-347 were cloned into a pET11a expression vector (Novagen) to produce the ORF1 C-1/3 and ORF1 CTD proteins, respectively. Both expression vectors were sequenced in the forward and reverse directions to confirm their fidelity. The ORF1 C-1/3 and ORF1 CTD proteins were overexpressed and purified in an identical manner from Rosetta-DE3 cells (Novagen). Four-liter cell cultures were grown at 37°C until the A600 reached ~0.7 and were then induced with 1 mM isopropyl 1-thio-β-D-galactopyranoside at 30°C for 4 h. The cells were then harvested at 8,000 rpm (JA-10 rotor), resuspended in lysis buffer (50 mM Tris-HCl, pH 8.0, 1 mM EDTA, 500 mM NaCl, 5 mM dithiothreitol (DTT), 1 mM phenylmethylsulfonyl fluoride, 200 μg/ml lysozyme), and sonicated on ice. The lysate was centrifuged at 15,000 rpm for 30 min (JA-20 rotor), and the supernatant was dialyzed into buffer A (50 mM Tris-HCl, pH 8.0, 1 mM EDTA, 150 mM NaCl, 5 mM DTT, and 1 M urea) and applied to a DEAE-Sepharose Fast Flow XK-50 column (Amersham Biosciences). Partially pure protein was eluted using a gradient of 1 M NaCl (0-100%) in buffer B (50 mM Tris-HCl, pH 8.0, 1 mM EDTA, 2.5 mM DTT). Pooled protein fractions were then applied to an SP-Sepharose Fast Flow XK-50 column (Amersham Biosciences) and eluted with a gradient of 1 M NaCl (0-100%) in buffer B. The protein was then concentrated to ~1 ml and loaded onto a gel filtration column (Sephacryl S-100) equilibrated with buffer C (50 mM phosphate, pH 6.5, 200 mM NaCl, 2.0 mM DTT). Eluted protein was then concentrated using a Centricon concentrator (YM-3, Amicon, Inc.) for NMR studies.
¹⁵N-Labeled or ¹⁵N- and ¹³C-labeled proteins were produced by growing the cells in minimal media that contained ¹⁵NH₄Cl or ¹⁵NH₄Cl and ¹³C₆-labeled glucose, respectively. Protein NMR samples were ~1 mM in concentration and contained 50 mM phosphate buffer, pH 6.5, 200 mM NaCl, 5 mM deuterated dithiothreitol, and 7% D₂O.
Structure Calculations-Structures were calculated using the program X-PLOR-NIH (version 2.14) (53). The final values for the different experimental terms in the target function employed for simulated annealing were 200 kcal mol⁻¹ rad⁻² for the torsion angle restraints, 30 kcal mol⁻¹ Å⁻² for the experimental distance restraints (NOEs and hydrogen bonds), 1 kcal mol⁻¹ Hz⁻² for the ³JHNα couplings, and 0.5 kcal mol⁻¹ ppm⁻² for the carbon chemical shifts. Hydrogen bond restraints were introduced in the final stages of refinement and were determined through an analysis of the NOE data. The structures were calculated using hybrid distance geometry-simulated annealing protocols, followed by an additional simulated annealing step on each of the initial distance geometry-simulated annealing structures (54-56). The figures were prepared using the programs MOLMOL and PyMOL (57,58). A total of 200 structures were calculated, of which 110 had no NOE, dihedral angle, or scalar coupling violations greater than 0.5 Å, 5°, or 2 Hz, respectively. The 40 conformers with the lowest energies were chosen to represent the structure of ORF1 CTD and have been deposited in the Protein Data Bank (accession code 2JRB).
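The selection step described above (discard conformers with violations above the stated cut-offs, then keep the lowest-energy members) can be sketched as follows. This is only an illustration of the filtering logic, not the X-PLOR-NIH protocol itself; the `Conformer` record and the function name `select_ensemble` are hypothetical, and the energies in the usage note are made-up numbers.

```python
from dataclasses import dataclass


@dataclass
class Conformer:
    """Hypothetical per-structure summary (not an X-PLOR-NIH data type)."""
    energy: float             # total target-function energy, arbitrary units
    max_noe_viol: float       # largest NOE distance violation, in Å
    max_dihedral_viol: float  # largest dihedral angle violation, in degrees
    max_coupling_viol: float  # largest scalar coupling violation, in Hz


def select_ensemble(conformers, n_keep=40,
                    noe_cut=0.5, dihedral_cut=5.0, coupling_cut=2.0):
    """Keep structures with no violation above the cut-offs, lowest energy first."""
    accepted = [
        c for c in conformers
        if c.max_noe_viol <= noe_cut
        and c.max_dihedral_viol <= dihedral_cut
        and c.max_coupling_viol <= coupling_cut
    ]
    # Sort the violation-free structures by energy and keep the best n_keep.
    return sorted(accepted, key=lambda c: c.energy)[:n_keep]
```

With the numbers given in the text, a pool of 200 calculated structures would pass 110 through the filter, and `n_keep=40` would return the final ensemble.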
Nitrocellulose Filter Binding Assay-The filter binding assay was performed as previously described (27). Briefly, ORF1p truncation constructs were incubated with a 76-nucleotide ³²P-labeled RNA molecule for 30 min on ice in binding buffer (20 mM Hepes, pH 7.5, 250 mM NaCl, 1 mM dithiothreitol, 5 mM EDTA, and 100 μg/ml bovine serum albumin). The reactions were filtered through nitrocellulose and DE81 and washed with ice-cold binding buffer, and the remaining labeled RNA was quantified by phosphorimaging analysis using ImageQuant software (Amersham Biosciences). The apparent dissociation constant (KD(app)) was calculated by plotting the fraction of the RNA that was bound (y) as a function of the total concentration of protein ([P total]). Two models were used to fit the data. Noncooperative binding was assumed by fitting the data to the equation $y = ([P_{total}]/K_{D(app)})/(1 + [P_{total}]/K_{D(app)})$, in which [P free] is taken to be approximately equal to [P total]. Data were also fit to a cooperative binding model using the equation $y = ([P_{total}]/K_{D(app)})^{h_a}/(1 + ([P_{total}]/K_{D(app)})^{h_a})$, where $h_a$ is the apparent Hill coefficient.
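The two binding models can be written down in a few lines. The sketch below is an assumption-laden illustration rather than the fitting procedure actually used: the cooperative model is taken to be the standard Hill form, which reduces to the noncooperative expression when the Hill coefficient is 1, and the coarse grid search in `fit_hill_grid` (a hypothetical helper) stands in for whatever nonlinear regression software was employed.

```python
def fraction_bound_noncooperative(p_total, kd_app):
    """y = (P/K) / (1 + P/K), with [P free] approximated by [P total]."""
    x = p_total / kd_app
    return x / (1.0 + x)


def fraction_bound_hill(p_total, kd_app, hill):
    """Standard Hill form (assumed): y = (P/K)^h / (1 + (P/K)^h)."""
    x = (p_total / kd_app) ** hill
    return x / (1.0 + x)


def fit_hill_grid(data, kd_grid, hill_grid):
    """Least-squares grid search over candidate (KD(app), h) pairs.

    data is a list of (protein concentration, fraction bound) points.
    A grid search is a stand-in for nonlinear regression, chosen only
    to keep the sketch dependency-free.
    """
    best = None
    for kd in kd_grid:
        for h in hill_grid:
            sse = sum((y - fraction_bound_hill(p, kd, h)) ** 2 for p, y in data)
            if best is None or sse < best[0]:
                best = (sse, kd, h)
    return best[1], best[2]
```

In both models the fraction bound is 0.5 when the protein concentration equals KD(app), which is why the two fits can return similar KD(app) values even when the Hill coefficient differs from 1.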

Mouse ORF1p Contains Three Protease-resistant Regions-The structure of full-length ORF1p from the mouse Tf-type spa L1 element was investigated using limited proteolysis followed by MS. Full-length ORF1p containing a 39-amino acid N-terminal histidine tag was partially proteolyzed with trypsin, and the products were separated by gel electrophoresis (Fig. 1a). The identity of each fragment was then determined using MALDI-TOF MS and LC coupled to tandem MS/MS (Fig. 1b). Immediately after the addition of trypsin at 4°C, the peptide bond following residue Arg-12 is cleaved, leaving a resistant fragment that is missing the histidine tag and 12 N-terminal residues (fragment 2 in lanes 2-6). Further low temperature incubation with trypsin yields six fragments (fragments 3-8 in lanes 3-5). Fragments 3-5 share a common N terminus that starts at residue 45 but differ at their C termini (fragments 3-5 are formed by residues 45-279, 45-247, and 45-238, respectively). Extended digestion reveals that these fragments contain two resistant segments within them: residues 45-169, called the coiled-coil (C-C) domain, and residues 176-238, called the middle (M) domain. The C-C domain is the most stable portion of ORF1p, since it resists trypsin cleavage at higher temperatures and longer incubation times (fragment 6 in lane 7). Residues 63-130 within it are predicted by the program MULTICOIL to form a coiled-coil (59). The boundaries of the M domain were defined indirectly, because it was only weakly visible on the SDS gel after 5 h of incubation at 4°C and because its position overlapped with that of fragment 8. The C-terminal boundary of the M domain can be deduced from the sequence of fragment 5, whereas its N terminus was determined by conducting similar digestion studies on a shorter ORF1p polypeptide that encoded only residues 146-358 (data not shown). A third protected segment is located at the C-terminal end of the protein and is contained in fragments 7 and 8. Both share a common N terminus but differ slightly at their C termini; fragments 7 and 8 correspond to residues 285-368 and 285-360, respectively. The shorter fragment is referred to as the CTD (residues 285-360).

The digestion data therefore suggest that residues 1-44 of ORF1p are unstructured. This is followed by an extremely stable coiled-coil-containing segment (residues 45-169) that is joined by a short six-amino acid linker to the M domain (residues 176-238). Residues within the M domain are predicted to form both sheet and helical structures using the GOR4 secondary structure prediction program (60) and are therefore unlikely to be a more protease-sensitive extension of the coiled-coil. The C-C and M domains are then connected by an ~40-amino acid linker to the CTD. The long linker is much more susceptible to proteolysis than the linkage between the C-C and M domains, suggesting that it is structurally disordered in the full-length protein.
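The linker lengths quoted above follow directly from the fragment boundaries. A minimal sketch of that bookkeeping is shown below, using the spa-numbered domain limits given in the text; the `DOMAINS` table and the function name are illustrative only.

```python
# Protease-resistant segments defined by limited proteolysis (spa numbering).
DOMAINS = [
    ("C-C", 45, 169),
    ("M", 176, 238),
    ("CTD", 285, 360),
]


def linker_lengths(domains):
    """Number of residues lying between each pair of consecutive domains."""
    gaps = []
    for (name_a, _, end_a), (name_b, start_b, _) in zip(domains, domains[1:]):
        # Residues end_a+1 .. start_b-1 belong to neither domain.
        gaps.append((name_a, name_b, start_b - end_a - 1))
    return gaps
```

Counted this way, six residues (170-175) separate the C-C and M domains, and 46 residues (239-284) separate the M domain from the CTD, in line with the ~40-amino acid linker described in the text.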
The C-terminal third of the primary sequence of ORF1p contains determinants required for RNA binding and nucleic acid chaperone activity (61). This region is called C-1/3 and contains the CTD and the protease-sensitive linker identified by limited proteolysis (Fig. 1b). In order to more accurately delineate the boundaries of the CTD and determine if the linker is structurally disordered, NMR was used to study the ORF1 protein from the L1 Md-A element (61). This protein is nearly identical in primary sequence to the spa-type ORF1 protein characterized by MS (Fig. 2; 83% sequence identity). However, it contains 14 fewer residues in its C-C domain, such that its primary sequence is numbered differently. Hereafter, unless otherwise indicated, the amino acid numbering system used for spa-type ORF1p will also be used when referring to the Md-A protein.
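Because the two proteins differ by 14 residues in the C-C domain, residue numbers C-terminal to that domain are shifted by a constant. The sketch below illustrates the conversion under the assumption that the whole 14-residue difference lies N-terminal to the residue being converted; the exact position of the gap within the C-C domain is not stated here, so the function (a hypothetical helper) should not be applied to residues inside or before that domain.

```python
CC_DOMAIN_OFFSET = 14  # Md-A ORF1p has 14 fewer residues in its C-C domain


def mda_to_spa(residue_number):
    """Convert an Md-A residue number to spa numbering.

    Valid only for residues C-terminal to the C-C domain (assumption).
    """
    return residue_number + CC_DOMAIN_OFFSET


def spa_to_mda(residue_number):
    """Inverse conversion, with the same restriction."""
    return residue_number - CC_DOMAIN_OFFSET
```

For example, the cloned Md-A segments 230-357 and 261-347 correspond to residues 244-371 (ORF1 C-1/3) and 275-361 (ORF1 CTD) in spa numbering.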
Two polypeptides from Md-A ORF1p were studied by NMR, one containing the entire C-1/3 region (ORF1 C-1/3; residues 244-371) and the other containing only the CTD (ORF1 CTD; residues 275-361). Initially, triple resonance NMR methods were used to assign the backbone resonances of isotopically ¹³C- and ¹⁵N-labeled ORF1 CTD, which exhibits excellent spectral dispersion characteristic of a globular protein (Fig. 3a). The boundaries of the CTD were then further refined by measuring the heteronuclear {¹H}-¹⁵N NOE values of its backbone amide nitrogen atoms. A plot of these data as a function of residue number is shown in Fig. 3b and reveals that the structured CTD consists of residues Phe289-Ile353. Since trypsin only probes the accessibility of arginine and lysine residues, it is not surprising that the domain boundaries defined by NMR are slightly narrower than those predicted by proteolysis (fragment 8, residues 285-360). An overlay of the ¹H-¹⁵N HSQC spectra of ¹⁵N-labeled ORF1 CTD (gray) and ORF1 C-1/3 (black) reveals similar chemical shifts for backbone amide atoms in the CTD, indicating that this folded structure is preserved in the longer polypeptide containing the entire C-1/3 region (Fig. 3c). The additional resonances in the spectrum of ORF1 C-1/3 originate from residues 244-274 and 362-371, which precede and follow the CTD, respectively. These residues are unstructured in ORF1 C-1/3, since their NMR resonances exhibit narrow line widths and negative heteronuclear {¹H}-¹⁵N NOE values (data not shown). Combined, the proteolysis and NMR data indicate that the C-1/3 region in ORF1p consists of an autonomously folded CTD that is surrounded by structurally disordered N- and C-terminal polypeptide tails.
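The remark that trypsin only reports on arginine and lysine positions can be made concrete with a small in silico sketch. It lists candidate cleavage sites using the common rule that trypsin cuts C-terminal to Lys or Arg except when the next residue is Pro; the rule is a general convention and an assumption here, and the sequence in the usage note is a made-up peptide, not ORF1p.

```python
def tryptic_sites(sequence):
    """Return 1-based positions after which trypsin is expected to cleave.

    Rule assumed: cleave after K or R unless the following residue is P.
    The C-terminal residue is skipped because nothing follows it.
    """
    sites = []
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    return sites
```

For the hypothetical peptide `"AKRPGKLR"`, the function reports sites after positions 2 and 6 only, so a domain boundary that falls between two such sites cannot be located more precisely by this protease. That is consistent with NMR giving a slightly narrower domain than proteolysis.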
FIGURE 1 (partial legend). At the top of the figure, the ORF1 protein has been represented schematically as containing a sequence predicted to form a coiled-coil by MULTICOIL using a 50% prediction cut-off and a 21-residue window (C-C; residues 63-130) and a C-terminal basic region that has previously been shown to bind RNA (28). The two truncated protein constructs used in the NMR studies are shown directly below: ORF1 C-1/3 and ORF1 CTD. The three protected regions found by partial proteolysis are indicated as white rectangles: the C-C domain, M domain, and CTD. The eight major proteolysis fragments are numbered as indicated in b. Fragment 1 is the full-length protein including the purification tag, fragment 2 contains residues 13-371, fragment 3 contains residues 45-279, fragment 4 contains residues 45-247, fragment 5 contains residues 45-238, fragment 6 contains residues 45-169, fragment 7 contains residues 285-368, and fragment 8 contains residues 285-360 (spa nomenclature). TAG indicates the location of the poly-His tag used to facilitate purification.

FIGURE 2. Sequence alignment of ORF1 proteins from different species. Sequence alignments were created using ClustalW (66). Identical amino acids are shaded black, and similar residues are shaded gray. The C-C domain (black rectangle), the M domain (white rectangle), and the secondary structure topology of the CTD are shown above the aligned sequences. Single asterisks indicate residues that are <30% solvent-accessible in the NMR structure. Accession numbers are as follows: Mus musculus ORF1p spa Tf type (from retrotransposon L1 Spa) (AAC53541.1), M. musculus ORF1p Md-A A-type (from retrotransposon L1 Md-A101) (M13002), Homo sapiens (from retrotransposon L1.2) (AAC51278), Rattus norvegicus (S21345), Oryzias latipes (AF055640), Danio rerio (CAD61093), and Fundulus heteroclitus (AAS83199). The first three ORF1 sequences are from retrotransposons previously determined to be active in vivo (14,15,67,68). ORF1 C-1/3 contains residues 244-371, and ORF1 CTD contains residues 275-361 (spa numbering). The partial proteolysis studies made use of the spa sequence, whereas the NMR structure work used Md-A ORF1p. These proteins share 83% sequence identity over their entirety.

The Protease-sensitive Region within C-1/3 Is Involved in RNA Binding-We investigated whether the unstructured polypeptide segments that surround the CTD were involved in RNA binding using a nitrocellulose filter binding assay (62). This work made use of ORF1p from the mouse L1 Md-A element (used in all of the NMR studies) and a 76-nucleotide RNA fragment that has previously been shown to bind full-length ORF1p with high affinity (27). The RNA binding properties of three ORF1p variants were investigated: 1) full-length ORF1p, 2) ORF1 C-1/3, and 3) ORF1 CTD. Since the stoichiometry of binding to the 76-nucleotide RNA fragment is not known, only apparent dissociation constants were determined. The data were fit to two different binding models: one that assumed binding cooperativity and one that assumed noncooperative binding. Both models yielded similar values for the apparent dissociation constant (KD(app)). Interestingly, the ORF1 C-1/3 and ORF1 CTD binding data are best fit if cooperativity is assumed, and in each case a similar Hill coefficient (h) of ~1.7 is obtained (Fig. 4). Consistent with previous studies, full-length ORF1p binds RNA with a KD(app) of 6.1 ± 0.4 nM (data not shown). In contrast, the ORF1 C-1/3 polypeptide binds RNA ~30-fold less tightly; fitting of the ORF1 C-1/3 binding data to noncooperative and cooperative models yields KD(app) values of 181 ± 8 and 180 ± 3 nM, respectively. The reduction in affinity is compatible with a simple avidity effect, since ORF1 C-1/3 is monomeric, whereas full-length ORF1p is trimeric (26). Notably, ORF1 CTD binds RNA ~20-fold less tightly than ORF1 C-1/3, with noncooperative and cooperative KD(app) values of 4,130 ± 580 and 3,500 ± 420 nM, respectively. This indicates that amino acids within the disordered tails that surround the CTD participate in RNA binding. Additional binding experiments using a protein construct that lacked only the unstructured N-terminal residues of ORF1 C-1/3 revealed that it binds RNA with an affinity similar to that of ORF1 CTD (data not shown). This indicates that amino acids within the N-terminal unstructured tail collaborate with the CTD to mediate high affinity RNA binding.
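The fold differences quoted for the three constructs can be checked from the apparent dissociation constants. The sketch below does that arithmetic and, as an optional illustration, converts each ratio into an apparent free energy difference with ΔΔG = RT ln(ratio). The free energy step is not part of the original analysis, uses apparent constants of unknown stoichiometry, and assumes a temperature of 273.15 K because the binding reactions were incubated on ice.

```python
import math

# Apparent dissociation constants in nM (noncooperative fits, from the text).
KD_APP_NM = {"full-length": 6.1, "C-1/3": 181.0, "CTD": 4130.0}

GAS_CONSTANT = 1.987204e-3  # kcal mol^-1 K^-1


def fold_weaker(kd_tight, kd_weak):
    """How many times weaker the second construct binds than the first."""
    return kd_weak / kd_tight


def apparent_ddg(kd_tight, kd_weak, temperature_k=273.15):
    """Illustrative apparent free energy difference in kcal/mol (assumption)."""
    return GAS_CONSTANT * temperature_k * math.log(kd_weak / kd_tight)
```

With these values, ORF1 C-1/3 binds roughly 30-fold more weakly than the full-length protein and ORF1 CTD roughly 23-fold more weakly than ORF1 C-1/3, matching the ~30-fold and ~20-fold figures in the text.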
Solution Structure of the CTD (ORF1 CTD)-NMR was used to solve the structure of the CTD within ORF1p from the L1 Md-A element (ORF1 CTD; residues 275-361). A total of 1,136 experimentally determined structural restraints (940 NOEs, 120 dihedral angles, 56 hydrogen bonds, and 20 ³JHNα scalar couplings) were obtained using standard NMR methods applied to uniformly ¹⁵N-enriched and ¹⁵N- and ¹³C-enriched samples of ORF1 CTD. These data were then used as input in simulated annealing calculations to generate an ensemble of 40 conformers that represent the structure. As shown in Fig. 5a, Phe289-Ile353 form a well ordered domain whose backbone and heavy atom coordinates can be superimposed on the mean structure with root mean square deviations of 0.42 ± 0.09 and 0.93 ± 0.07 Å, respectively. All 40 of the calculated structures possess good covalent geometry and have no NOE, dihedral angle, or scalar coupling violations greater than 0.5 Å, 5°, or 2 Hz, respectively. A summary of the structural and restraint statistics is presented in Table 1.
The structure of ORF1 CTD contains three α helices that are packed against a three-stranded β sheet (αβββαα secondary structure topology) (Fig. 5b). A DALI analysis reveals that this is a novel RNA binding fold, since it is related to only one protein structure with a Z-score greater than 2.0, the hypothetical Escherichia coli protein YkfF (Protein Data Bank code 2HJJ), which has no known function and is unrelated to ORF1p at the primary sequence level. The structure is initiated by residues Thr293-Leu307, which form a long amphipathic α-helix (helix α1) that leads into a three-stranded anti-parallel β-sheet that is constructed from contiguous residues (strand β1, residues Arg315-Leu317; strand β2, residues Lys321-Ile324; and strand β3, residues Lys331-Phe333). Two α helices positioned at an angle of ~110°, α2 (residues Lys336-Ser344) and α3 (residues Pro347-Ile352), complete the fold and are separated by a two-residue turn. All of the helices are packed onto the same side of the sheet, with the long axis of helix α3 positioned perpendicular to the strands of the sheet and the axes of helices α1 and α2 positioned nearly parallel to strands β1 and β2, respectively. Each of the secondary structural elements contributes residues to the hydrophobic core of the protein (Fig. 5c). Residues located in the hydrophobic core are the following: Ala296, Ala299, Trp300, Val303, and Leu307 from α1; Leu322 and Ile324 from strand β2; Ile326 from the loop between strands β2 and β3; Phe333 from strand β3; Phe339, Tyr342, and Leu343 from α2; and Leu349 from helix α3. As compared with the CTD from the spa element discovered by limited proteolysis (Fig. 1a), the primary sequence of Md-A ORF1 CTD differs at only four positions (Fig. 2). None of these differences are expected to alter the structure of the CTD, since residues Leu312 (in the α1-β1 loop), Ile325 (in the β2-β3 loop), Asp327 (in the β2-β3 loop), and His341 (in helix α2) in the Md-A ORF1 CTD studied by NMR are solvent-exposed or conservatively replaced in spa ORF1 CTD.

FIGURE 4 (partial legend). The figure shows the results of a filter binding assay used to measure the binding of ORF1 CTD (circles) and ORF1 C-1/3 (triangles) to a 76-mer RNA molecule. The data were fit to binding models that assumed 1:1 binding stoichiometry and either noncooperative (dashed line) or cooperative binding (solid line). Data fitting assuming noncooperative binding yielded an apparent dissociation constant (KD(app)) of 4,130 ± 580 and 181 ± 8 nM for ORF1 CTD and ORF1 C-1/3, respectively. Fits to a cooperative binding model yielded KD(app) values of 3,500 ± 420 and 180 ± 3 nM for ORF1 CTD and ORF1 C-1/3, respectively. Hill coefficients (h) of 1.7 and 1.8 were obtained for ORF1 CTD and ORF1 C-1/3, respectively. Fitting assuming noncooperative and cooperative binding reproduced the experimental data with R² values of >0.97 and >0.99, respectively.
NMR Localization of an RNA Binding Surface on ORF1p-In order to localize the RNA binding surface on ORF1 C-1/3, we performed a series of NMR titration experiments using ¹⁵N-labeled ORF1 C-1/3 and three different RNA oligonucleotides: 1) a 43-mer RNA molecule from the L1 genome that binds the full-length mouse ORF1p (27), 2) a single-stranded 9-mer RNA molecule, and 3) a 32-mer RNA molecule that forms a hairpin and thus contains both double-stranded RNA and an exposed tetraloop (63). The choice of the RNA molecules is somewhat arbitrary, since ORF1p presumably binds RNA non-sequence-specifically (27). The 43-mer was chosen because it had previously been shown to bind ORF1p (27). The 9-mer RNA molecule was chosen because ORF1p preferentially interacts with single-stranded oligonucleotides, and based on the structure of the CTD, it should be long enough to fully contact any potential binding surface on the protein. Finally, the 32-mer hairpin was chosen because it contains both double- and single-stranded features, thereby increasing the chances of it favorably interacting with the protein. In separate experiments, each of the RNA molecules was added to the protein, and ¹H-¹⁵N HSQC spectra were recorded. The addition of the 9-mer single-stranded RNA molecule to a 200 μM solution of ORF1 C-1/3 dissolved in low salt buffer (50 mM PO₄, pH 6.5) caused negligible spectral changes. ORF1 C-1/3 binds this oligonucleotide very weakly, since no spectral changes occur even when the 9-mer RNA is present at a 10-fold molar excess (data not shown). In contrast, the addition of the longer 32- and 43-mer RNA fragments to the protein dissolved in low salt buffer results in immediate precipitation, indicating that ORF1 C-1/3 exhibits some degree of length or sequence specificity in its binding.
In order to obtain samples suitable for NMR analysis, protein-RNA complexes containing the longer 32- and 43-mer RNA molecules were first formed in high salt buffer (50 mM phosphate, pH 6.5, 2.5 mM DTT, 1 M NaCl), since under these conditions, the components can be mixed together without immediately precipitating. Each complex was then dialyzed into low salt buffer and concentrated for NMR analysis. For each RNA molecule, five RNA complexes were studied that differed in the ratio of RNA to protein (RNA/protein 1:4, 1:2, 1:1, 2:1, and 4:1). Although the RNA-ORF1 C-1/3 complexes containing the 32- and 43-mer molecules were soluble, their ¹H-¹⁵N HSQC spectra exhibited extensive resonance line broadening at all RNA/protein ratios. Resonances arising from atoms in both the CTD and the disordered N-terminal linker disappeared completely, suggesting that both elements in ORF1 C-1/3 are in contact with RNA, and/or the complex forms a soluble higher order aggregate.
Interpretable chemical shift perturbation data were obtained when ORF1 CTD and the 32-mer RNA molecule were studied. ORF1 CTD lacks the amino-terminal tail and, in complex with the 32-mer, yields ¹H-¹⁵N HSQC spectra that exhibit selective chemical shift perturbations and line broadening. Fig. 6a shows residues exhibiting composite chemical shift (Δδtotal) changes greater than 0.12 mapped onto the structure of ORF1 CTD. This analysis reveals a continuous surface that is composed of residues in helix α1, strand β1, and residues in the turns connecting strands β1 to β2 and β3 to helix α2. An inspection of the electrostatic surface of ORF1 CTD reveals that these residues form a positively charged surface that is well suited for RNA binding (Fig. 6b). The distinct behavior of ORF1 C-1/3 and ORF1 CTD upon titration of the 32-mer RNA is compatible with the stronger affinity of ORF1 C-1/3 for RNA measured by filter binding (Fig. 4) and further supports the notion that the N-terminal flexible tail collaborates with the CTD for optimal RNA binding.
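The mapping step described above amounts to computing one composite shift per residue and keeping those above the 0.12 cut-off. How the composite value was calculated is not stated in this section, so the sketch below assumes a commonly used weighted Euclidean combination of the ¹H and ¹⁵N changes; the nitrogen weight of 0.2, the function names, and the example values in the test are all assumptions for illustration, not data from the study.

```python
import math


def composite_shift(delta_h, delta_n, n_weight=0.2):
    """Weighted combination of 1H and 15N shift changes (assumed form).

    delta_h and delta_n are shift changes in ppm; n_weight scales the
    15N change to the 1H scale and is an assumed value.
    """
    return math.sqrt(delta_h ** 2 + (n_weight * delta_n) ** 2)


def perturbed_residues(shift_changes, threshold=0.12):
    """Return residue labels whose composite shift exceeds the threshold.

    shift_changes maps a residue label to a (delta_h, delta_n) pair.
    """
    return [
        label
        for label, (delta_h, delta_n) in shift_changes.items()
        if composite_shift(delta_h, delta_n) > threshold
    ]
```

Residues whose resonances broaden beyond detection would have to be tracked separately, since no shift change can be measured for them.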

DISCUSSION
The notation of the NMR structures is as follows: ⟨SA⟩ are the final 40 simulated annealing structures; (SA)r is the average energy-minimized structure. The number of terms for each restraint is given in parentheses. a, None of the structures exhibited distance violations greater than 0.5 Å, dihedral angle violations greater than 5°, or coupling constant violations greater than 2 Hz. b, Two distance restraints were employed for each hydrogen bond (r(NH···O) < 2.5 Å and r(N···O) < 3.5 Å). c, The experimental dihedral angle restraints were as follows: 53 φ, 53 ψ, 11 χ1, and 4 χ2 angular restraints.

L1 retrotransposons have successfully colonized a large fraction of the human and mouse genomes and require L1-encoded ORF1p to retrotranspose (2-5, 21). In this study, we investigated the structure and RNA binding properties of mouse ORF1p, which has been characterized in the greatest detail and shares significant sequence identity with human ORF1p. Previous sedimentation and AFM studies have revealed that it adopts a dumbbell-shaped homotrimeric structure in which the polypeptides are arranged in parallel (26,64). The handle of the dumbbell in the AFM image is 15.6 nm in length, whereas the lobes differ in size and have diameters of ~11 and ~4.5 nm (26). The domain structure deduced from our limited proteolysis study is compatible with this model and suggests which segments within ORF1p are likely to contribute to different parts of this structure (Fig. 7a). Mouse ORF1p contains at least three structured regions: 1) a segment that contains a predicted coiled-coil (C-C domain), 2) an acidic region in the middle of the primary sequence that we refer to as the M domain, and 3) a CTD. Presumably, the handle of the dumbbell is formed by the C-C domain, which is the most trypsin-resistant fragment in ORF1p. An analysis of its primary sequence using the program MULTICOIL predicts that residues 63-130 form a coiled-coil with a confidence level exceeding 50% when a 21-residue prediction window is used. This is compatible with studies that have shown that this region contains determinants required for oligomerization (26,28,64). At the N- and C-terminal ends of the C-C fragment, 18 and 39 amino acids surround the predicted coiled-coil, respectively. Although these residues are predicted to form helices, there is a significantly lower probability that they form a coiled-coil. If all of the residues in the C-C protease-resistant fragment form a coiled-coil, its predicted length would be 18.8 nm (assuming a rise of 1.5 Å per residue (65)), which is slightly larger than the value measured from the AFM image. The small lobe probably contains residues within the protease-sensitive N-terminal tail (residues 1-45), whereas the large lobe in the AFM image is presumably formed by the M domain and the CTD. An inspection of the primary sequences of ORF1 proteins from vertebrate L1 elements reveals a high degree of sequence conservation in residues that encode the M domain, linker, and CTD (Fig. 2). In contrast, residues within the C-C domain show little sequence conservation and in human ORF1p have been predicted to form a leucine zipper (21).

The C-terminal third of ORF1p binds RNA and accelerates annealing of complementary oligonucleotides (61). The protease cleavage data indicate that this region is formed by the CTD and a ~40-amino acid protease-sensitive linker that connects the CTD to the M domain. Four lines of evidence indicate that the linker and the CTD collaborate in RNA binding. First, we have shown that the isolated CTD binds RNA ~20-fold less tightly than a polypeptide containing the entire C-1/3 region (Fig. 4). Second, amino acid mutations within both of these elements disrupt retrotransposition and ribonucleoprotein particle formation in vivo (15,16). Third, our NMR titration experiments indicate that the C-1/3 region binds RNA more tightly than the isolated CTD. Finally, residues within the linker and CTD are highly conserved, compatible with an important role in RNA binding (Fig. 2).
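The 18.8 nm length estimate for the C-C fragment quoted earlier follows from simple arithmetic: the predicted coiled-coil (residues 63-130) plus the 18 N-terminal and 39 C-terminal flanking residues, multiplied by the assumed rise of 1.5 Å per residue. A minimal Python sketch of that calculation is given here; the function and variable names are hypothetical and chosen only for illustration.

```python
# Illustrative helper (hypothetical name, not from the paper).
RISE_PER_RESIDUE_ANGSTROM = 1.5  # rise per residue assumed in the text (65)


def helix_length_nm(n_residues: int,
                    rise_angstrom: float = RISE_PER_RESIDUE_ANGSTROM) -> float:
    """Return the length in nm of a helix spanning n_residues."""
    return n_residues * rise_angstrom / 10.0  # 10 Å per nm


predicted_core = 130 - 63 + 1  # residues 63-130 inclusive
n_flank = 18                   # residues N-terminal to the predicted coiled-coil
c_flank = 39                   # residues C-terminal to the predicted coiled-coil
fragment_residues = predicted_core + n_flank + c_flank

length_nm = helix_length_nm(fragment_residues)
afm_handle_nm = 15.6           # handle length measured from the AFM image (26)

print(f"{fragment_residues} residues -> {length_nm:.2f} nm "
      f"(AFM handle: {afm_handle_nm} nm)")
```

Under these assumptions the fragment spans 125 residues, and the computed length rounds to the 18.8 nm given in the text, somewhat longer than the 15.6 nm handle seen by AFM.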
Since the linker is highly sensitive to digestion in the absence of RNA, ORF1p may adopt a "hydra-like" structure in which the CTD RNA binding module in each chain of the trimer is connected to the M domain via a flexible linker (Fig. 7a). This model is compatible with sedimentation studies of isolated ORF1 C-1/3 that have shown that it does not form a hydrated compact sphere (64) and with our NMR experiments, which indicate that the linker is unstructured in ORF1 C-1/3. However, it is also conceivable that in the context of the full-length trimeric protein, the linker is structured as a result of inter-protomer interactions. Cleavage of one or more exposed sites within the structured linker could destabilize it sufficiently to lead to its complete and rapid digestion under the experimental conditions we have used.

... from the mean change (0.12 < Δδtotal ≤ 0.62) and are Glu292, Ala320, Arg315, Leu317, and Asp335. Also shown are residues with chemical shift perturbations that could not be determined either because they are significantly broadened or experience chemical shift perturbations in excess of 0.3 ppm (Phe289, Met294, Arg298, Gln305, and Arg308). b, the electrostatic surface potential diagram of ORF1 CTD. Basic and acidic regions are colored in blue and red, respectively. Hydrophobic regions are colored white. c, surface structure of ORF1 CTD showing the positions of residues whose mutation abolishes retrotransposition in vivo (red). ORF1p mutants harboring changes in Arg297 and Arg298 as well as residues Tyr318-Pro319-Ala320-Lys321-Ser323 are unable to promote retrotransposition in vivo (69). In all panels, two images are shown that differ by a rotation of 180° about the vertical axis.
Interestingly, ORF1 C-1/3 and ORF1 CTD bind RNA with a modest degree of cooperativity, as evidenced by Hill coefficients of ~1.7 (Fig. 4). This may be a result of protein-protein interactions that facilitate binding and could represent residual interactions between the CTDs that are also present in the intact homotrimer. It is also possible that binding of ORF1p promotes additional protein binding by altering the structure of RNA, which would be compatible with the demonstrated nucleic acid chaperone activity of ORF1p (15). In the future, it will be important to explore this issue further by determining the stoichiometry of protein binding and its dependence on nucleotide sequence and length. NMR and binding studies that progressively shorten the 76-mer RNA molecule that we have shown binds ORF1 C-1/3 with high affinity may also lead to the identification of RNA subdomains whose protein complexes are better suited for structural analysis.
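To illustrate what a Hill coefficient of about 1.7 implies, the sketch below compares fractional RNA saturation for noncooperative (n = 1) and modestly cooperative (n = 1.7) binding. It assumes the standard Hill equation, θ = [P]^n / (K^n + [P]^n); the fitting function actually used for Fig. 4 is not stated in this section, and the half-saturation concentration of 100 nM is an arbitrary illustrative value, not a measured one.

```python
def hill_fraction_bound(protein_conc: float, k_half: float, n_hill: float) -> float:
    """Fraction of RNA bound according to the Hill equation."""
    return protein_conc**n_hill / (k_half**n_hill + protein_conc**n_hill)


K_HALF_NM = 100.0  # hypothetical half-saturation concentration (nM), illustration only

for conc in (10.0, 100.0, 1000.0):
    noncooperative = hill_fraction_bound(conc, K_HALF_NM, 1.0)
    cooperative = hill_fraction_bound(conc, K_HALF_NM, 1.7)
    print(f"[P] = {conc:6.0f} nM   n = 1.0: {noncooperative:.3f}   "
          f"n = 1.7: {cooperative:.3f}")
```

Both curves pass through half-saturation at the same protein concentration, but the cooperative curve lies below the noncooperative one at lower concentrations and above it at higher concentrations, giving the steeper transition that a Hill coefficient greater than 1 describes.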
NMR studies indicate that the CTD is a structurally unique binding module that may interact with single-stranded oligonucleotides like type-1 KH domains. Although they are unrelated at the primary sequence level and have a distinct arrangement of secondary structural elements, the CTD and type-1 KH domains adopt superficially similar structures in which three α helices are packed against a three-stranded antiparallel β sheet (Fig. 7b). Interestingly, both proteins may preferentially bind single-stranded RNA using a structurally similar cleft. In the structure of the protein-RNA complex containing the Nova type-1 KH domain, a single-stranded RNA molecule binds to a cleft formed by two tandemly arranged α helices (α1 and α2) and β strand β2 (Fig. 7b, right). The chemical shift mapping data of ORF1p are compatible with a similar general mode of binding in which the RNA contacts the cleft formed by residues in helix α1, strand β2, and the β1-β2 loops. Attractively, the location of this binding surface is consistent with the demonstrated importance of the linker, which immediately precedes helix α1 in the primary sequence and is therefore well positioned to make additional contacts to RNA. However, the chemical shift mapping data should be interpreted with caution, since only the addition of the 32-mer hairpin molecule caused interpretable changes in the NMR spectrum. These changes are minor and presumably result from weak interactions with nucleotides in the hairpin portion of the RNA. In contrast, the 43-mer appears to bind the protein with high affinity and exhibits uniform line broadening, suggesting that multiple proteins bind and/or that a well defined complex is not formed. It is unclear why minimal spectral changes are observed when the 9-mer is added to ORF1 C-1/3, but it is possible that it is not long enough to effectively bind the protein.
The structure of the CTD and its postulated RNA binding mode are compatible with mutagenesis studies of the mouse and human ORF1 proteins. This work has shown that Arg297 and Arg298 (L1 spa numbering) play an important role in retrotransposition (14-16). Nonconservative substitution of the diarginine peptide with Ala-Ala in the mouse or human proteins impairs RNA binding and ribonucleoprotein particle formation as well as in vivo retrotransposition. These residues also play an important role in the nucleic acid chaperone activity of ORF1p, since mutants containing more conservative alterations (Lys-Arg or Lys-Lys) retain RNA binding affinity but, unlike the wild-type protein, have lost their nucleic acid chaperone activity and are unable to support retrotransposition (14-16). In the structure, the side chains of Arg297 and Arg298 are located at the end of helix α1 and are well positioned to interact with nucleic acids, since they are proximal to the interaction surface identified by NMR (Fig. 6b). The structure of the CTD also sheds light on the effects of more radical mutations in the CTD. Mutants of human ORF1p that replace residues Tyr318-Pro319-Ala320-Lys321-Leu322-Ser323 (L1 spa numbering) with Ala-Ala-Ala-Ala-Leu-Ala are inactive in vivo and incapable of forming ribonucleoprotein complexes. These residues are completely conserved in mouse ORF1 CTD and are located in strand β2 and the loop that connects strands β1 and β2. It seems likely that replacement of Pro319 with alanine will disrupt the structure of the CTD by destabilizing the turn that connects these β strands. In the closely related human protein, a four-amino acid substitution in the linker preceding the CTD impairs retrotransposition and ribonucleoprotein particle formation (Arg271-Glu272-Lys273-Gly274 to Ala-Ala-Ala-Ala) (14). Based on our results, it seems likely that at least part of the deleterious effect of this mutation results from a reduction in RNA binding affinity.

... (69). The proteins adopt generally similar structures, although they are unrelated at the primary sequence level and their secondary structural elements are arranged differently (type-1 KH domains have a β1-α1-α2-β2-β3-α3 configuration as compared with ORF1 CTD with an α1-β1-β2-β3-α2-α3 arrangement). In the KH domain-RNA complex, a cleft formed by the sheet and the α-helix interacts with RNA (right). A similar cleft exists in the structure of ORF1 CTD, and the chemical shifts of residues within it are significantly perturbed when RNA is added (left, residues most affected by the addition of RNA are colored yellow). The protein and the RNA in the KH domain-RNA complex are colored pink and gray, respectively (Protein Data Bank code 1EC6).
In conclusion, we have shown that the most highly conserved region in vertebrate ORF1 proteins consists of two domains that are connected by a protease-sensitive linker. Both the linker and the CTD are required for high affinity binding to RNA. The CTD adopts a unique three-dimensional structure that, based on chemical shift mapping experiments, interacts with RNA via a cleft formed by the N-terminal α-helix and the three-stranded β sheet. Future studies will need to determine whether the M domain contributes to binding, whether the linker is structurally disordered in the full-length protein, and how these motifs work in concert in the full-length trimeric protein to interact with and chaperone nucleic acids.