Rapid Protein Sequencing by Tandem Mass Spectrometry and cDNA Cloning of p20-CGGBP

The autonomous expansion of the unstable 5′-d(CGG) n -3′ repeat in the 5′-untranslated region of the human FMR1 gene leads to the fragile X syndrome, one of the most frequent causes of mental retardation in human males. We have recently described the isolation of a protein p20-CGGBP that binds sequence-specifically to the double-stranded trinucleotide repeat 5′-d(CGG)-3′ (Deissler, H., Behn-Krappa, A., and Doerfler, W. (1996) J. Biol. Chem. 271, 4327–4334). We demonstrate now that the p20-CGGBP can also bind to an interrupted repeat sequence. Peptide sequence tags of p20-CGGBP obtained by nanoelectrospray mass spectrometry were screened against an expressed sequence tag data base, retrieving a clone that contained the full-length coding sequence for p20-CGGBP. A bacterially expressed fusion protein p20-CGGBP-6xHis exhibits a binding pattern to the double-stranded 5′-d(CGG) n -3′ repeat similar to that of the authentic p20-CGGBP. This novel protein lacks any overall homology to other known proteins but carries a putative nuclear localization signal. The p20-CGGBP gene is conserved among mammals but shows no homology to non-vertebrate species. The gene encoding the sequence for the new protein has been mapped to human chromosome 3.

DNA sequences containing trinucleotide repeats of the general sequence (CXG) n or (GAA) n appear to be genetically unstable. Amplifications of such repeats were found to be associated with several human diseases, e.g. fragile X (FraX) syndrome, Huntington's disease, myotonic dystrophy, or Friedreich's ataxia (for reviews see Refs. 1 and 2). The number of repeats usually correlates with the severity of the disease.
The FraX syndrome is associated with an amplification of the trinucleotide sequence 5′-d(CGG) n -3′ in the 5′-UTR of the FMR1 (fragile X mental retardation) gene and extensive 5′-d(CG)-3′ methylation in the repeat as well as in adjacent regions (3-7). As a consequence, the expression of the encoded protein FMRP is reduced or abolished. Fragile sites similar to this locus have been identified on several of the human chromosomes, and all sites characterized so far have shown an expansion of 5′-d(CGG) n -3′ repeats, suggesting an important structural function of this DNA sequence. Several studies have demonstrated that single-stranded oligodeoxyribonucleotides of the general sequence 5′-d(CXG) n -3′ with X = A or G can form stable secondary structures in vitro (8-14). The mechanism of the amplification of unstable DNA sequences is not understood. Expansion might involve DNA slippage or unequal crossing over, but these models do not explain the observed phenomenon sufficiently (for review see Ref. 15). The function of the repeat itself is also unknown. A 5′-d(CGG) 4 -3′ fragment in the rRNA gene promoter (16) and a 5′-d(CTG) 25 -3′ element located in the promoter of the mouse growth inhibitory factor gene (17) are thought to function as regulatory elements.
Genetic instability of long 5′-d(CXG) n -3′ tracts with deletion products of triplet repeat sequences has also been observed in Escherichia coli (18). The instability of long 5′-d(CGG) n -3′ repeats in E. coli has been shown to depend on host cell genotype, length, polymorphism, and on the orientation of the triplet repeat relative to the replication origin (19). Expansion products of triplet repeat sequences have also been detected (20).
We are interested in the characterization of human cellular proteins binding to 5′-d(CGG) n -3′ repeats in a sequence-specific manner. The study of these proteins may improve our understanding of triplet repeat amplifications and their function. Double-stranded as well as single-stranded simple trinucleotide repeat sequences are target sequences for sequence-specific DNA binding proteins (21-23). We have recently isolated the 20-kDa protein p20-CGGBP (5′-d(CGG) n -3′ binding protein) from HeLa nuclear extracts by DNA affinity chromatography. This protein binds sequence-specifically to the unstable, double-stranded 5′-d(CGG) n -3′ trinucleotide repeat (24). The protein requires more than eight repeats for proper binding. Base pair exchanges in the 1st base pair of every second triplet repeat abolish binding. None of the other known unstable trinucleotide repeats or either single strand of the trinucleotide repeat 5′-d(CGG) n -3′ can serve as a target sequence for this protein. The binding of p20-CGGBP is also severely inhibited by complete or partial cytosine-specific DNA methylation of the binding motif. Interestingly, a p20-CGGBP activity has been found in several human and mammalian cell lines and in human primary lymphocytes.
Here we describe the rapid cloning of the full-length cDNA for p20-CGGBP by a novel strategy. Peptide sequence tags obtained by nanoelectrospray mass spectrometry (25) of the purified protein have been used to screen an EST data base. The encoded protein has been expressed in bacteria, and its binding specificity has been investigated in detail.

EXPERIMENTAL PROCEDURES
Cell Lines-Human HeLa cells were purchased from the Gesellschaft für Biotechnologische Forschung, Braunschweig, Germany.
Purification of p20-CGGBP from HeLa Nuclear Extracts-The 5′-d(CGG) n -3′ binding protein p20-CGGBP was purified from HeLa nuclear extracts by anion exchange chromatography and DNA affinity chromatography as described (24) with slight modifications. A mixture of double-stranded (ds) and single-stranded calf thymus DNA-cellulose (Pharmacia Biotech Inc.) was used as an unspecific DNA matrix. As a specific DNA matrix 5′-d(CGG) 17 -3′ds-Sepharose was used. Each fraction was tested for its ability to bind sequence-specifically to the oligodeoxyribonucleotide 5′-d(CGG) 17 -3′ds as described. The sequences of the various oligodeoxyribonucleotides used in this study are summarized in Table I. Protein fractions prepared for peptide sequencing were eluted from 5′-d(CGG) 17 -3′ds-Sepharose at 65°C for 10 min in 5 mM Tris-HCl, 1% SDS, 2 mM dithiothreitol, 10 µg/ml aprotinin, pH 6.9, separated by SDS-polyacrylamide gel electrophoresis, and transferred to a polyvinylidene difluoride (PVDF) membrane (Problot, Applied Biosystems) in 10 mM CAPS, 10% methanol, pH 11. After staining of the membrane in Coomassie Blue R-250, the appropriate band was excised and stored at −20°C (26).
Southwestern Blotting-Protein fractions isolated from HeLa nuclear extracts and enriched for the 5′-d(CGG) 17 -3′ds binding activity as well as fractions without such activity were separated by SDS-polyacrylamide gel electrophoresis. After transfer to a PVDF membrane (Fluorotrans, Pall, Dreieich, Germany) in 195 mM glycine, 25 mM Tris, pH 8.3, the membrane was washed in buffer SW (10 mM K-HEPES, 100 mM KCl, 0.5 mM dithiothreitol, 0.1 mM EDTA, 0.2 mM spermine, 10% glycerol, protease inhibitors, pH 7.9). All subsequent steps were carried out at 4°C unless stated otherwise. Proteins were denatured by washing the membrane twice for 10 min in buffer SWA (10 mM K-HEPES, 100 mM KCl, 0.5 mM dithiothreitol, 0.1 mM EDTA, 10% glycerol, 6 M guanidinium hydrochloride, pH 7.9) and were renatured by removing half of the volume of buffer SWA, replacing it with buffer SW, and incubating the membrane for 10 min. This step was repeated four times and followed by a 5-min incubation in buffer SW. The membrane was blocked in buffer SW containing 5% low fat milk powder, 10 µg/ml salmon sperm DNA, and 1 µg/ml poly(dA·dT) for 60 min. After washing the membrane in buffer SW with 0.5% low fat milk powder, the membrane was incubated for 2 h at 18°C in buffer SW with 60 fmol of 32P-labeled oligodeoxyribonucleotide (1.7 × 10⁶ cpm) and 4 µg/ml poly(dA·dT). The membrane was finally washed in buffer SW for 10 min, dried, and exposed for 2-4 days on Kodak XAR films.
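The stepwise renaturation described above amounts to a serial twofold dilution of the denaturant. The following minimal sketch tabulates the guanidinium concentration after each half-volume exchange. It assumes ideal mixing, a constant total volume, and a total of five exchanges (the first exchange plus the four repeats); that count is our reading of the protocol, and the function name is purely illustrative.

```python
def guanidinium_after_exchanges(initial_molar: float, n_exchanges: int) -> list[float]:
    """Return the denaturant concentration (M) after each half-volume exchange.

    Assumes that half of the buffer is removed and replaced by
    guanidinium-free buffer each time, with complete mixing.
    """
    concentrations = []
    current = initial_molar
    for _ in range(n_exchanges):
        current /= 2.0  # each exchange halves the remaining denaturant
        concentrations.append(current)
    return concentrations


# Buffer SWA starts at 6 M guanidinium hydrochloride.
# Five exchanges is an assumption (one initial exchange plus four repeats).
steps = guanidinium_after_exchanges(6.0, 5)
for number, molar in enumerate(steps, start=1):
    print(f"after exchange {number}: {molar:.4f} M")
```

The final 5-min incubation in buffer SW is not modeled here; it would remove the residual denaturant altogether.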
Determination of the Amino Acid Sequence of p20-CGGBP and Identification of an EST Clone Coding for p20-CGGBP-The PVDF membrane carrying the blotted protein was destained in water. The protein was reduced, alkylated, and digested overnight with trypsin (12.5 ng/µl; Boehringer Mannheim, sequencing grade) at 37°C in a 50 mM ammonium bicarbonate, 5 mM CaCl 2 buffer (25,27). Peptides were extracted from the membrane in 10 µl of aqueous 5% formic acid (Merck, Darmstadt, Germany) and subsequently in 10 µl of 1:1 acetonitrile:water, 5% formic acid (two changes). The resulting peptide mixture was dried and stored at −20°C. The dried sample was reconstituted in 1 µl of 70% formic acid, rapidly diluted with 9 µl of water to avoid formylation, and was concentrated and desalted using 50 nl of Poros R2 material (PerSeptive Biosystems, Framingham, MA) prepared in a glass capillary as described previously (25,27). The peptide mixture was eluted directly into a gold-coated glass capillary by passing 2 volumes of 0.4 µl of 50% methanol, 5% formic acid over the Poros column. The gold-coated glass capillary was mounted in the nanoelectrospray ion source (28,29) on the mass spectrometer for peptide tandem mass spectrometry (MS) sequencing.

FIG. 1. … (CGG) 17 ds or (CAG) 17 ds as 32P-labeled binding probes. Fraction III was isolated after the first step of specific DNA-affinity chromatography, and fraction IV was isolated after the second step. Both fractions showed high binding activity to the oligodeoxyribonucleotide (CGG) 17 ds in EMSAs (24). A specific band at 20 kDa was detected in fractions III and IV with the oligodeoxyribonucleotide (CGG) 17 ds as binding probe. The signal at 100 kDa was unspecific.

FIG. 2. Strategy for the determination of the amino acid sequence of p20-CGGBP. After gel separation and digestion of the protein, the peptides in the unseparated mixture were sequenced by nanoelectrospray tandem mass spectrometry (25). Peptide sequence tags consisting of short amino acid sequences with their mass locations in the peptides were used to search EST data bases in an error-tolerant way to compensate for possible sequence mistakes in the EST entries. When a clone was identified, it was obtained from a generally available repository (35), and the full insert length was sequenced. The higher sequencing accuracy and the longer sequence then allowed more of the peptide data to be matched against the DNA sequence. This approach further verified the identity of the clone. When the clone contained the full-length coding sequence, it could be immediately subcloned for expression. Alternatively, different EST sequences and clones could be assembled for a full-length clone, or the matching EST itself could be used as a probe for further library screening.
Tandem MS investigations were performed on an API III triple quadrupole mass spectrometer (PE Sciex, Ontario, Canada) equipped with an updated collision cell (30) and a nanoelectrospray ion source (28,29). To detect peptides that were below the chemical noise level, the parent ion scan technique, a filtering method for triple quadrupole mass spectrometers, was used on the isoleucine/leucine immonium ion (31) (see Fig. 3). Only peptides generating a predefined fragment ion upon fragmentation, in this instance the 86-Da immonium ion of isoleucine/leucine, were detected. By applying this filtering procedure, even peptides generating signals below the noise level in the spectrum could still be recognized.
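The logic of this filter can be illustrated with a short sketch. It computes the m/z of the isoleucine/leucine immonium ion from standard monoisotopic masses and predicts which of the three peptides reported in this study would be expected to appear in the parent ion scan. As a simplifying assumption, any peptide containing I or L is taken to yield the immonium ion; real fragmentation efficiencies vary. The function names are illustrative only.

```python
# Standard monoisotopic masses in Da.
LEU_ILE_RESIDUE = 113.08406   # residue mass of leucine or isoleucine
CARBON_MONOXIDE = 27.99491    # CO lost when the immonium ion is formed
PROTON = 1.00728


def immonium_mz(residue_mass: float) -> float:
    """m/z of the singly charged immonium ion: residue - CO + proton."""
    return residue_mass - CARBON_MONOXIDE + PROTON


def expected_in_parent_scan(peptide: str) -> bool:
    """Simplified rule: a peptide appears if it contains I or L."""
    return any(residue in "IL" for residue in peptide)


# Peptide sequences as reported under Results.
peptides = {"a": "FVVTAPPAR", "b": "VSVIQDFVK", "c": "TALYVPLD"}

print(f"Ile/Leu immonium ion: m/z {immonium_mz(LEU_ILE_RESIDUE):.4f}")
for name, sequence in peptides.items():
    print(name, sequence, expected_in_parent_scan(sequence))
```

Under this rule peptides b and c pass the filter, whereas peptide a, which lacks isoleucine and leucine, does not.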
Data base searches were performed using the sequence tag approach and the PeptideSearch program (32). Sequence tags carry information on part of a peptide sequence and typically comprise an internal sequence stretch of about three to four amino acids and its mass location in a peptide of known total mass. Frequently, this information can be gained from a tandem mass spectrum and facilitates the highly specific identification of the underlying complete peptide sequence during a data base search. A non-redundant data base currently containing more than 200,000 protein sequences or open reading frames was searched. Another data base was prepared from the available EST data base dbEST (33,34), which currently contains about 500,000 human sequence entries. The EST clone ID269133 (GenBank accession numbers N24697 and N36676) was obtained via the I.M.A.G.E. consortium (35), distributed by Research Genetics, Huntsville, AL, and resequenced on automated DNA sequencers.
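The principle of a sequence tag can be sketched as follows. This is not the PeptideSearch implementation; it is a simplified model in which a tag is represented as three regions: the summed residue mass N-terminal to the stretch, the stretch itself, and the summed residue mass C-terminal to it. The 0.5-Da tolerance, the function names, and the erroneous data base entry are hypothetical. Reporting the three regions separately shows how an error-tolerant search can still locate an entry in which only one region disagrees.

```python
# Monoisotopic residue masses in Da (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residue_sum(sequence: str) -> float:
    """Sum of residue masses (no water, no charge)."""
    return sum(MONO[residue] for residue in sequence)


def make_tag(peptide: str, start: int, length: int) -> tuple[float, str, float]:
    """Build a (mass before, stretch, mass after) tag from a known peptide."""
    stretch = peptide[start:start + length]
    return residue_sum(peptide[:start]), stretch, residue_sum(peptide[start + length:])


def best_match(peptide: str, tag: tuple[float, str, float], tol: float = 0.5):
    """Return (start, region1_ok, region2_ok, region3_ok) for the best window."""
    mass_before, stretch, mass_after = tag
    wanted = stretch.replace("I", "L")  # I and L have identical masses
    windows = []
    for start in range(len(peptide) - len(stretch) + 1):
        end = start + len(stretch)
        region1 = abs(residue_sum(peptide[:start]) - mass_before) <= tol
        region2 = peptide[start:end].replace("I", "L") == wanted
        region3 = abs(residue_sum(peptide[end:]) - mass_after) <= tol
        windows.append((start, region1, region2, region3))
    # Pick the window with the most agreeing regions (first one on ties).
    return max(windows, key=lambda window: sum(window[1:]))


true_peptide = "VSVIQDFVK"                      # peptide b as reported
mass_before, _, mass_after = make_tag(true_peptide, 2, 3)
tag = (mass_before, "VLQ", mass_after)          # stretch read with L for I
exact = best_match(true_peptide, tag)
flawed = best_match("VSVIQDFAK", tag)           # hypothetical entry with an error
print(exact, flawed)
```

In the second call regions 1 and 2 agree while region 3 does not, which is the pattern that points to an error in the C-terminal part of a data base entry.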
Expression of the Fusion Protein p20-CGGBP-6xHis in E. coli and Purification by Ni-Chelate Chromatography-A fragment p20-518 was amplified from the plasmid 269133 with the primers p20Bam5′ (5′-dCGCGGATCCGAGCGATTTGTAGTAACAGCA-3′) and p20Kpn3′ (5′-dGGGGTACCTCAACAATCTTGTGAGTTGAG-3′) using Pfu DNA polymerase (Stratagene). The fragment p20-518 contained a BamHI recognition site at the 5′-end, a site for KpnI at the 3′-end, and coded for amino acids (aa) 2-166 of p20-CGGBP. This fragment was cloned into the vector pQE40 (Qiagen, Hilden, Germany) that had been cut with BamHI and KpnI. Thereby, a histidine hexamer was introduced at the N terminus of the fusion protein (see below). The correct sequence of the fusion construct pQE40-518 was ascertained by DNA sequencing of 10 independent clones after transformation into E. coli M15[pREP4] (Qiagen). The vector pQE40 encoded the fusion protein 6xHis-dihydrofolate reductase (DHFR-6xHis). Its coding sequence was removed by cleavage with the above-mentioned enzymes.
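The layout of the two primers can be checked with a short sketch. It locates the BamHI (GGATCC) and KpnI (GGTACC) recognition sites and translates the gene-specific portion that follows each site, using the standard genetic code. The assumption that the reading frame starts directly after each restriction site is ours, and the helper names are illustrative.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def revcomp(dna: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from the first base; '*' marks a stop."""
    return "".join(CODON[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def after_site(primer: str, site: str) -> tuple[int, str]:
    """Return the site position and the primer sequence downstream of it."""
    index = primer.find(site)
    return index, primer[index + len(site):]


P20_BAM_5 = "CGCGGATCCGAGCGATTTGTAGTAACAGCA"   # primer p20Bam5'
P20_KPN_3 = "GGGGTACCTCAACAATCTTGTGAGTTGAG"    # primer p20Kpn3'

bam_index, coding_start = after_site(P20_BAM_5, "GGATCC")
kpn_index, antisense_end = after_site(P20_KPN_3, "GGTACC")

n_terminal = translate(coding_start)            # sense strand, aa 2 onward
c_terminal = translate(revcomp(antisense_end))  # 3' primer is antisense

print(bam_index, n_terminal)
print(kpn_index, c_terminal)
```

Read in this frame, the 5′ primer encodes residues that run into the sequence of peptide a (FVVTAPPAR), and the reverse complement of the 3′ primer ends in a stop codon.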
FIG. 3. Determination of the sequence of tryptic peptides from p20-CGGBP with nanoelectrospray tandem mass spectrometry. After extraction from the PVDF membrane, desalting, and concentration over Poros R2 material, one peptide (peptide a) could be detected in the mass spectrum of the sample. Two additional peptides (peptides b and c) were identified by using the parent ion scan on 86, the immonium ion of isoleucine or leucine (2nd panel from top). This scanning technique was used to detect peptides below the chemical noise level in the original spectrum. All peptides that fragment to produce the mass 86 Da could be detected in this scan and thus be distinguished from chemical background ions. Peptide a did not yield a signal in the parent ion scan because it did not contain isoleucine or leucine (see below). A complete y ion series could be retrieved from the tandem MS spectrum of peptide a (39), resulting in the amino acid sequence FVVTAPPAR. The amino acid sequences of peptides a, b, and c (printed in bold) obtained by tandem mass spectrometry and their locations in the total protein are shown in the bottom part of this figure. A nuclear localization signal (aa 69-84) was detected by computer-aided sequence analyses.

The E. coli strains containing the appropriate constructs were grown in liquid culture at 37°C to an A600 nm of 0.7. Synthesis of the fusion protein was induced by adjusting the medium to 1.5 mM isopropyl-1-thio-β-D-galactopyranoside (MBI Fermentas). After incubation for another 2 h, bacteria were pelleted and lysed for 30 min on ice in buffer SB (50 mM sodium phosphate, 10 mM Tris-HCl, 300 mM NaCl, 1 mM MgCl 2 , 0.1% Tween 20, 20% glycerol, 8 µg/ml aprotinin, pH 7.85) in the presence of 1 mg/ml lysozyme (Calbiochem). The lysate was sonicated on ice for 1 min at 40 watts (Sonifier B12, Branson Sonic Power Company, Danbury, CT). The cytoplasmic extract was collected after centrifugation (10,000 × g, 4°C, 10 min), frozen in liquid nitrogen, and stored at −80°C.
The fusion protein p20-CGGBP-6xHis and the control protein DHFR-6xHis were purified by Ni2+-chelate affinity chromatography from bacteria expressing the constructs pQE40-518 and pQE40, respectively. Proteins from cytoplasmic extracts were bound to Ni2+-nitrilotriacetic acid-agarose (Qiagen) at 4°C for 3 h and, subsequently, at room temperature for 30 min. All following steps were carried out at 4°C. The material was washed with buffer SB and SB100 (same as SB but containing 100 mM imidazole, pH 6.5), and the fusion proteins were eluted with buffer SB500 (same as SB but containing 500 mM imidazole, pH 6.0). Since imidazole inhibited the DNA-binding activity, the eluted fusion proteins were equilibrated in 20 mM K-HEPES, 100 mM NaCl, 1 mM MgCl 2 , 0.15 mM spermine, 0.1 mM EDTA, 0.5 mM dithiothreitol, 20% glycerol, 0.01% Tween 20, and protease inhibitors, pH 7.9. Even in the presence of protective carrier proteins, the purified proteins were stable at −80°C for only 2 weeks. The purity of the fusion proteins was followed by electrophoresis on SDS-polyacrylamide gels. The DNA-binding specificities of the purified fusion proteins and of the bacterial lysates were tested by electrophoretic mobility shift assays (EMSA).
Electrophoretic Mobility Shift Assay-EMSAs were carried out essentially as described (24). Oligodeoxyribonucleotides (for nomenclature and sequences see Table I) were hybridized in a thermal cycler and 5′-end-labeled to a specific activity of 15,000 cpm/fmol by T4 polynucleotide kinase. When the binding activity of the bacterially expressed fusion protein p20-CGGBP-6xHis was determined, 10 µg of bovine serum albumin and/or 0.15 mM spermine were added per assay to increase the stability of the protein.
Isolation of RNA, Northern Blot, and Southern Blot Hybridizations-Total cellular RNA was isolated from cell lines according to standard procedures. Electrophoresis was carried out under denaturing conditions (36). The RNA was transferred to a nylon membrane (Qiabrane, Qiagen) and hybridized (37) with 32P-labeled probe p20-701. A dot blot membrane carrying standardized amounts of mRNA isolated from a number of human tissues (human master mRNA blot, CLONTECH, Heidelberg, Germany) was also hybridized with the p20-701 probe. A somatic cell hybrid panel (Oncor, Inc., Heidelberg, Germany) containing 15 µg of PstI-cleaved genomic DNA from 23 different rodent cell lines each carrying one human chromosome was hybridized with probe p20-779. The DNA probe p20-779 contained the complete 779-bp insert and was isolated from the EST clone ID269133 after EcoRI and NotI cleavage. The DNA fragment p20-701 lacked the poly(A)-tail of the insert and was isolated after BfaI cleavage of probe p20-779. Hybridization probes were 32P-labeled by randomly primed tagging (38).

FIG. 5. … DNA isolated from rodent cell lines carrying DNA of the indicated human chromosomes was hybridized to the 32P-labeled p20-CGGBP cDNA (probe p20-779), and a human-specific signal was detected on chromosome 3 (see arrowhead). Also note the hybridization to genomic mouse and hamster DNAs.

FIG. 6. Recombinant p20-CGGBP-6xHis binds specifically to the double-stranded oligodeoxyribonucleotide 5′-d(CGG) 17 -3′-ds. A, p20-CGGBP-6xHis and the control protein DHFR-6xHis were purified by Ni2+-chelate chromatography from bacteria expressing the appropriate constructs. B, purified recombinant p20-CGGBP-6xHis or the authentic p20-CGGBP were bound to the oligodeoxyribonucleotide (CGG) 17 ds in the presence or absence of several oligodeoxyribonucleotides as competitors (see Table I). A 100-fold molar excess of the competitor DNAs in comparison to the binding probe was used in these experiments. The specific p20-CGGBP-5′-d(CGG) 17 -3′-ds-complex cI was competed only by the homologous oligodeoxyribonucleotide. Complex cI showed identical electrophoretic mobilities with the recombinant or the authentic protein. C, lysates from a strain expressing DHFR-6xHis as well as purified DHFR-6xHis lacked any 5′-d(CGG) 17 -3′-ds binding activity.

RESULTS
p20-CGGBP Binds Directly to Its Target Sequence-Protein fractions isolated from HeLa nuclear extracts and highly enriched for the 5′-d(CGG) 17 -3′-ds-binding activity were analyzed by Southwestern blotting for their binding activity to the target sequence of the p20-CGGBP protein (Fig. 1). Separated and blotted proteins were incubated either with a 32P-labeled oligodeoxyribonucleotide that carried 17 5′-d(CGG)-3′ repeats [(CGG) 17 ds] or with the control oligodeoxyribonucleotide (CAG) 17 ds that contained 17 5′-d(CAG)-3′ repeats. A band of about 20 kDa was detected in all fractions highly enriched for p20-CGGBP with 32P-labeled (CGG) 17 ds as a binding probe (Fig. 1, lanes designated fractions III and IV). This band was not present when (CAG) 17 ds was used as a probe. It also failed to be detected in a fraction deficient in 5′-d(CGG) 17 -3′-ds-binding activity. An additional band of 100 kDa was observed in some fractions as well as in crude nuclear extracts (Fig. 1, and data not shown) when either binding probe was used. This activity was likely due to unspecific DNA-protein interactions since this binding was also detected with HeLa nuclear extracts when a variety of single- or double-stranded oligodeoxyribonucleotides were used as binding probes (data not shown). These results confirmed that p20-CGGBP could bind directly to its target sequence without the involvement of additional cellular proteins.
Determination of the Amino Acid Sequence of the p20-CGGBP Protein: EST Clone ID269133 Contains the Complete Coding Sequence for p20-CGGBP-For the determination of the amino acid sequence of several internal peptides of p20-CGGBP, about 400 ng of p20-CGGBP were isolated from 5 × 10⁹ HeLa cells, separated from accompanying proteins by SDS-polyacrylamide gel electrophoresis, and transferred to a PVDF membrane. The experimental design for the determination of the amino acid sequence is outlined in Fig. 2. Extraction of the peptides from the membrane resulted in limited recovery, as seen by the poor signal-to-noise ratio in Fig. 3, upper panel. Only one peptide was apparent; its fragmentation spectrum is shown in Fig. 3, lower panel. To detect peptide ion signals below the level of chemical noise, a parent ion scan for the immonium ion of isoleucine/leucine was performed. This scan revealed two more peptides that were subsequently fragmented (Fig. 3, 2nd panel from top). Tandem MS spectra of all three peptides were generated. Interpretation of these spectra resulted in one complete and two partial sequences. The sequence of peptide "a" was determined to be FVVTAPPAR. For peptide "b," interpretation of the high m/z range resulted in the N-terminal sequence VSV(I/L) or a peptide sequence tag (17 Da) VSV(I/L) (634.4 Da). A peptide sequence tag could also be assigned to peptide "c": (172.2 Da) (I/L)YV (601.80 Da). A search of a large non-redundant protein data base containing more than 200,000 entries revealed no match to the sequence FVVTAPPAR or the sequence tags, indicating that the protein was unknown. A search in the expressed sequence tag data base (dbEST) with PeptideSearch, however, did retrieve a matching EST (clone ID269133). This clone had been sequenced from the 3′-end (GenBank accession no. N24697) and the 5′-end (GenBank accession no. N36676). Peptide a was found in the former and peptide b in the latter. The retrieved sequence for peptide b was VSVIQDFVK, which matched the obtained tandem mass spectrum for this peptide. Peptide c did not match directly, but regions 1 and 2 of the peptide sequence tag were found to match, indicating an error in the DNA sequence coding for the C-terminal part of the peptide (32). Resequencing of the clone indeed revealed a different DNA sequence. After its correction the peptide sequence was TALYVPLD, which also led to complete agreement with the tandem mass spectrum of peptide c. The derived amino acid sequence of the 501-bp open reading frame, obtained after double-stranded resequencing of the EST clone, thus contained all three peptides covering 28 amino acids (Fig. 3, bottom).
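Because EST entries are nucleotide sequences of unknown reading frame and orientation, a peptide query has to be compared with translations in all six frames. The sketch below illustrates this step only; it is not the PeptideSearch algorithm and performs exact matching without any error tolerance. The toy sequence consists of the gene-specific part of the 5′ primer listed under Experimental Procedures, preceded by two hypothetical bases to shift the frame. All function names are illustrative.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def revcomp(dna: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(CODON.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def find_peptide(dna: str, peptide: str) -> list[tuple[str, int, int]]:
    """Return (strand, frame, residue index) for every exact occurrence.

    I and L are treated as equivalent because tandem MS cannot
    distinguish them by mass.
    """
    query = peptide.replace("I", "L")
    hits = []
    for strand, sequence in (("+", dna), ("-", revcomp(dna))):
        for frame in range(3):
            protein = translate(sequence[frame:]).replace("I", "L")
            position = protein.find(query)
            while position != -1:
                hits.append((strand, frame, position))
                position = protein.find(query, position + 1)
    return hits


# Two hypothetical leading bases plus the gene-specific primer portion.
toy_est = "CC" + "GAGCGATTTGTAGTAACAGCA"
forward_hits = find_peptide(toy_est, "FVVTA")
reverse_hits = find_peptide(revcomp(toy_est), "FVVTA")
print(forward_hits, reverse_hits)
```

The same query is found regardless of the orientation in which the clone was sequenced, which matters here because the clone had been read from both the 3′- and the 5′-end.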
The analysis of the cDNA sequence revealed that the nucleotide sequence context of a putative start codon ATG (AGGATGG) at nucleotide position 197 of the 779-bp insert was in accordance with the Kozak rules (40) for functional translational start sites in eucaryotes. In addition, several very short ORFs were detected 5′ of the putative start codon. The stop codon at nucleotide 698 was followed by a polyadenylation signal at nucleotide 729 and a poly(A)-tail starting at nucleotide 759. The 501-bp ORF encoded a 166-aa long protein with a molecular mass of about 19 kDa, which is in accordance with the apparent molecular mass for p20-CGGBP as determined by SDS-polyacrylamide gel electrophoresis. The results indicated that the EST clone ID269133 contained the complete coding sequence of p20-CGGBP.
Properties of the p20-CGGBP cDNA-Northern blot analyses of RNA isolated from human HeLa cells detected only one transcript with an apparent size of 1.2 kilobases with p20-CGGBP cDNA (probe p20-701) as hybridization probe (data not shown). This finding suggested the presence of a large 5′-UTR of p20-CGGBP-specific RNA. Low level expression of this RNA was found in a number of human tissues by dot blot RNA hybridization. In contrast to very low level expression in whole fetal brain, expression was high in adult cerebellum and cerebral cortex but low in other regions of the adult brain. RNA isolated from human placenta, thymus, and lymph nodes also contained relatively high levels of a p20-CGGBP-specific RNA. These results agree with the observation that several EST clones with extensive homologies to the clone ID269133 (35) had been isolated from different human tissues as well as from mouse and rat. Moreover, DNA from several mammalian sources yielded specific signals with the p20-CGGBP cDNA probe, whereas DNA from non-mammalian species, except chicken DNA, did not (Fig. 4). This high degree of conservation of the p20-CGGBP sequence among mammals and its expression in a variety of human tissues are consistent with our observation that an activity binding to the double-stranded 5′-d(CGG) n -3′ trinucleotide repeat was detected in a large number of human and mammalian cell lines (24). However, a yeast sequence homologous to the EST clone ID269133 was not found. Analyses of the derived amino acid sequence did not reveal any overall homology to known proteins. Computer-aided sequence analyses detected a putative nuclear localization signal between aa 69 and 84 (Fig. 3, bottom panel).
The results presented are in accordance with the finding that protein-DNA complex formation between p20-CGGBP and its target sequence cannot be competed by a set of oligodeoxyribonucleotides carrying consensus binding sequences for known DNA-binding proteins. We, therefore, conclude that p20-CGGBP is a novel DNA-binding protein.
Chromosomal Localization of the Gene Encoding p20-CGGBP-A somatic cell hybrid panel with genomic DNA from mouse or hamster hybrid cell lines, each carrying one specific human chromosome, was hybridized to the complete insert of EST clone ID269133 (probe p20-779). A strong human DNA-specific signal was detected with DNA from human chromosome 3 (arrowhead in Fig. 5). The intense cross-hybridization to the corresponding rodent genes (mouse and hamster) of p20-CGGBP was consistent with its high degree of conservation in mammals.
The Fusion Protein p20-CGGBP-6xHis Binds Sequence-specifically to the Double-stranded Trinucleotide Repeat 5′-d(CGG) n -3′-The binding specificity of the protein encoded by the 501-bp ORF of EST clone ID269133 was further characterized. The protein was expressed in E. coli, which was chosen as a host because an activity similar to p20-CGGBP did not occur in this procaryote (see also Fig. 6C). A histidine hexamer was introduced at the N terminus of the coding sequence of p20-CGGBP starting with the second amino acid to avoid internal translational start sites. The expressed fusion protein p20-CGGBP-6xHis showed an apparent molecular mass of 20 kDa upon electrophoresis in SDS-polyacrylamide gels as expected (Fig. 6A). Purification of the recombinant protein by Ni2+-chelate affinity chromatography was feasible only under native conditions and resulted in pure p20-CGGBP-6xHis (Fig. 6A). As a control, the mammalian enzyme dihydrofolate reductase (DHFR-6xHis) was also expressed and purified from bacteria under similar experimental conditions (Fig. 6A).
Bacterial lysates were prepared from bacteria induced for the expression of p20-CGGBP-6xHis or for DHFR-6xHis as well as from uninduced bacteria. The lysates were tested for the presence of proteins capable of binding to the oligodeoxyribonucleotide (CGG) 17 ds. Formation of the specific complex cI between (CGG) 17 ds and bacterial proteins was detected only in lysates prepared from bacteria expressing the fusion protein p20-CGGBP-6xHis (Fig. 6B) and not from those expressing DHFR-6xHis (Fig. 6C).
The specificity of the complex cI between the fusion protein p20-CGGBP-6xHis and the double-stranded target sequence was assessed with purified recombinant p20-CGGBP-6xHis that was bound to (CGG) 17 ds in the presence of oligodeoxyribonucleotides containing other trinucleotide or related sequence repeats (see Table I). The formation of the complex cI, which exhibited a mobility similar to that of the complex formed with p20-CGGBP isolated from HeLa nuclei, was competed only by the homologous oligodeoxyribonucleotide (CGG) 17 ds (Fig. 6B) and not by oligodeoxyribonucleotides carrying different trinucleotide repeats, such as (CAG) 17 ds, (CGA) 17 ds, (TGG) 17 ds, or (CAA) 17 ds (Fig. 6B). Furthermore, no competition was detected with oligodeoxyribonucleotides containing base pair exchanges at the first base of every other triplet repeat. In addition, the oligodeoxyribonucleotide FraxFds, which was isolated from the FraxF locus (41) and contained only eight 5′-d(CGG)-3′ repeats adjacent to other triplets, failed to compete for binding. Thus, more than eight 5′-d(CGG)-3′ repeats were apparently required for proper binding. These competition patterns were identical to those described previously for HeLa nuclear extracts or for p20-CGGBP purified from them (24). However, the binding affinity of purified recombinant p20-CGGBP-6xHis was lower than that of the authentic p20-CGGBP. Possibly the recombinant protein was partly inactivated during or after purification by metal chelate chromatography, since the binding activity of the raw bacterial lysate was quite high (data not shown). The addition of carrier protein (e.g. bovine serum albumin) or of polyamines (e.g. spermine) increased the stability of the purified protein slightly without influencing its binding specificity.

FIG. 7. … 17 . The specific complex cI was competed by all competitors used, except CGG8Ads. Similar results were obtained with purified p20-CGGBP. See also Table I.
These DNA binding studies confirmed that the ORF in the EST clone ID269133 encoded the full-length cDNA for p20-CGGBP. Furthermore, p20-CGGBP was found to be sufficient for the formation of complex cI. Thus, involvement of other cellular proteins in the formation of complex cI was unlikely.
p20-CGGBP Also Binds to Interrupted Repeats and Tolerates an A/G Mismatch in Its Target Sequence-FMR1 promoter sequences carrying more than 40 repeats of the 5′-d(CGG)-3′ stretch appeared to be stable upon female transmission when these repeats were interrupted by 5′-d(AGG)-3′ triplets at every 7th to 10th position (42-44). The loss of these interrupting 5′-d(AGG)-3′ triplets at the 3′-end of a longer 5′-d(CGG)-3′ trinucleotide repeat was implicated as a possible first step in the expansion.
Binding of purified p20-CGGBP or crude nuclear extract to the target oligodeoxyribonucleotide (CGG) 17 ds was competed in the presence of the oligodeoxyribonucleotide CGG10AGGds (see Table I). This oligodeoxyribonucleotide contained two 5′-d(AGG)-3′ interruptions and an uninterrupted stretch of nine 5′-d(CGG)-3′ repeats (Fig. 7). This result suggested that p20-CGGBP could indeed bind to an interrupted repeat.
Similarly, the ds oligodeoxyribonucleotides (CGG) 17 /CCG10CCT and CGG10AGG/(CCG) 17 (see Table I) also competed for the binding to (CGG) 17 ds (Fig. 7). These results indicate that either nine 5′-d(CGG)-3′ repeats are sufficient for binding or that one mismatch (A/G or C/T) in the binding sequence can be tolerated. Competition experiments with oligodeoxyribonucleotides carrying a mismatch in the first position of every second triplet (for details and sequences see Table II) showed that the presence of several A/G mismatches obviously did not inhibit binding of p20-CGGBP to its target sequence.

DISCUSSION
Mechanisms underlying instability and the physiological function of trinucleotide repeats in the human genome are not understood. We have, therefore, investigated proteins that interact sequence-specifically with the unstable triplet repeat 5′-d(CGG) n -3′ in the human FMR1 gene and have recently purified the 20-kDa protein p20-CGGBP from nuclear extracts of HeLa cells. This protein binds exclusively to 5′-d(CGG) n -3′ repeats and not to any other unstable triplet repeat sequence (24). The cloning of the cDNA encoding this protein has now become possible with a strategy involving amino acid sequence determination by tandem mass spectrometry and screening of EST data bases with the obtained sequence tags.
From partial mass spectrometric sequence data of three peptides of the protein, a cDNA fragment was identified in the EST data base using special software algorithms. Since the coding sequence of this protein is short, it is completely represented by the identified clone, obviating the need for any library screening and subcloning. To our knowledge this is the first reported example in which a combination of mass spectrometric sequencing, data base searching, and a generally available clone collection has made additional cloning efforts redundant. The rapid cloning of the cDNA of p20-CGGBP thus demonstrates the potential of mass spectrometric sequencing in conjunction with the screening of EST data bases. Once the protein had been purified on an SDS gel and the sequencing had started, it took only 2 weeks to have its clone available to express the protein and test for its putative function. The EST data base is thought to contain already more than half of all human genes (34). Since the data base is still growing rapidly and efforts are under way to obtain longer stretches of coding sequence, it is reasonable to expect that many human proteins will be amenable to the type of analysis described here.
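The principle of locating an EST from a peptide sequence tag can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions and not the software used in this study: a tag is modeled as a triple (summed residue mass N-terminal to the tag, tag sequence, summed residue mass C-terminal to the tag), each EST is translated in all six reading frames, and a hit requires both the sequence and the two flanking masses to agree. The names `six_frames`, `flank_matches`, and `find_tag`, the 0.5-Da tolerance, and the toy EST are all hypothetical; terminal-group masses, fragment ion types, and error-tolerant matching are deliberately left out.

```python
from typing import Dict, Iterator, List, Tuple

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Monoisotopic residue masses (Da). Ile is omitted: it is isobaric with Leu,
# so every I is converted to L before comparison.
MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

SequenceTag = Tuple[float, str, float]  # (N-terminal mass, sequence, C-terminal mass)


def reverse_complement(dna: str) -> str:
    return dna.upper().translate(COMPLEMENT)[::-1]


def six_frames(dna: str) -> Iterator[Tuple[str, str]]:
    """Yield (frame label, translation) for the three frames of each strand."""
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            protein = "".join(
                CODON_TABLE.get(seq[i:i + 3], "X")  # X marks ambiguous codons
                for i in range(offset, len(seq) - 2, 3)
            )
            yield f"{strand}{offset + 1}", protein


def flank_matches(residues: str, target: float, tol: float) -> bool:
    """True if the first n residues (n >= 0) sum to target within tol."""
    total = 0.0
    if abs(total - target) <= tol:
        return True
    for aa in residues:
        if aa not in MASS:  # stop codon or ambiguous residue ends the peptide
            return False
        total += MASS[aa]
        if abs(total - target) <= tol:
            return True
        if total > target + tol:
            return False
    return False


def find_tag(ests: Dict[str, str], tag: SequenceTag,
             tol: float = 0.5) -> List[Tuple[str, str, int]]:
    """Return (EST id, frame, residue index) for every placement of the tag."""
    mass_n, sequence, mass_c = tag
    sequence = sequence.upper().replace("I", "L")
    hits = []
    for est_id, dna in ests.items():
        for frame, protein in six_frames(dna):
            protein = protein.replace("I", "L")
            pos = protein.find(sequence)
            while pos != -1:
                upstream = protein[:pos][::-1]  # walk outward from the tag
                downstream = protein[pos + len(sequence):]
                if (flank_matches(upstream, mass_n, tol)
                        and flank_matches(downstream, mass_c, tol)):
                    hits.append((est_id, frame, pos))
                pos = protein.find(sequence, pos + 1)
    return hits
```

For example, with the toy EST `GCTGTTCTGGATGAAGGTAAA` (frame +1 reads AVLDEGK), the tag `(MASS["A"] + MASS["V"], "LDE", MASS["G"] + MASS["K"])` is placed at residue index 2 of frame +1. Requiring the flanking masses in addition to the short sequence is what gives even a three-residue tag its search specificity.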
Mass spectrometric sequencing has intrinsically favorable characteristics that make it the technique of choice for EST data base identifications. Much lower amounts of protein are sufficient for analysis as compared with conventional amino acid sequencing (25). Several peptides can be fragmented in one experiment, which increases the chances that an analyzed peptide is represented in the EST data base. Sequence tags, i.e. short amino acid sequences together with their precise mass location in the peptide, are relatively simple to generate from tandem mass spectra, and this can often be done automatically. Sequence tags have a high statistical search specificity, making them a very powerful probe to locate the corresponding EST, even in the presence of DNA sequencing errors, as was the case here. We anticipate that mass spectrometric sequencing in conjunction with EST data bases will play an important role in cloning new proteins from organisms for which large EST data bases are available or from closely related organisms (45).
The authenticity of the isolated cDNA is supported by the finding that the recombinant protein p20-CGGBP-6xHis exhibits the same, although weaker, DNA-binding pattern as the purified protein. Most likely, p20-CGGBP is the exclusive protein partner in the p20-CGGBP-(CGG)17ds complex cI. It contains at least the DNA-binding domain of the involved proteins as shown by Southwestern blot analyses. Complex cI probably consists of more than one molecule of p20-CGGBP because it is highly sensitive to deoxycholate treatment (24), suggesting the involvement of protein-protein interactions (46).
The physiological function of the novel DNA-binding protein p20-CGGBP in triplet repeat function and instability cannot be derived solely from its cDNA sequence because of the lack of homology to known proteins. However, the high conservation of p20-CGGBP among mammals, as confirmed by Southern blot analyses (Fig. 4) and computer-aided sequence analyses, as well as its expression in a variety of human tissues, points to an important, if not essential, function in mammalian cells. It is attractive to speculate that the 5′-d(CGG)-3′ repeat itself has regulatory functions in the expression of the adjacent FMR1 gene. Short trinucleotide repeats 5′-d(CGG)-3′ or 5′-d(CTG)-3′ have been described to function as regulatory elements in at least two different mammalian promoters (16, 17). In this context, a report of two individuals carrying an expanded, but unmethylated, 5′-d(CGG)-3′ repeat in the 5′-UTR of the FMR1 gene without the fragile X phenotype but with normal expression of FMRP is of particular interest (47). It is likely that the silencing of FMRP expression in FraX individuals is in part due to cytosine-specific DNA methylation of the amplified repeat and adjacent sequences. Interestingly, p20-CGGBP can bind to an interrupted or uninterrupted FMR1 triplet repeat but not to a highly methylated 5′-d(CGG)-3′ repeat in vitro (24). The elucidation of the cDNA sequence of p20-CGGBP now provides a basis for the study of its role in triplet repeat function and expansion in mammalian and non-mammalian cells.