Identification of an unusual intein in chloroplast ClpP protease of Chlamydomonas eugametos.

The proteasome-like ClpP protease is widely distributed and structurally conserved among bacteria and eukaryotic cell organelles. In Chlamydomonas eugametos, however, the chloroplast clpP gene predicted a much larger ClpP protein containing large insertion sequences (ISs). One insertion sequence, IS2, is 456 amino acid residues long and not similar to known proteins. Here we show that IS2 is an unusual intein, and its protein splicing activity in Escherichia coli cells can be activated by a single amino acid substitution. Analysis of IS2 sequence revealed short sequence motifs that are similar to known intein motifs, including putative LAGLI-DADG endonuclease motifs. But a histidine residue conserved at the C terminus of known inteins is replaced in the IS2 sequence by a glycine residue (Gly455), rendering the IS2 sequence incapable of detectable protein splicing when tested in E. coli cells. Changing Gly455 to histidine activated the ability of IS2 to undergo protein splicing in E. coli cells. The IS2 sequence (intein) was precisely excised from a precursor protein, with the flanking sequences (exteins) joined together by a normal peptide bond.

An intein is defined as a protein sequence embedded in frame within a precursor protein sequence that is spliced out during maturation (1). This maturation process, termed protein splicing, involves the precise excision of the intein sequence and the formation of a normal peptide bond joining the external sequences (exteins). Protein splicing can therefore be considered the protein equivalent of RNA splicing and adds another layer of complexity to the central dogma of molecular biology. Inteins have been found in all three kingdoms of living organisms and in an increasing number of proteins. These include the 69-kDa catalytic subunit of the vacuolar H⁺-ATPase (VMA)¹ from Saccharomyces cerevisiae (2) and Candida tropicalis (7), RecA protein from Mycobacterium tuberculosis (3) and Mycobacterium leprae (4), DNA polymerases from Thermococcus litoralis (6,8) and Pyrococcus sp. strain GB-D (1), a gyrase protein (GyrA) from some Mycobacterium species (5), and an unidentified protein from M. leprae (9). Recent determination of the complete genome sequence of the methanogenic archaeon Methanococcus jannaschii revealed 18 putative intein sequences encoded in 14 genes (30). Most inteins have sequence motifs similar to homing endonucleases of self-splicing introns (10,11), and the S. cerevisiae VMA intein was shown experimentally to "home" into an intein-less allele (12). The inheritance of inteins could therefore be either horizontal (intein homing) or vertical (normal chromosomal transmission).
Many intein sequences were sufficient for protein splicing when placed within a foreign or target protein immediately before a nucleophilic residue (3,13,14), suggesting that all cis information required for protein splicing is contained within the intein. Protein splicing occurred when intein-containing proteins were produced in a wide range of heterologous contexts, including an in vitro splicing system (14), suggesting that the reaction is autocatalytic. Identified inteins range in size from 360 to 537 amino acid residues and have little sequence identity, although a number of short sequence motifs have been recognized that show a significant degree of conservation (9). Several models for protein splicing mechanisms have been proposed (8, 14-17). They generally involve an N→O acyl shift (or N→S shift) at the splice sites (18), formation of a branched intermediate (19), and cyclization of an invariant Asn residue at the C terminus of the intein to succinimide (20), leading to excision of the intein and ligation of the exteins.
ClpP is the catalytic subunit of the ATP-dependent and proteasome-like Clp protease that is widespread, if not ubiquitous, among prokaryotes and eukaryotic cell organelles (21-23). Genes for the ClpP protease have been characterized in many chloroplast genomes, and these invariably predicted ClpP proteins that are similar to bacterial and mitochondrial ClpP proteins in size and amino acid sequence (21,24,25). In the green alga Chlamydomonas, however, sequence determination of the chloroplast clpP gene revealed the presence of large translated insertion sequences (26). The Chlamydomonas eugametos chloroplast clpP gene predicted a protein sequence 1010 residues long, which is five times the size of a typical ClpP protein. This unusually large size was entirely due to the presence of large translated insertion sequences. The largest insertion sequence, IS2, is 456 residues long, is not present in several other Chlamydomonas species examined, and is not similar to known proteins. Here we show that IS2 is a modified intein with many of the recognized intein motifs. IS2 apparently lost an important histidine residue that is conserved in known inteins, and restoring this histidine residue activated the protein-splicing activity of IS2 in Escherichia coli cells.

EXPERIMENTAL PROCEDURES
DNA Cloning and Mutagenesis-A 2-kilobase pair BamHI-XhoI (blunt-ended) DNA fragment, corresponding to the 3′-two-thirds of the ClpP coding sequence, was cloned into the expression plasmid vector pMAL-c2 (New England Biolabs) at the BamHI-SalI (blunt-ended) site, so that the ClpP coding sequence was in frame with the upstream vector-encoded maltose-binding protein (MBP) coding sequence. Site-directed mutagenesis was carried out by the polymerase chain reaction-mediated site-directed mutagenesis method, using a mutagenic oligonucleotide primer containing the desired mutations. The mutagenized DNA fragment was used to substitute the corresponding DNA fragment in the recombinant pMAL-c2 plasmids containing the ClpP-coding sequence. The nucleotide sequence of the mutagenized DNA fragment was determined to confirm the presence of the specified mutations and the absence of unwanted mutations. Insertion of the ctag sequence was carried out by first cutting the target DNA with restriction enzyme SpeI, filling in the resulting 3′-recessed ends with Klenow DNA polymerase and dNTPs, and then performing religation of the resulting blunt ends.

* This work was supported by a grant from the Medical Research Council of Canada. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
‡ Present address: Dept. of Biochemistry, B400 Beckman Center, Stanford University, Stanford, CA 94305-5307.
§ Scholar of the Evolutionary Biology Program of the Canadian Institute for Advanced Research. To whom correspondence should be addressed. Tel.: 902-494-1208; Fax: 902-494-1355.
¹ The abbreviations used are: VMA, vacuolar H⁺-ATPase; IS, insertion sequence; MBP, maltose-binding protein; IPTG, isopropyl-1-thio-β-D-galactopyranoside.
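The SpeI fill-in and religation step described under DNA Cloning and Mutagenesis can be illustrated with a short sketch. It relies on the known SpeI recognition site (A^CTAGT, which leaves a 4-nucleotide CTAG overhang), so that filling in and blunt-end religation duplicates those four bases. The example sequence and its reading frame are hypothetical and chosen only for illustration; the actual clpP sequence context is not reproduced here.

```python
SPEI_SITE = "ACTAGT"  # SpeI cuts A^CTAGT, leaving a 4-nt CTAG overhang


def spei_fill_in_religate(seq: str) -> str:
    """Return the top strand after SpeI cut, Klenow fill-in, and blunt religation.

    The filled-in left end reads ...ACTAG and the filled-in right end reads
    CTAGT..., so religation duplicates CTAG (a net 4-base insertion).
    """
    seq = seq.upper()
    i = seq.find(SPEI_SITE)
    if i < 0:
        raise ValueError("no SpeI site found")
    left_filled = seq[: i + 5]   # ...A + CTAG
    right_filled = seq[i + 1:]   # CTAGT...
    return left_filled + right_filled


def first_stop_codon(seq: str, stops=("TAA", "TAG", "TGA")):
    """Return the 0-based position of the first in-frame stop codon, or None."""
    for pos in range(0, len(seq) - 2, 3):
        if seq[pos:pos + 3] in stops:
            return pos
    return None


# Hypothetical in-frame context: ATG ACT AGT GGA (the SpeI site read as ACT AGT)
original = "ATGACTAGTGGA"
modified = spei_fill_in_religate(original)
print(original, first_stop_codon(original))
print(modified, first_stop_codon(modified))
```

In this hypothetical frame the original sequence has no in-frame stop codon, whereas the modified sequence reads ATG ACT AGC TAG, placing a tag codon in frame immediately after the site. Whether a stop codon arises this way depends on the reading frame at the actual SpeI site.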
Protein Production and Characterization-E. coli cells containing the recombinant plasmid were grown in liquid Luria broth medium at 37°C to late log phase (A600, 0.5). IPTG was added at this time to a final concentration of 1 mM to induce the production of IS2-containing fusion proteins. After 2 h of induction, cells were harvested, lysed by sonication, and separated into soluble and insoluble fractions when required. Cellular proteins were resolved by SDS-polyacrylamide gel electrophoresis and visualized by staining with Coomassie Blue R-250. When needed, a protein band of interest was excised from the stained gel, and the protein was electroeluted. Western blotting analysis was carried out using the anti-MBP antiserum from New England Biolabs. The amount of protein in individual protein bands was estimated by using a gel documentation system (Gel Doc 1000 coupled with Molecular Analyst software, Bio-Rad).
For treatment with protease Factor Xa, electroeluted protein was first dialyzed against a Factor Xa reaction buffer (20 mM Tris-HCl, pH 8.0, 100 mM NaCl, 2 mM CaCl2). Factor Xa was then added at 1 μg for every 10 μg of protein substrate, and the mixture was incubated at 23°C for a specified length of time. Resulting protein fragments were resolved by SDS-polyacrylamide gel electrophoresis and transferred onto a polyvinylidene difluoride membrane by electrotransfer. The membrane was then stained in a Ponceau S staining solution (0.2% Ponceau S, 1% acetic acid in water) to visualize protein bands. The area of the polyvinylidene difluoride membrane containing the protein of interest was cut out and used for peptide analysis and protein sequencing.
Peptide analysis and protein microsequencing were carried out at the Microchemistry Facility of Harvard University. Proteins deposited on the polyvinylidene difluoride membrane were digested with trypsin in situ (on the membrane), and the resulting tryptic peptides were fractionated by high-performance liquid chromatography. Fractions suspected of containing peptides of interest were analyzed to determine the precise mass of the peptides, using matrix-assisted laser desorption time-of-flight mass spectrometry performed on a Finnigan (Hemel, United Kingdom) Lasermat 2000. When needed, peptides identified by mass were subjected to further identification by protein microsequencing.

RESULTS
Presence of Intein Motifs in IS2-The insertion sequence IS2 of C. eugametos chloroplast ClpP (26) was analyzed for the possible presence of intein-like features. This analysis revealed short sequence motifs (Fig. 1) that showed significant similarity to each of the seven conserved sequence motifs (A to G) recognized previously in known inteins (9), including motifs C and E that are putative endonuclease motifs (10,12). These intein-like sequence motifs of IS2 are similar to corresponding motifs of known inteins both in primary sequences and in relative positions, although the remaining part of the IS2 sequence does not have significant sequence similarity to any of the known inteins. The IS2 sequence is bounded by a pair of nucleophilic residues (Cys and Ser) and has an Asn residue at its C terminus. These three residues have been shown in known inteins to be critical for protein splicing (14, 18-20, 27). A His residue located immediately before the C-terminal Asn residue is also conserved in known inteins and thought to be important for protein splicing (13,15). Notably, this position in the IS2 sequence is occupied by a Gly residue.
Protein Splicing of IS2-containing Proteins in E. coli Cells-To examine whether the IS2 sequence can support protein splicing, we constructed recombinant plasmids that encode IS2-containing fusion proteins (Fig. 2). In addition to the unmodified (wild-type) IS2 coding sequence, mutant IS2 coding sequences generated through site-directed mutagenesis were also included. One mutation changed a ggt codon to a cat codon in the DNA sequence, resulting in a substitution of the corresponding Gly residue (Gly455) by a His residue in the protein sequence. This substitution is located immediately before the C-terminal Asn residue of IS2. Another mutation inserted a 4-base sequence in the DNA, creating a translation termination codon (tag) located 150 base pairs before the 3′-end of the IS2 coding sequence. Each IS2 coding sequence was flanked on both sides by in-frame coding sequences of other polypeptides: the 42-kDa MBP linked to a 105-residue ClpP sequence (SD2a) on the N-terminal side and a 107-residue ClpP sequence (SD2b) on the C-terminal side.
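The ggt-to-cat codon change can be checked against the standard genetic code with a few lines of code. The sketch below is only an illustration (the helper names are hypothetical): it lists the Gly codons (GGN) and His codons (CAT, CAC) and counts the nucleotide differences between them.

```python
from itertools import product

# Standard genetic code entries for the two residues involved
CODONS = {
    "Gly": ["GGT", "GGC", "GGA", "GGG"],
    "His": ["CAT", "CAC"],
}


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length codons differ."""
    return sum(x != y for x, y in zip(a, b))


# Differences for the specific engineered change (ggt -> cat)
engineered = hamming("GGT", "CAT")

# Smallest number of substitutions separating any Gly codon from any His codon
min_changes = min(
    hamming(g, h) for g, h in product(CODONS["Gly"], CODONS["His"])
)

print(engineered, min_changes)
```

Both values come out as two: the engineered ggt-to-cat change alters the first two codon positions, and no single-nucleotide substitution can interconvert a Gly codon and a His codon.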
Each of these recombinant plasmids was introduced into E. coli cells to produce the corresponding fusion protein and to observe possible protein-splicing products (Fig. 3). In cells containing the unmodified IS2 sequence, only an unprocessed precursor protein (117 kDa) was observed after staining with Coomassie Blue (Fig. 3A), indicating that the wild-type IS2 sequence was incapable of efficient protein splicing under these conditions. When the IS2 sequence was modified by replacing the Gly455 residue near its C terminus with a His residue, a putative spliced protein (66 kDa) was readily observed (Fig. 3A, lane 3), indicating that the single Gly→His substitution activated the ability of IS2 to undergo protein splicing. As controls, mutant IS2 sequences containing a termination codon, with or without the Gly→His substitution, resulted in only a truncated precursor protein. An excised IS2 polypeptide (calculated size, 55 kDa) was not observed, presumably failing to accumulate to a detectable level under the conditions used.
Accumulation of the 117-kDa precursor protein in cells containing the modified IS2 construct indicated that the protein splicing did not proceed to completion. Most of the 117-kDa precursor protein was found in the insoluble fraction of the cell lysate (Fig. 3B, lanes 2-4), suggesting that its accumulation was due to misfolding or formation of inclusion bodies. The 66-kDa spliced protein also accumulated in the insoluble fraction of the cell lysate. The insolubility of both proteins was most likely caused by the presence of incomplete ClpP sequences in these proteins. Small amounts of the 117-kDa precursor protein and the 66-kDa spliced protein remained soluble and could be partially purified by using an amylose resin affinity column (Fig. 3B, lane 5). This is consistent with the expectation that both proteins contain MBP, which binds to amylose resin. We were unable to induce the purified precursor protein to undergo protein splicing in vitro, probably because the precursor protein was trapped in a misfolded conformation.
Western blotting was carried out as a more sensitive method to detect and identify the spliced protein, using an antiserum against the MBP sequence of the fusion proteins. Both the 117-kDa precursor protein and the 66-kDa spliced protein were recognized by the antiserum (Fig. 3B, lane 8), which further supported the assignment of these two proteins. Additional protein bands located between the 117- and 66-kDa bands were also recognized by the anti-MBP antiserum, and they appeared more abundant in cells containing the wild-type IS2 sequence than in cells containing the modified IS2 sequence (Fig. 3B, lanes 7 and 8). These bands likely represent protein-splicing intermediates or breakdown products. In cells containing the wild-type IS2 sequence, a Western blot revealed a weak band at a position corresponding to the 66-kDa spliced protein (Fig. 3B, lane 7). It was not determined, however, whether this weak band represents a small amount of the spliced protein or merely a protein breakdown product that migrates to that position.
FIG. 2. [...] One mutation changes a ggt codon to a cat codon in the DNA sequence, resulting in a substitution of the corresponding Gly residue by a His residue in the protein sequence. Another mutation inserts a 4-base sequence in the DNA, creating a translation termination codon (tag) at that position. Bottom, structure of fusion genes constructed to produce fusion proteins. Each fusion protein consists of a MBP sequence (hatched box) fused to a portion of the C. eugametos ClpP sequence that includes the IS2 sequence (white box) and its flanking sequences (black and gray boxes). Construct pIS-G contains the unmodified (wild-type) IS2 sequence that has a Gly residue near its C terminus. In construct pIS-H, this Gly was substituted by a His residue. pIS-GT and pIS-HT are the same as pIS-G and pIS-H, respectively, except that each contains a translation termination codon (Ter) at the indicated position.

Identification of the Spliced Protein Product-The 66-kDa protein was tentatively identified as the spliced protein because of its size, its binding to amylose resin, and its recognition by anti-MBP antiserum (Fig. 3). To confirm its identity further, the 66-kDa protein was electroeluted from SDS-polyacrylamide gels (see Fig. 3A, lane 3) and subjected to further analysis. First, it was cleaved by the sequence-specific protease Factor Xa, which cuts at a known site located immediately after the MBP sequence. After incubation with Factor Xa for various lengths of time, the 66-kDa protein was cleaved into a 42-kDa fragment corresponding to MBP and a smaller fragment of 24 kDa in size (Fig. 4A).
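The reported band sizes can be sanity-checked with a back-of-the-envelope estimate. The sketch below is an illustration rather than part of the original analysis. It assumes an average residue mass of about 110 Da (a common rule of thumb, not a value from this study), takes the 42-kDa MBP size and the SD2a, IS2, and SD2b lengths (105, 456, and 107 residues) from the text, and ignores any linker residues. All names are hypothetical.

```python
AVG_RESIDUE_KDA = 0.110  # assumed average residue mass (rule of thumb)
MBP_KDA = 42.0           # MBP size as stated in the text

# Residue counts as stated in the text
SD2A_RES, IS2_RES, SD2B_RES = 105, 456, 107


def est_kda(n_residues: int) -> float:
    """Rough molecular mass estimate in kDa from a residue count."""
    return n_residues * AVG_RESIDUE_KDA


# ClpP fragment left after Factor Xa removes MBP from the spliced protein
clpp_fragment = est_kda(SD2A_RES + SD2B_RES)
# Spliced fusion protein: MBP + SD2a + SD2b
spliced = MBP_KDA + clpp_fragment
# Unspliced precursor: MBP + SD2a + IS2 + SD2b
precursor = MBP_KDA + est_kda(SD2A_RES + IS2_RES + SD2B_RES)

print(f"ClpP fragment ~{clpp_fragment:.1f} kDa")
print(f"spliced protein ~{spliced:.1f} kDa")
print(f"precursor ~{precursor:.1f} kDa")
```

The estimates (roughly 23, 65, and 115 kDa) fall within about 2 kDa of the observed 24-, 66-, and 117-kDa bands, which is as close as this kind of approximation can be expected to get given that gel mobility reflects apparent rather than exact mass.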
The 24-kDa fragment was predicted to be a fusion product of the ClpP sequence domains SD2a and SD2b (see Fig. 2) as a result of protein splicing. To confirm this, the 24-kDa fragment was blotted onto a polyvinylidene difluoride membrane and subjected to peptide analysis and protein sequencing. After digestion of the 24-kDa fragment with protease trypsin, four of the resulting tryptic peptides were analyzed by mass spectrometry to determine their precise masses. For each of the four peptides, the measured mass correlated very well with the theoretical mass of a predicted tryptic peptide of a spliced protein (Fig. 4B). Among the four peptides, peptides p1 and p2 are within the SD2a sequence domain, and peptide p4 is within the SD2b sequence domain, whereas peptide p3 spans the spliced junction (Fig. 4B). Peptide p3 was then definitively identified by determining the first 30 residues of its sequence by protein microsequencing (Fig. 4B).
These results firmly identified the 66-kDa protein as a spliced protein produced by protein splicing, in which the IS2 sequence (intein) was precisely excised from the 117-kDa precursor protein, with the flanking sequences (exteins) joined together by a normal peptide bond (Fig. 4C). No RNA splicing was detected by Northern blotting or by a more sensitive reverse transcription-polymerase chain reaction method, although an unspliced RNA transcript was detected by Northern blotting (data not shown). This is consistent with the absence of recognizable RNA intron features in the IS2 coding sequence (26). Our observation that complete translation of the IS2 sequence is necessary for producing the 66-kDa spliced protein (Fig. 3A) also argues against RNA splicing and translational bypassing (28).

DISCUSSION
The insertion sequence IS2 of C. eugametos chloroplast ClpP protease was identified as an intein, not only by the presence in its sequence of recognizable intein motifs, but also by the observation of protein splicing of IS2-containing proteins in E. coli cells. Identification of this chloroplast-encoded intein extended the distribution of intein-coding sequences to a cell organelle genome. Previously identified intein-coding sequences were found in a nuclear genome (2, 7), eubacterial genome (3-5), and archaebacterial genome (6). A 150-residue insertion sequence in the chloroplast DnaB protein of Porphyra purpurea was noted previously to also have some of the putative intein motifs (9,29), and may prove to be another intein encoded by a cell organelle genome. It was noted recently that inteins were found predominantly in proteins involved in DNA metabolism (5), which included DNA polymerases, a DNA recombinase, a DNA gyrase, and possibly a DNA helicase. The intein-containing ClpP protein, being a protease (21-24) not known to interact with DNA, is therefore an exception.
The ClpP IS2 intein apparently lost an important His residue (replaced by Gly455) that is highly conserved in known inteins. It takes a minimum of two nucleotide substitutions to change a His codon into a Gly codon, and no RNA editing is known in this organism. This suggests that the IS2 intein may be somewhat degenerate and may have accumulated additional, although less critical, mutations in other parts of its sequence. Although the IS2 intein apparently retains many of the putative intein motifs, the degree of overall sequence similarity between IS2 and other inteins is extremely low and practically undetectable. When only the seven conserved intein sequence motifs (a total of 93 positions) were compared, sequence identities between IS2 and other inteins were 30% for the Pps1 intein of M. leprae, 25% for the VMA1 intein of yeast, 19% for the RecA intein of M. tuberculosis, and much lower for others. Low sequence similarities between IS2 and other inteins may suggest independent origins, long periods of separate evolution, or low functional constraint. Although the origin of the ClpP IS2 intein is not known, it probably was acquired by C. eugametos relatively recently, because several other, closely related Chlamydomonas species lack this intein (26). The ClpP IS2 intein contains two putative endonuclease motifs (C and E) that could be associated with intein mobility (intein homing), although endonuclease activity has not been detected.

FIG. 4. Identification of the 66-kDa protein as a spliced protein. A, cleavage by protease Factor Xa. The 66-kDa protein (Fig. 3A, lane 3), isolated by excising and electroeluting from the SDS-polyacrylamide gel, was treated with protease Factor Xa that cleaves at a sequence-specific site located between the MBP sequence and the ClpP sequence. Lanes 1-4 correspond to 0, 30, 60, and 120 min of treatment, respectively. Cleavage products were resolved by SDS-polyacrylamide gel electrophoresis and visualized by Coomassie Blue staining. The 66-kDa protein and its cleavage products (42-kDa MBP and 24-kDa ClpP fragment) are marked. B, identification of the 24-kDa fragment by peptide analysis and protein sequencing. The predicted sequence of the 24-kDa fragment is shown. Four tryptic peptides (p1-p4) that were identified by mass spectrometry are underlined. In parentheses, the first number is the theoretical mass of that peptide, and the second number is the measured mass of that peptide. The first 30 residues of peptide p3 revealed by protein microsequencing are boldface and underlined, and the spliced junction is marked by a star. C, ClpP protein splicing. The 117-kDa precursor protein is illustrated at the top, with amino acid sequences of the splice sites shown. The 66-kDa spliced protein is illustrated in the middle. A 30-residue polypeptide sequence spanning its spliced junction was determined by protein sequencing and is shown. This sequence matches precisely the predicted sequence of a spliced protein. An excised IS2 protein (predicted to be 55 kDa) is expected and illustrated at the bottom, although this protein was not identified in this work.
The ClpP IS2 intein was incapable of detectable protein splicing in E. coli cells under conditions used in this study, whereas replacing the single Gly455 residue with a His residue activated the protein-splicing activity. This suggests that a His residue at this position plays an important role in protein splicing, consistent with the observation that this His residue is highly conserved among known inteins. Although an explicit role for this conserved His residue has not been demonstrated in recent studies of protein-splicing mechanisms, it was proposed in an earlier model that this conserved His residue may assist the N→O acyl shift at the splice sites (15). In the yeast VMA intein, experimentally replacing this His residue by a Gly residue completely abolished its protein-splicing activity, whereas replacement by Lys, Glu, Val, and Leu residues resulted in a lower rate of protein splicing (13). Interestingly, among the 18 putative intein sequences revealed recently in the M. jannaschii genome sequence, two also lack this conserved His residue (30).
In the ClpP IS2 intein, the natural substitution of the conserved His residue by a Gly residue (Gly455) raises questions about the in vivo protein-splicing activity of this intein. The conserved His residue may not be required for efficient ClpP protein splicing in its native chloroplast environment, and the in vivo splicing might be assisted by unknown trans-acting factors. A slow splicing reaction may provide a means of regulating the ClpP protease activity, and an unexcised IS2 intein might also be tolerated under some conditions. Our efforts to detect the chloroplast ClpP protein (precursor or spliced protein) in C. eugametos cells by Western blotting have not been successful, presumably because of a combination of the weak anti-ClpP antibodies used in the study and a low level of the ClpP protein in the cell.