Split dnaE Genes Encoding Multiple Novel Inteins in Trichodesmium erythraeum

Three inteins were found in an analysis of a pair of split dnaE genes encoding the catalytic subunit of DNA polymerase III in the oceanic N2-fixing cyanobacterium Trichodesmium erythraeum. The three inteins (DnaE-1, DnaE-2, and DnaE-3) were clustered in a 70-amino acid (aa) region of the predicted DnaE protein. The DnaE-1 intein is 1258 aa long and three times as large as a typical intein, due to the presence of large tandem repeats in which a 57-aa sequence is repeated 17 times. The DnaE-2 intein has a more typical size of 428 aa with putative protein splicing and endonuclease domains. The DnaE-3 intein is a split intein consisting of a 102-aa N-terminal part and a 36-aa C-terminal part encoded on the first and second split dnaE genes, respectively. Synthesis of a mature DnaE protein is predicted to involve expression of the two split dnaE genes followed by two protein cis-splicing reactions and one protein trans-splicing reaction. Tandem repeats in the DnaE-1 intein inhibited the protein splicing activity of this intein when tested in Escherichia coli cells and may potentially regulate DnaE synthesis in vivo.

Inteins are protein sequences embedded in precursor proteins, and they catalyze the protein-splicing reaction that excises the intein and joins the flanking sequences (N- and C-exteins) with a peptide bond (1). A typical protein splicing reaction has four steps: an N→S (or N→O) acyl shift at the N terminus of the intein, trans-esterification forming a branched intermediate, cyclization of an Asn residue at the C terminus of the intein, and formation of a peptide bond between the flanking exteins (2-4). A typical intein is ~400 aa long and bi-functional, having both a protein-splicing function and an endonuclease function. The two functions correspond to two separable structural domains of the intein, in which the N- and C-terminal parts of the intein make up the protein splicing domain and the middle part makes up the endonuclease domain (5). The protein splicing function ensures the self-removal of the intein from the host protein, and the endonuclease function promotes the maintenance and spread of the intein coding sequence via a gene conversion (intein homing) mechanism (6-8). Nearly 200 intein and intein-like sequences have been found, and most are listed in the Intein Registry (9,10). These inteins are distributed sporadically in many different proteins among bacteria, archaea, and eukaryotes, which is consistent with inteins being mobile genetic elements capable of lateral gene transfer. Although inteins generally show low levels of sequence conservation, they appear similar in overall structure, function, and evolutionary origin (11-13).
Inteins have been found most frequently in DNA-related proteins such as DNA polymerase, although they also occur in many other types of proteins, and it is possible that intein hotspots exist in some proteins for as yet unknown reasons (11,13). Some inteins lack an endonuclease domain and exist as mini-inteins with a protein splicing function only. A DnaE intein in some cyanobacterial species not only lacks an endonuclease domain but also exists as a split intein with its N- and C-terminal parts encoded on two separate genes. In Synechocystis sp. PCC6803 (Ssp), synthesis of a mature DnaE protein required the expression of the two split dnaE genes followed by a protein trans-splicing reaction catalyzed by the split intein (14). In this study we analyzed the dnaE gene of the oceanic N2-fixing filamentous cyanobacterium Trichodesmium erythraeum, which forms massive surface blooms (rapid cell proliferation) with important ecological consequences (15,16). Our analysis revealed three inteins clustered in a small region of the DnaE protein: a typical bi-functional intein, a split mini-intein, and an unusual intein containing large tandem repeats that inhibited protein splicing.

EXPERIMENTAL PROCEDURES
Gene Cloning and Analysis-GenBank™ searches and protein sequence alignments were performed using the BLAST search program (17) and the Clustal W program (18). Parts of the Ter dnaE gene were amplified by polymerase chain reaction (PCR) from total genomic DNA of T. erythraeum strain IMS101. The DnaE-1 intein coding sequence was PCR-amplified using primers M144 (5′-ATGTCCTTCGTCGGTCYTCCATATC-3′) and M145 (5′-ATCAATAAATCGCCTTCACATTGTAATC-3′) and cloned in a pDrive plasmid vector (Qiagen). To allow complete DNA sequence determination, nested deletions were made in the cloned DNA by partial digestion at XmnI restriction sites, which are present once in each of the 171-bp repeat sequences. In Southern blot analysis, ~2 μg of T. erythraeum genomic DNA was digested with restriction enzymes BfuAI and BglII, resolved by electrophoresis in a 1% agarose gel, and blotted onto nylon membrane. The membrane was hybridized to a specified 32P-labeled DNA probe for 20-24 h and washed under high stringency (65°C in 0.2× SSC and 1% SDS) before exposure to x-ray film.
Protein Splicing Analysis in Escherichia coli Cells-To construct gene expression plasmids, the Ter DnaE-1 intein coding sequence (PCR-amplified using the above M144 and M145 primers) was inserted in the previously made pMST plasmid (19) between XhoI and AgeI sites, replacing the Ssp DnaB intein coding sequence of pMST′. Protein production in E. coli cells, gel electrophoresis, and Western blot analysis were performed as before (19). Briefly, cells containing the expression plasmid were grown in liquid Luria Broth medium at 37°C to late log phase (A600, 0.5). Isopropyl-1-thio-β-D-galactopyranoside was added to a final concentration of 0.8 mM to induce production of the recombinant protein, and the induction was continued for 3 h at 37°C. Cells were then harvested and lysed in SDS- and dithiothreitol-containing gel loading buffer in a boiling water bath before electrophoresis in SDS-polyacrylamide gel. Western blots were carried out using anti-thioredoxin antibody (American Diagnostica) and the enhanced chemiluminescence detection kit (ECL). The intensity of each protein band was estimated using a gel documentation system (Gel Doc 1000 coupled with Molecular Analyst software, Bio-Rad).

RESULTS
Split dnaE Genes Contain Tandem Repeats-To find new split dnaE genes based on our previous finding (14), we searched GenBank™, which contains the nearly complete genome sequence of T. erythraeum strain IMS101 that had been determined at the Department of Energy Joint Genome Institute (www.jgi.doe.gov). As illustrated in Fig. 1, parts of putative split dnaE genes were found on three sequences with accession numbers AAAU01000106, AAAU01000117, and AAAU01000050, respectively. Our initial analysis indicated that the first split dnaE gene spans an undetermined sequence gap between the first two GenBank™ sequences.
To determine the sequence of this gap, it was PCR-amplified from T. erythraeum genomic DNA using primers specific to the two GenBank™ sequences, and the resulting 4-kb DNA fragment was cloned (Fig. 2). DNA sequence determination of the cloned DNA produced the missing part of the first split dnaE gene sequence and revealed large tandem repeats in which a 171-bp DNA sequence is repeated 17 times. To confirm the presence and size of the tandem repeats in the T. erythraeum genome, Southern blot analysis was performed after digesting the genomic DNA with BfuAI and BglII restriction enzymes, which cut outside the tandem repeats (Fig. 2). A 0.3-kb DNA probe next to the repetitive sequence hybridized to a BfuAI-BglII DNA fragment that closely matched the 3216-bp size predicted from the DNA sequence. No DNA fragments of other sizes were detected, indicating a uniform size of the tandem repeats across the cell population and their genomes. A 1.3-kb DNA probe, which is specific to the tandem repeats, also hybridized to the 3.2-kb BfuAI-BglII DNA fragment as predicted. This DNA probe did not hybridize to DNA fragments of other sizes, indicating that the 171-bp repetitive sequence was not detectable in other parts of the genome under these conditions. This is consistent with the results of BLAST searches of GenBank™, in which we did not find the 171-bp repetitive sequence outside the dnaE gene in the nearly complete genome sequence of T. erythraeum.
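As an illustration of the repeat structure described above, the following minimal Python sketch counts back-to-back copies of a repeat unit in a DNA string and checks the total repeat length implied by the reported figures (171 bp repeated 17 times). The function name `count_tandem_copies` and the toy sequence are hypothetical and are not the T. erythraeum sequence. Exact matching is also a simplification, because the real repeating units are not all identical.

```python
def count_tandem_copies(seq: str, unit: str) -> int:
    """Return the largest number of back-to-back exact copies of `unit` in `seq`."""
    best = 0
    start = seq.find(unit)
    while start != -1:
        copies = 0
        pos = start
        # Walk forward one unit at a time while the copies stay contiguous.
        while seq.startswith(unit, pos):
            copies += 1
            pos += len(unit)
        best = max(best, copies)
        start = seq.find(unit, start + 1)
    return best


# Toy data (hypothetical, for illustration only).
unit = "GAATTGCC"
toy = "ACGT" + unit * 5 + "TTAG"
print(count_tandem_copies(toy, unit))  # prints 5

# Sizes reported in the text: a 171-bp unit repeated 17 times.
REPEAT_UNIT_BP = 171
REPEAT_COUNT = 17
print(REPEAT_UNIT_BP * REPEAT_COUNT)  # prints 2907
```

A tolerant version would need approximate matching (for example, allowing a fixed number of mismatches per unit), which is not shown here.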
DnaE Protein Contains Multiple Novel Inteins-The above findings revealed in T. erythraeum (Ter) a pair of split dnaE genes that together encode a complete DnaE protein (Fig. 1). The predicted Ter DnaE protein contains three inteins named the Ter DnaE-1, Ter DnaE-2, and Ter DnaE-3 inteins, respectively. The three inteins divide the DnaE protein sequence into four parts that are named the E1, E2, E3, and E4 exteins and are 719, 26, 44, and 481 aa long, respectively. The predicted mature DnaE protein (1270 aa of extein sequences, excluding the intein sequences) is over 60% identical to DnaE proteins of two distantly related cyanobacterial species, and this level of DnaE sequence identity is similar to that between the other two cyanobacterial species. BLAST searches of GenBank™ sequences did not reveal additional dnaE or dnaE-like genes (complete or partial) in the nearly complete T. erythraeum genome sequence. The three inteins in the Ter DnaE protein are clustered in a 70-aa region of the DnaE extein sequences, with the DnaE-1, DnaE-2, and DnaE-3 inteins located after residues Leu719, Arg745, and Tyr789, respectively, of the DnaE extein sequences (Fig. 3A). The extein-intein boundaries were readily defined, because the extein sequences are very similar to DnaE proteins of other cyanobacteria and also because of the conserved intein sequence features (Fig. 3, B and C). A Cys immediately follows the DnaE-1 intein in the Ter DnaE protein, whereas Val is found at the corresponding position in other DnaE proteins lacking the intein. Similarly, a Cys immediately follows the DnaE-2 intein in the Ter DnaE protein, whereas Ala is found at the corresponding position in other DnaE proteins lacking the intein. The DnaE-3 intein is followed by Cys in all three compared DnaE proteins containing the intein. These observations are consistent with the requirement of a nucleophilic amino acid (e.g. Cys) after an intein (3) and may explain the absence of DnaE-1 and DnaE-2 inteins in the other DnaE proteins.
The Ter DnaE-1 intein has a predicted 1258-aa sequence consisting of a 299-aa N-terminal sequence, a 969-aa repetitive sequence (17 tandem repeats of a 57-aa sequence), and a 40-aa C-terminal sequence (Fig. 3B). The Ter DnaE-1 intein would be 339 aa long, after excluding the tandem repeats, and has putative intein sequence motifs (20-22) for a protein splicing domain and an endonuclease domain (Fig. 3B), although it showed less than 15% sequence identity to other known inteins. The presence of the tandem repeats makes the Ter DnaE-1 intein three times as large as a typical intein. The tandem repeats are located at or near the C-terminal boundary between the putative endonuclease domain and the splicing domain. The coding sequence of the tandem repeats is most likely translated, because its 2907-bp sequence is maintained as a continuous open reading frame with its flanking coding sequences, while its non-coding frames contain numerous termination codons as expected. Among the 17 tandem repeats, the first 16 (R1-R16) are 91-100% identical to each other in DNA sequence and 88-100% identical to each other in protein sequence, and the last repeat (R17) differs from the others at the C-terminal end (Fig. 3B).
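The extein coordinates quoted above follow directly from the four extein lengths. The short Python sketch below reproduces that arithmetic. The variable names are illustrative, and the only inputs are the extein lengths reported in the text.

```python
from itertools import accumulate

# Extein lengths (aa) as reported in the text.
extein_lengths = {"E1": 719, "E2": 26, "E3": 44, "E4": 481}

# Running totals give the last extein residue before each intein,
# plus the total length of the mature (intein-free) DnaE protein.
boundaries = list(accumulate(extein_lengths.values()))
insertion_sites = boundaries[:-1]   # residues after which inteins are inserted
mature_length = boundaries[-1]      # total extein length

# Distance between the first and last intein insertion sites.
cluster_span = insertion_sites[-1] - insertion_sites[0]

print(insertion_sites)  # prints [719, 745, 789]
print(mature_length)    # prints 1270
print(cluster_span)     # prints 70
```

The three running totals correspond to the reported insertion sites after Leu719, Arg745, and Tyr789, and the span of 70 aa corresponds to the region in which the three inteins are clustered.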
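The open reading frame argument above can be illustrated with a simple stop-codon count in each forward reading frame. The Python sketch below is a minimal illustration under stated assumptions: the function name `stops_per_frame` and the toy sequence are hypothetical, only the three forward frames are scanned, and the standard stop codons TAA, TAG, and TGA are assumed. It is not the analysis procedure used in this study.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard stop codons (assumed)


def stops_per_frame(dna: str) -> list[int]:
    """Count stop codons in each of the three forward reading frames."""
    dna = dna.upper()
    counts = []
    for frame in range(3):
        # Step through complete codons only, starting at the frame offset.
        codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
        counts.append(sum(codon in STOP_CODONS for codon in codons))
    return counts


# Toy repeat (hypothetical): frame 0 stays open, frame 2 is full of stops.
toy = "GCTGATAAC" * 4
print(stops_per_frame(toy))  # prints [0, 0, 8]
```

A frame with a count of zero over the whole repeat region is a candidate coding frame, whereas frames with many stops are unlikely to be translated.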
The Ter DnaE-2 intein resembles typical bi-functional self-splicing inteins, with a predicted 428-aa sequence that has putative sequence motifs for both a protein splicing domain and an endonuclease domain (Fig. 3B). The Ter DnaE-2 intein is significantly similar to the Ssp DnaB intein, showing 27% protein sequence identity and 46% protein sequence similarity.
The Ter DnaE-3 intein is a split intein encoded by two split dnaE genes found on two as yet unlinked sequence contigs (Fig. 1). These two split dnaE genes could be on opposite DNA strands of the T. erythraeum genome, as was the case in another cyanobacterium (14), or they could be on the same DNA strand and at least 56,449 bp apart, based on the available DNA sequences. Predicted protein sequences of the DnaE-3 split intein consist of a 102-aa N-terminal part encoded on the first split dnaE gene and a 36-aa C-terminal part encoded on the second split dnaE gene (Figs. 1 and 3C). GTG was assumed to be the start codon of the second split dnaE gene, based on the absence of an upstream ATG codon or an appropriate downstream ATG codon, and also based on amino acid sequence comparisons with other DnaE split inteins (Fig. 3C). The Ter DnaE-3 split intein is 59% identical to the previously characterized Ssp DnaE split intein, although the N-terminal part of the Ter DnaE-3 split intein is 21 aa shorter than that of the Ssp DnaE split intein (Fig. 3C).
Tandem Repeats in Ter DnaE-1 Intein Affect Protein Splicing-To determine whether the Ter DnaE-1 intein is functionally affected by the presence of the tandem repeats, we tested its protein splicing activity in E. coli cells. A plasmid-borne fusion gene was constructed to produce a fusion protein in which the Ter DnaE-1 intein (plus its 5-aa native extein sequence on each side) was fused to an N-terminal maltose-binding protein and a C-terminal thioredoxin (Fig. 4). Similar fusion proteins containing other inteins have been used in previous studies (19), so that the protein splicing products are readily identified using SDS-polyacrylamide gel electrophoresis and Western blotting. Precursor protein, spliced protein, and excised intein were identified by their predicted sizes, and the first two were further identified using anti-thioredoxin antibody. Variations of the fusion gene were also made by deleting part or all of the coding sequence of the tandem repeats. When the complete tandem repeats (seventeen 57-aa repeating units) were present in the intein, only the precursor protein was observed at either 37 or 25°C incubation temperature, indicating the absence of protein splicing. When the number of the 57-aa repeating units was reduced to fewer than 4, both spliced protein and excised intein were observed in addition to the precursor protein, indicating that protein splicing had occurred. To estimate the efficiency of protein splicing, the intensity of each protein band on the Western blot was used to measure the amount of that protein, and protein splicing efficiency was calculated as the amount of spliced protein divided by the sum of spliced protein and precursor protein. The efficiency of protein splicing was estimated to be 73, 55, 50, and 45% for the Ter DnaE-1 intein containing 0, 1, 2, and 3 of the 57-aa repeating units, respectively. The Ter DnaE-1 intein containing 4 of the 57-aa repeating units showed less than 5% protein splicing.
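The splicing efficiency defined above (spliced protein divided by the sum of spliced and precursor protein) can be written as a short function. The Python sketch below is illustrative only: the function name `splicing_efficiency` is hypothetical, and the band intensities are made-up values chosen to reproduce a 73% efficiency, not measurements from this study.

```python
def splicing_efficiency(spliced: float, precursor: float) -> float:
    """Fraction of thioredoxin-containing protein that has been spliced.

    Both arguments are band intensities in the same arbitrary units.
    """
    total = spliced + precursor
    if total <= 0:
        raise ValueError("band intensities must sum to a positive value")
    return spliced / total


# Hypothetical band intensities (arbitrary units), not measured data.
efficiency = splicing_efficiency(spliced=73.0, precursor=27.0)
print(f"{efficiency:.0%}")  # prints 73%
```

Because the ratio uses only the two bands detected by the anti-thioredoxin antibody, it is independent of total protein loading as long as both bands lie in the linear range of detection.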

DISCUSSION
Split dnaE Genes and Multiple Inteins-We have shown that two split dnaE genes together encode a DnaE protein containing three inteins in T. erythraeum, predicting that the synthesis of mature DnaE protein involves expression of the two split dnaE genes followed by two protein cis-splicing reactions and one protein trans-splicing reaction. The predicted mature DnaE protein, after excluding the inteins, is very similar to known DnaE proteins of related organisms in size and sequence, and no other dnaE-like gene was found in the nearly complete genome sequence of T. erythraeum. The three inteins were readily identified in the Ter DnaE protein, because they are insertion sequences showing clear boundaries, conserved intein sequence motifs, and significant sequence similarities to known inteins. The DnaE-1 and DnaE-2 inteins may be relatively recent acquisitions in the Ter DnaE protein, because they were found only in T. erythraeum, and they have putative endonuclease domains for intein homing. Sequence similarities (27% identical and 46% similar) between the Ter DnaE-2 intein and the cyanobacterial Ssp DnaB intein are significantly higher than those seen among non-homologous inteins (12, 23), which may suggest a common ancestor for these two inteins, although their host proteins and insertion sites have no apparent similarity. The Ter DnaE-1 intein is of unknown origin, having very low sequence similarity to other known inteins even after excluding the tandem repeats. The Ter DnaE-3 split intein, present in at least two other cyanobacterial species (9), likely has an ancient origin preceding the divergence of the distantly related cyanobacterial species. This split intein is least likely to spread through intein homing or lateral gene transfer, because it lacks an endonuclease domain and is encoded on split genes.
The DnaE-1, DnaE-2, and DnaE-3 inteins are clustered in a small (70 aa) region of the large (1270 aa, excluding the inteins) DnaE protein, suggesting that this small region may be a hotspot for intein insertions.
Intein hotspots may also exist in other proteins, such as a DNA replication factor protein in which three inteins (Mja RFC-1, RFC-2, and RFC-3 inteins) were found in an 87-aa region (9). The Ter DnaE protein is unique in that its three inteins include both cis-splicing inteins and a trans-splicing split intein.
Tandem Repeats Inside Intein-The Ter DnaE-1 intein is most unusual in containing large tandem repeats, which were revealed by sequence determination of the cloned DNA and confirmed by Southern blot analysis of the genome. This is the first repetitive sequence found inside an intein. The tandem repeats are strikingly large (57 aa repeated 17 times), unlike the single amino acid repeats associated with Huntington's disease (24). Also unlike typical repetitive sequences that are dispersed in many parts of the genome and often not translated, the tandem repeats of the Ter DnaE-1 intein were translated and were not detected in other parts of the T. erythraeum genome in GenBank™ searches or in Southern blot analysis. The tandem repeats may have originated from duplications of a pre-existing piece of the intein sequence, although they did not show similarity to known intein sequences and were not required for the protein splicing function of the intein. Alternatively, the tandem repeats may have originated from exogenous sequences that entered the Ter DnaE-1 intein. Self-splicing inteins have been thought of as "safe havens" for the homing endonucleases present in most inteins, because the self-removal of the intein through protein splicing makes them tolerated by the host protein. The Ter DnaE-1 intein could be such a safe haven for the tandem repeats, if the latter were selfish genetic elements of unknown function. The Ter DnaE-1 intein could also be a "waste bin," if the tandem repeats had no function at all.
Possible Function of the Tandem Repeats-Perhaps more likely, the tandem repeats play a biological role in T. erythraeum, which would provide a reason for this organism to maintain them. Tandem repeats are prone to changes in length, because of polymerase slippage in DNA replication and unequal crossing over in DNA recombination. The tandem repeats of the Ter DnaE-1 intein exhibited a uniform size in PCR and Southern blot analysis, which may suggest an unknown mechanism or reason for maintaining the number of the 57-aa repeating units at 17. Individual repeating units of the tandem repeats showed less than 10% DNA sequence heterogeneity and less than 12% protein sequence heterogeneity, which may also suggest an unknown mechanism or reason for maintaining the tandem repeats. Different tandem repeats in a number of other proteins are known to play structural or ligand/substrate-binding roles; examples include mucin, cell surface proteins, and proteins containing spectrin repeats (25-27). Our results suggested that the tandem repeats of the Ter DnaE-1 intein may regulate the protein splicing activity of this intein, because protein splicing in E. coli cells was inhibited when 4 or more of the 57-aa repeating units were present in the intein. This may predict a regulatory mechanism in the native T. erythraeum cell that overcomes the inhibition of protein splicing by the tandem repeats, or reduces the number of the 57-aa repeating units to fewer than 4, to produce a mature DnaE protein (catalytic subunit of DNA polymerase III). The predicted regulatory mechanism may regulate DNA polymerase III synthesis and in turn DNA replication and cell proliferation. Interestingly, the N2-fixing T. erythraeum is known to form massive ocean surface blooms (rapid cell proliferation) with important ecological consequences (15,16), although it is not known what controls the blooms, and rapid DNA replication must accompany the rapid cell proliferation.
Middle, predicted sizes of protein products from different fusion protein constructs (numbered as lanes 2 through 7) containing the specified number of the 57-aa repeating units (DnaE-1 intein, 0, 1, 2, 3, 4, and 17 repeats). Fusion protein construct containing the Ssp DnaB mini-intein (numbered as 1) was included as a known standard for identifying the spliced protein (19). Bottom, observation of protein splicing. Total cellular proteins of E. coli cells producing the specified fusion protein were resolved by SDS-PAGE and visualized by Coomassie Blue staining (left panel) or Western blot using anti-thioredoxin antibody (right panel). The lane numbers correspond to the above fusion protein construct numbers. Position of the precursor protein is marked by the letter P, the spliced protein by the letter S, and the excised intein by a black dot.