Enigmatic Distribution, Evolution, and Function of Inteins

Inteins are mobile genetic elements capable of self-splicing post-translationally. They exist in all three domains of life including in viruses and bacteriophage, where they have a sporadic distribution even among very closely related species. In this review, we address this anomalous distribution from the point of view of the evolution of the host species as well as the intrinsic features of the inteins that contribute to their genetic mobility. We also discuss the incidence of inteins in functionally important sites of their host proteins. Finally, we describe instances of conditional protein splicing. These latter observations lead us to the hypothesis that some inteins have adapted to become sensors that play regulatory roles within their host protein, to the advantage of the organism in which they reside.


Protein Splicing
Protein splicing is a naturally occurring biochemical process that mediates the post-translational conversion of a precursor polypeptide into a mature and functional protein through the removal of an internal protein element, called an intein (Fig. 1A). The process is analogous to intron splicing at the RNA level. The protein splicing mechanism involves a series of autocatalytic peptide bond rearrangements, where the intein excises itself from the precursor polypeptide with concurrent ligation of the flanking sequences, called exteins (N- or C-exteins relative to the position of the intein) (for review, see Refs. 1-3). As a result of this process, two proteins are produced from a single polypeptide product. The term intein refers to both the genetic element in the DNA or RNA and the protein splicing entity. Most inteins are expressed within a single polypeptide chain (cis-splicing inteins), but some are split into two polypeptides, each containing one extein and an intein fragment (trans-splicing inteins) (4,5). In the case of split inteins, reassociation of the fragments at a zipper-like interface (6) precedes protein splicing (Fig. 1B). Both cis-splicing and trans-splicing inteins are frequently utilized in various biotechnological applications including protein purification, modification, labeling, and post-translational control of expressed proteins (reviewed in Refs. 7 and 8). The cis-splicing inteins often contain a distinct homing endonuclease (HEN) domain (9). HEN-containing inteins are naturally occurring mobile genetic elements. The presence of a HEN gives inteins the ability to transfer their coding elements into homologous alleles at homing sites that lack the intein sequence (Fig. 1D). This HEN-mediated homing process can result in horizontal gene transfer (HGT) of inteins, by invasion of diverse species, followed by vertical transmission of inteins (10,11).
Moreover, HEN-containing inteins are involved in a so-called "homing cycle" that includes two opposing processes, precise intein loss and reinvasion of a newly formed vacant homing site. It is believed that the homing cycle allows the HEN to avoid fixation and functional decay in one locus (12).
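The outcome of the splicing reaction described at the start of this section can be pictured as a simple operation on the precursor sequence: the intein is cut out and the two exteins are joined. The following is a minimal sketch of that bookkeeping only, not of the chemistry. The function name, the coordinates, and the toy sequence (lowercase exteins, uppercase intein) are hypothetical and chosen purely for illustration.

```python
from typing import NamedTuple


class SplicingProducts(NamedTuple):
    ligated_exteins: str  # mature host protein
    free_intein: str      # excised intein


def splice(precursor: str, intein_start: int, intein_end: int) -> SplicingProducts:
    """Excise precursor[intein_start:intein_end] and join the flanking exteins.

    Coordinates are 0-based and end-exclusive (Python slice convention).
    """
    if not 0 < intein_start < intein_end < len(precursor):
        raise ValueError("intein must be internal, with an extein on each side")
    n_extein = precursor[:intein_start]
    intein = precursor[intein_start:intein_end]
    c_extein = precursor[intein_end:]
    return SplicingProducts(n_extein + c_extein, intein)


# Hypothetical toy precursor, not a real protein.
precursor = "mkvle" + "CLSGDTLIHN" + "cfnkst"
products = splice(precursor, 5, 15)
print(products.ligated_exteins)  # mkvlecfnkst
print(products.free_intein)      # CLSGDTLIHN
```

One precursor yields two products, which is the sense in which two proteins are produced from a single polypeptide.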

Sporadic Distribution of Inteins Relative to the Evolutionary History of Their Host
An intein was initially found in the yeast genome, in the vacuolar ATP synthase catalytic subunit A (VMA) (13,14). Since then, inteins have been identified in all three domains of life, but none have been discovered in the nuclear genome of a multicellular organism. Notably, the distribution of inteins in different phyla and even among closely related species is sporadic (15,16). To examine intein distribution, one can survey the intein database, InBase (16), for inteins across the tree of life, including viruses and bacteriophage. To more adequately sample intein occurrence, we also mined inteins from the National Center for Biotechnology Information (NCBI) Gene database (www.ncbi.nlm.nih.gov/gene). Based on the data from the NCBI, we performed a rough assessment of the number of putative intein sequences in sequenced genomes. Additionally, we searched for inteins in Eubacteria and Archaea by filtering the Gene database (Fig. 2; the complete analysis will be published elsewhere). A summary of the intein distribution derived from these databases throughout the tree of life appears in Fig. 2.
InBase is biased relative to the NCBI Gene database because it was generated by submissions from individual investigators. There are phyla that are either absent from or underrepresented in InBase (e.g. Spirochaetes and Firmicutes), and there are significantly more putative inteins available in the NCBI Gene database than were submitted to InBase. However, there are many parallels between the two databases, and for both there is clearly a sporadic pattern in intein distribution. The uneven nature of intein occurrence suggests biased intein acquisition, maintenance, and/or loss (Fig. 2). Because genomes as a whole and inteins in particular are in flux, fully explaining this phenomenon remains a challenge in evolutionary biology. There are several potential interdependent factors contributing to the observed pattern of intein distribution that should be considered. Undoubtedly, the diverse evolutionary histories of the host genomes contributed to the heterogeneity of inteins at all scales (17,18). Changes in genome size and frequent HGT should be weighed against gene-specific intein acquisition or loss resulting in the observed intein distribution. Usually, it is assumed that inteins are not under strong selective pressure, either negative or positive, because they splice out efficiently, producing fully functional host proteins. The only evolutionary pressure that is routinely attributed as acting on the inteins is selection for efficient protein splicing (10,11). However, a number of the genome-reshaping processes might directly affect intein distribution.

Fig. 1. A, protein splicing ... (2). The first residue of the intein and that of the C-extein (EC) are usually a cysteine, serine, or threonine. They are shown here as Cys1 and Cys+1. B, protein trans-splicing by split inteins. The two halves of the split intein come together via a zipper-like interface. Splicing then proceeds as in A. C, an example of conditional protein splicing. A disulfide bridge, shown between cysteine residues at −3 of the N-extein (EN) and at the first residue of the intein (Cys1), prevents protein splicing. In the presence of a reducing reagent, the disulfide bond is broken to release the trapped Cys1, and splicing proceeds. D, HEN-containing inteins are naturally occurring mobile genetic elements. After transcription and translation, the precursor undergoes protein splicing with formation of the ligated exteins and the intein carrying the HEN domain (green). The HEN recognizes and cleaves its cognate intein-less homing site in DNA. The double-strand break is repaired by the cellular double-strand break repair (DSBR) machinery using the intein-containing allele as template, resulting in two intein-containing alleles.
Genome streamlining, the selective elimination of unnecessary DNA to decrease the metabolic burden on DNA replication, was proposed as a mechanism for genome reduction in free-living species with very large successful populations (19-21). In contrast, parasitic and symbiotic organisms undergo significant genome size reduction due to relaxation of positive selection, a bias favoring deletions over insertions, and pseudogenization (22-27). Both genome streamlining and adaptation to a symbiotic/parasitic lifestyle would result in intein loss. HGT (not associated with HEN activity and intein mobility), on the other hand, would aid the dissemination of inteins between distantly related species or even across the domains of life. HGT is common among Archaea and Bacteria and is believed to be the essential mechanism for the attainment of genetic diversity (28). Among the most prominent examples of massive and recurrent HGT is the dissemination of antibiotic resistance genes among bacterial pathogens (29-31).
HGT is also an important mechanism in eukaryotic genome evolution, particularly in unicellular organisms (32-35). More and more candidates for horizontally transferred genes are being identified: from Eubacteria to Eukaryota (33); from Eubacteria to Archaea (36); and among Eukaryota (34). Inteins might hitchhike as fragments of the horizontally transferred genes, gene clusters, genome fragments, or even whole chromosomes.

Distribution of Inteins Is Biased toward Specific Proteins
Although inteins occur in proteins with diverse functions, there is a bias for inteins to insert into proteins involved in DNA metabolism, such as polymerases (Pols), helicases, topoisomerases (TOPOs), and ribonucleotide reductases (RNRs) (Table 1; Fig. 3A) (16). Out of 545 inteins reported in InBase, 266 occur either in polymerases, such as PolA, PolB, PolC, DnaE, and RPB1, or in helicases including replicative helicases, DnaB (bacterial), and MCM/Cdc21 (archaeal) (Table 1). A search of the NCBI Protein database revealed that ~27% of proteins with inteins across all groups corresponded to DNA metabolism proteins, with the highest fraction in Archaea (~50% of all putative intein-containing proteins) and the lowest in Eukaryota (only ~3%; Fig. 3A). The discrepancy between the fractions of intein-containing DNA-related proteins in the two databases is again attributable to different data submission criteria, but in both cases, the preponderance of inteins in DNA metabolism proteins is striking.
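The size of the discrepancy between the two databases follows directly from the counts quoted above. The short calculation below is only an illustrative check of that arithmetic; the variable names are ours, and the InBase count covers polymerases and helicases only, so it is a lower bound on the InBase fraction for DNA metabolism proteins as a whole.

```python
# Counts quoted in the text for InBase.
inbase_total_inteins = 545
inbase_pol_or_helicase = 266

# Approximate fraction quoted in the text for the NCBI Protein database.
ncbi_dna_metabolism_fraction = 0.27

inbase_fraction = inbase_pol_or_helicase / inbase_total_inteins
print(f"InBase, polymerases and helicases: {inbase_fraction:.1%}")
print(f"NCBI, DNA metabolism proteins:     {ncbi_dna_metabolism_fraction:.0%}")
```

By this count, nearly half of the InBase entries fall in polymerases or helicases, compared with roughly a quarter of intein-containing proteins in the NCBI search.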
Several hypotheses have been proposed for the biased insertion of inteins into these proteins. First, because the intein HEN is produced simultaneously with its host protein, a possible advantage for an intein could be to ensure its own presence at times of DNA replication and repair. Because intein homing requires the host replication and repair machinery, it is efficient for the intein to be produced in concert with these replication and repair proteins. A second hypothesis suggests that mobility during replication and repair decreases selection against the intein. Indeed, the result of nonspecific activity of the intein HEN can be efficiently repaired during this period, reducing the risk of intein propagation (54). Finally, Pols, helicases, and TOPOs are frequently found in viral genomes (roughly 18% of the total viral proteome available at NCBI) (37). Viruses are efficient vectors for HGT (55-57) and, thus, they might facilitate spreading of inteins across species boundaries (18,37-39).

Inteins Tend to Be Located at Protein Active Sites
Inteins often occur at sites critical to the function of the proteins, such as catalytic centers and key binding surfaces. Interestingly, a conserved motif containing the Walker A box (58) in a phosphate-binding loop called the P-loop is a hot-spot for intein invasion in some of these proteins. For example, of 33 species/strains with inteins that we found for recombinase RadA/RecA proteins, 12 had insertions in the P-loop, which has been identified in many ATP- and GTP-binding proteins (59) (Fig. 3B). Additionally, P-loop insertions in DnaB and MCM/Cdc21 helicases were found in 47 and 51 species/strains, respectively (Fig. 3B). Although other helicases also had intein insertions in the P-loop, this is not the case for the P-loops of Pols (Table 1) (16). Nevertheless, in line with inteins localizing to the conserved domains that are functionally important, inteins are inserted into either catalytic or ligand-binding sites of RNR, archaeal mini-chromosome maintenance protein (MCM) helicases and archaeal replication factor C (RFC), archaeal/eukaryotic VMA, and eukaryotic RNA polymerases (11,60,61).
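To make the notion of a P-loop insertion concrete, the sketch below locates a Walker A box with the commonly cited consensus GxxxxGK[S/T] and asks whether an intein insertion point falls inside it. This is an illustration under stated assumptions, not the procedure used for the survey in Fig. 3B: the function names, the toy sequence, and the convention that an insertion position p lies between residues p-1 and p (0-based) are all hypothetical.

```python
import re

# Commonly cited Walker A (P-loop) consensus: G-x(4)-G-K-[ST].
WALKER_A = re.compile(r"G.{4}GK[ST]")


def p_loop_spans(protein: str) -> list[tuple[int, int]]:
    """Return (start, end) spans, 0-based and end-exclusive, of Walker A matches."""
    return [m.span() for m in WALKER_A.finditer(protein)]


def insertion_in_p_loop(protein: str, insertion_pos: int) -> bool:
    """True if the insertion point lies between two residues of a Walker A match.

    insertion_pos = p means the intein sits between residues p-1 and p
    of the intein-less host protein.
    """
    return any(start < insertion_pos < end
               for start, end in p_loop_spans(protein))


# Hypothetical intein-less host fragment containing a P-loop (GPESSGKT).
host = "MAIDLVEIYGPESSGKTTLALQ"
print(p_loop_spans(host))             # [(9, 17)]
print(insertion_in_p_loop(host, 15))  # True: between G and K of the motif
print(insertion_in_p_loop(host, 3))   # False: outside the motif
```

In practice the insertion coordinates would come from an alignment of intein-containing and intein-less homologs; the regular expression stands in for that step only for the purpose of illustration.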
The rationale behind the localization of inteins to the most critical domains of proteins is still a matter of debate, and might reflect differential targeting, maintenance, or loss. First, targeting of conserved sites might be explained by the specificity of intein HENs. Conserved amino acid residues within functionally important protein motifs limit the range of nucleotide substitutions that can be tolerated at such sites. Thus, the chance to lose the site recognized by the HEN is lower at active centers in comparison with other regions in the protein. Second, purifying selection, which plays an important role in maintaining the long-term stability of biological systems by removing deleterious mutations, will preserve the functionally important motifs, facilitating intein invasion into that site by HGT across a wide range of species. Third, inteins could insert themselves into diverse sites throughout a genome, but only insertions at specific sites might become fixed in the population. The retention of inteins in conserved protein motifs would likely be due to the low rate of intein loss through excision. The removal of the intein, if not precise, would disrupt protein function and therefore would be deleterious (11,60). If the intein were precisely excised from the genome, this new viable intein-less variant would likely be reinvaded if the intein contained an active HEN as a result of the homing cycle (12). A similar argument has been posited for retention of self-splicing introns in functionally important motifs (62,63).
Finally, the presence of inteins in particular conserved motifs might be explained by an adaptive role of inteins. Mobile elements in general have been shown to evolve diverse roles (64-66). The vast knowledge that has accumulated about genome organization and gene expression during the last two decades has led to a paradigm shift from seeing mobile elements as solely parasitic entities to understanding their dynamic role in the evolution of species (66). Despite their importance in biotechnology, inteins remain among the least studied mobile elements in terms of their possible function(s) (67). Nevertheless, the data available, especially on conditional protein splicing, suggest that inteins might be involved in regulating the function of the host protein.

Conditional Protein Splicing and Splicing Regulation
Both native and artificially designed inteins can undergo conditional protein splicing (CPS). CPS depends on the presence of a particular trigger, such as a change in redox state, temperature, small molecules, or light, as has been recently reviewed (7,8). The existence of stimulus-dependent inteins suggests the possibility that some inteins may adapt to their intracellular niche by becoming post-translational regulatory elements that modulate protein splicing in accord with environmental conditions. An example of such regulated protein splicing is provided by the formation of a disulfide bond by cysteine residues involved in the splicing mechanism, thereby trapping the intein inside a precursor protein depending on the redox state. Several inteins were artificially designed in which disulfide bond formation controls premature cleavage or splicing reactions (68-70) (Fig. 1C). One of these designs inspired a search for physiologically relevant regulation of protein splicing by a disulfide bond. After determining by mutational analysis that a cysteine at the −3 position of the upstream extein can trap the catalytic cysteine (Cys1) at the beginning of the cyanobacterial Ssp DnaE intein (68,71), data mining revealed several native inteins with cysteine at the −3 position. Examples include the MoaA precursor protein from the thermophilic archaeon Pyrococcus abyssi, a radical S-adenosylmethionine domain protein of the sulfate-reducing archaeon Archaeoglobus profundus, and the pyruvate formate-lyase-activating enzyme (PFL-AE) from an uncultured archaeon GZfos13E1 (68).
Strikingly, all three of the above-mentioned proteins are predicted to catalyze redox chemistry, and each of the three inteins resides in a conserved CysXXXCysXXCys motif (where X is any amino acid residue). In all cases, the last Cys of the CysXXXCysXXCys motif is Cys1 of the intein, whereas the central Cys corresponds to the Cys−3 that forms a disulfide with Cys1 of the MoaA intein. Although disulfide bond formation and its regulatory role in the radical S-adenosylmethionine domain protein and pyruvate formate-lyase-activating enzyme remain to be elucidated, characterization of the MoaA intein in Escherichia coli revealed that the Cys−3-to-Cys1 disulfide bond can control intein splicing activity depending on the redox state of the host organism (68).
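The positional relationship described above (Cys1 as the last Cys of CysXXXCysXXCys, with the central Cys at extein position −3) can be expressed as a simple sequence check. The following is a hedged sketch of such a screen, not the data-mining procedure of Ref. 68; the function names and the toy precursor are hypothetical, and `intein_start` is assumed to be the 0-based index of the first intein residue in the precursor.

```python
import re

# CysXXXCysXXCys written as a regular expression over one-letter codes.
CXXXCXXC = re.compile(r"C.{3}C.{2}C")


def has_cys_minus3(precursor: str, intein_start: int) -> bool:
    """True if both Cys1 of the intein and extein residue -3 are cysteines.

    Extein position -1 is the residue immediately before the intein,
    so position -3 is precursor[intein_start - 3].
    """
    return (3 <= intein_start < len(precursor)
            and precursor[intein_start] == "C"
            and precursor[intein_start - 3] == "C")


def motif_ends_at_cys1(precursor: str, intein_start: int) -> bool:
    """True if a CysXXXCysXXCys motif ends exactly on Cys1 of the intein."""
    if not 7 <= intein_start < len(precursor):
        return False
    window = precursor[intein_start - 7:intein_start + 1]
    return CXXXCXXC.fullmatch(window) is not None


# Hypothetical toy precursor: N-extein + intein + C-extein (not a real protein).
n_extein, intein, c_extein = "MSLNCAGRCTF", "CLAGDTRVLN", "SGKTE"
precursor = n_extein + intein + c_extein
start = len(n_extein)

print(has_cys_minus3(precursor, start))      # True
print(motif_ends_at_cys1(precursor, start))  # True
```

A hit from such a screen only flags a candidate; whether the two cysteines actually form a regulatory disulfide has to be established experimentally, as was done for the MoaA intein.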
There are other naturally occurring examples of redox-sensitive inteins. Another P. abyssi intein, the PolII intein, can form a disulfide bond between Cys1 and the downstream extein Cys+1 that prevents splicing (72). A second PolII intein, the Mma PolII intein from the methanogenic archaeon Methanoculleus marisnigri, can form an internal disulfide bond that modulates splicing activity of the intein depending on the redox state of E. coli or localization to the periplasm or cytoplasm (73).
Protein splicing can also be modulated by temperature. This regulation may have physiological relevance, as it was demonstrated that the activity of various inteins from extreme thermophiles depends on temperature (74-81). In addition to native temperature-dependent inteins, several temperature-sensitive inteins were developed to regulate gene expression using the yeast Sce VMA intein (82,83). Additionally, some inteins have a modest pH dependence with a preference for low pH, which can be amplified by mutation (84). Although the natural importance of low pH preference for splicing is not yet known, it should be noted that the modulation of the Mtu RecA intein splicing by pH occurs precisely over the range of internal pH (pH 6-8) maintained by Mycobacterium smegmatis and Mycobacterium bovis BCG exposed to high acidity (85).
However, other inteins have been engineered to splice conditionally, and these are worth brief consideration because they may foreshadow the discovery of a similar innate control of splicing. For example, the activity of a trans-splicing intein (Fig. 1B) has been manipulated to depend on the presence of the small molecule rapamycin, which induces intein reassociation and splicing (86,87). Another way to control the intein relies on the substitution of the HEN domain by a receptor and utilization of a receptor-ligand binding event to trigger splicing. Examples include estrogen (88-90) and thyroid hormone (89,91) with their respective receptors, and peroxisome proliferator-activated receptor (92). Although the development of these inteins involves a directed evolution step to tune the intein response to the ligand, the regulation of intein splicing implies some cross-talk between the catalytic center of the intein and extraneous peptide fusions to the intein.
The existence of both native and synthetic CPS anticipates more examples of regulated protein splicing in vivo with the appearance of a functional protein dependent on a specific environmental stimulus. The naturally occurring inteins are thought to have undergone evolution from invasive parasites to persistent mutualists that provide adaptive post-translational regulatory advantage to the host for survival under specific cellular and environmental circumstances. We thus anticipate more examples of CPS of inteins that respond to varied environmental cues in their natural host.

Conclusions
Intein occurrence is sporadic among related species. Inteins frequently reside in proteins involved in DNA metabolism, particularly in their active centers. Consideration of the innate invasiveness of inteins, the evolutionary history of intein-containing species, and the nature of their protein hosts is necessary for a deeper understanding of intein dynamics and distribution. Although the loss of inteins is likely to be the consequence of genome size reduction in many species, the recurrent invasion of inteins is possible largely due to their intrinsic features as mobile elements and their ability to spread vertically and horizontally among species. An understanding of the evolution and distribution of inteins will shed light on the possible functionality of inteins as unique regulatory elements. Moreover, further studies of inteins could benefit if these genetic elements were viewed as part of a complex system involving the nature of the split protein, the intein insertion site, the host species, and its environment.