Distant structural homology leads to the functional characterization of an archaeal PIN domain as an exonuclease.

Genome sequencing projects have focused attention on the problem of discovering the functions of protein domains that are widely distributed throughout living species but which are, as yet, largely uncharacterized. One such example is the PIN domain, found in eukaryotes, bacteria, and Archaea, and with suggested roles in signaling, RNase editing, and/or nucleotide binding. The first reported crystal structure of a PIN domain (open reading frame PAE2754, derived from the crenarchaeon, Pyrobaculum aerophilum) has been determined to 2.5 Å resolution and is presented here. Mapping conserved residues from a multiple sequence alignment onto the structure identifies a putative active site. The discovery of distant structural homology with several exonucleases, including T4 phage RNase H and flap endonuclease (FEN1), further suggests a likely function for PIN domains as Mg2+-dependent exonucleases, a hypothesis that we have confirmed in vitro. The tetrameric structure of PAE2754, with the active sites inside a tunnel, suggests a mechanism for selective cleavage of single-stranded overhangs or flap structures. These results indicate likely DNA or RNA editing roles for prokaryotic PIN domains, which are strikingly numerous in thermophiles, and in organisms such as Mycobacterium tuberculosis. They also support previous hypotheses that eukaryotic PIN domains participate in RNAi and nonsense-mediated RNA degradation.

The explosive growth of whole genome sequencing efforts, and the discovery that a large proportion of the assumed gene products are of unknown or poorly understood function, have focused attention on new approaches to assigning function. In the absence of sufficient sequence similarity to clearly infer homology with already characterized proteins, a variety of bioinformatic approaches have been used to obtain functional clues. These include, for example, analyses of genome location (seeking potential operons), phylogenetic profiling, and observations of gene fusions in different species (1,2). An alternative, complementary approach is to use analyses of protein three-dimensional structure to derive functional insights, because three-dimensional structure is conserved in evolution much more strongly than sequence. This provides a rationale for a number of structural genomics initiatives (3-6).
As part of a pilot structural genomics project aimed at the discovery of biological function, we have focused on gene products from the hyperthermophilic crenarchaeon Pyrobaculum aerophilum, an organism whose complete genome sequence was published recently (7). A whole-genome comparison of P. aerophilum and Mycobacterium tuberculosis, two organisms with very different, and in a sense extreme, lifestyles, led us to identify a set of 250 pairs of orthologous genes that are both widely distributed in nature and are shared by these two organisms. Among these were a set of four genes from P. aerophilum (PAE0151, PAE0285, PAE0337, and PAE2754) and four from M. tuberculosis (Rv0065, Rv0549, Rv0960, and Rv1720) that have since been clustered at NCBI as part of COG4113, with members drawn from Archaea, cyanobacteria, actinobacteria and α-proteobacteria (Table I). These are now annotated in Pfam as PIN domains.
PIN domains, named for their homology with the N-terminal domain of the pili biogenesis protein (PIN, PilT N terminus) (8), comprise a very large family of proteins with representatives in all three kingdoms of life. They are classified in the Pfam data base (www.sanger.ac.uk/Software/Pfam) as PF01850, currently with more than 340 members. Functional annotation of the PIN domains is equivocal. They were initially thought to function in signaling (9), but a recent bioinformatic analysis of nearly 100 PIN domain sequences identified a set of five conserved acidic residues and a sixth conserved position where there is either a serine or threonine residue. This has led to a suggested exonuclease function (10). In eukaryotes, for example, the Caenorhabditis elegans PIN domain, smg-5, and the yeast PIN domain, NMD4p, are postulated to be ribonucleases that bind to helicases as part of the machinery for RNA degradation via the RNAi and nonsense-mediated RNA degradation (NMD) pathways (10).
In Archaea and thermophilic bacteria, PIN domains have been associated with a possible role in DNA repair. A recent analysis of conserved gene context across fully sequenced prokaryotic genomes revealed, in most Archaea and some thermophilic bacteria, a previously unrecognized cluster of genes containing DNA polymerases, helicases, nucleases, and many conserved hypothetical open reading frames, one of which is a PIN domain clustering in COG1848 (11). This suggested a new DNA repair system in these organisms. DNA repair, and particularly mismatch repair, in thermophiles is a vexing question. The absence of key mismatch repair enzymes such as MutS and MutL, which are highly conserved in mesophiles from Escherichia coli to humans, has been suggested to result in "mutator" lifestyles for some thermophiles (7), in which adaptive mutations enable the organism to adapt to stress or extreme environments (12). Also absent from many of the fully sequenced thermophilic genomes are several well conserved nucleotide excision repair enzymes. The discovery of a new DNA repair operon in Archaea addresses this question and by implicating PIN domains as part of this operon adds another piece of functional evidence for this large protein family.
Here we present the first crystal structure of a PIN domain, from the crenarchaeon P. aerophilum. The protein, the gene product of open reading frame PAE2754, proves to be a distant structural homologue of T4 RNase H and other exonucleases, despite insignificant sequence identity. Strict conservation of the active site residues suggests that this PIN domain is indeed an exonuclease. We have confirmed this functional hypothesis in vitro. This has important implications both for archaeal DNA editing and for the role of PIN domains in eukaryotic RNA editing. It is also an illustration of the power of structural genomics whereby deep phylogenetic lineages are apparent at the structural level and lead directly to functional characterization of proteins of previously unknown function or with equivocal and/or general functional annotation.

EXPERIMENTAL PROCEDURES
Protein Expression, Purification, and Crystallization-The predicted open reading frame PAE2754 was amplified from genomic DNA by PCR, subcloned into the expression vector pPROeX (Invitrogen), transformed into E. coli BL21(DE3) cells, and expressed as an N-terminal His6-tagged protein. Purification involved a heat step, in which incubation for 40 min at 80°C denatured a large fraction of the E. coli proteins, followed by Ni2+ affinity chromatography and size exclusion chromatography, as described (13).
Two site-specific mutations, L65M and L80M, were designed and introduced to facilitate structure determination by multiwavelength anomalous diffraction (MAD) methods using the selenomethionine (SeMet)-substituted protein. The single mutants were individually made and tested for expression and crystallization, followed by the double mutant L65M/L80M. Mutagenesis was performed with the QuikChange site-directed mutagenesis kit (Stratagene). The double mutant was stable at 80°C, suggesting that the structure was not significantly destabilized by these mutations. The plasmid encoding the double mutant was transformed into the methionine auxotroph E. coli strain DL41(DE3) and grown in LeMaster medium with SeMet as the only methionine source. The SeMet-substituted double mutant protein (SeMet-PAE2754_MM) was then purified as above.
Both native PAE2754 and SeMet-PAE2754_MM were crystallized as described (13) and flash-cooled for data collection by soaking in cryoprotectant (mother liquor plus 10% glycerol) immediately prior to placement in a stream of cold N2 gas at 110 K.
Structure Determination and Refinement-Native PAE2754 x-ray diffraction data to 2.5-Å resolution were collected at the National Synchrotron Light Source, Brookhaven, on beamline X8C (λ = 1.0000 Å). MAD data at two wavelengths were collected for SeMet-PAE2754_MM at the Stanford Synchrotron Radiation Laboratory, beamline 9-1. The data were processed using DENZO and SCALEPACK (14). Data collection and refinement statistics are given in Table II. The structure of PAE2754_MM was determined by single-wavelength anomalous diffraction using a single SeMet data set at λ = 0.9794 Å. This is not where the anomalous differences are maximized for selenium, but this was necessary because the "remote" wavelength data set proved to be of poor quality due to crystal decay (data were initially collected at two wavelengths in accordance with Ref. 15). Using SOLVE (16,17), a total of 17 out of a possible 24 selenium sites (from the 8 monomers in the crystal asymmetric unit) were located, based on anomalous differences. These gave initial phases to 2.8 Å with a figure of merit of 0.18 and a Z-score of 20.6. The phases were improved using maximum likelihood density modification via RESOLVE (16). Successful improvement of the phases was dependent on user-defined non-crystallographic symmetry elements based on the initial selenium positions.
Much of the core structure (~74 residues per monomer) was built automatically with RESOLVE (16) and TEXTAL (18,19), and the rest of the structure was built manually, during 12 cycles of model building with O (20) and refinement with CNS (21). Due to the relatively low resolution of the data, non-crystallographic symmetry was included as a restraint in all refinement cycles, and weighted experimental phases from RESOLVE were also included at all stages. The final PAE2754_MM structure was then used as a molecular replacement model for the native data, and this structure was completed using 5 cycles of model building and refinement. Refinement statistics for both models are given in Table II.
In Vitro Exonuclease Assays-An 18-nt primer (5′-CGCGCCGTTGCTATCTCC-3′) was annealed to a 54-nt primer (5′-ATTGAGAAATTCACGGCGNNKATANNKNNKGTTNNKGGAGATAGCAACGGCGCG-3′; where N = T, G, A, C; and K = T, G) to form double-stranded DNA with a 36-nt, 5′-3′ single-stranded, randomized overhang. 200 pM DNA was then mixed with 200 pM PAE2754 in 20 mM NaCl and 10 mM MgCl2 (MgCl2 was omitted from the negative control) and incubated at 37°C. The reaction was stopped at different time points by the addition of formamide gel loading buffer (80% v/v formamide, 10 mM EDTA), followed by freezing at −20°C. Samples were run on a 20% polyacrylamide-urea denaturing minigel and visualized using ethidium bromide staining. Samples were also prepared in the same way using MnCl2 as the metal ion source.
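The geometry of the assay substrate can be checked directly from the two oligonucleotide sequences given above. The following is a minimal sketch, not part of the original protocol: the function names are illustrative, and it assumes simple Watson-Crick pairing of the short primer to the 3′ end of the long oligonucleotide.

```python
# Sequences as printed in the assay description (N = A/C/G/T, K = G/T).
PRIMER = "CGCGCCGTTGCTATCTCC"
TEMPLATE = "ATTGAGAAATTCACGGCGNNKATANNKNNKGTTNNKGGAGATAGCAACGGCGCG"

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
# Number of bases each symbol can stand for (illustrative helper table).
DEGENERACY = {"A": 1, "C": 1, "G": 1, "T": 1, "K": 2, "N": 4}


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def overhang_length(primer, template):
    """Length of the single-stranded 5' overhang left on the template
    when the primer pairs with the template's 3' end."""
    site = reverse_complement(primer)
    if not template.endswith(site):
        raise ValueError("primer does not pair with the 3' end of the template")
    return len(template) - len(site)


def n_variants(seq):
    """Number of distinct sequences encoded by a degenerate oligonucleotide."""
    total = 1
    for base in seq:
        total *= DEGENERACY[base]
    return total


print(len(PRIMER), len(TEMPLATE), overhang_length(PRIMER, TEMPLATE))
print(n_variants(TEMPLATE))
```

Under these assumptions the primer pairs with the last 18 positions of the longer oligonucleotide, leaving the 36-nucleotide overhang described in the text, and the four NNK codons make the overhang a mixture of 32^4 sequences.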

RESULTS
Crystal Structure of PAE2754-The crystal structure of the protein encoded by the open reading frame PAE2754 from P. aerophilum was solved at 2.8-Å resolution, as a selenomethionine (SeMet)-labeled double mutant, incorporating two Leu → Met mutations, that was constructed to facilitate phasing by MAD methods. This derivative structure was then refined to give a final R-factor of 0.226 (Rfree = 0.279). The native structure was then solved by molecular replacement and refined at 2.5 Å resolution to a final R-factor of 0.250 (Rfree = 0.305). The resulting model has good stereochemistry (Table II) with 92% of residues in the most favored region of the Ramachandran plot.
The PAE2754 monomer forms a single domain in which the 133-residue polypeptide is folded as an α/β/α stack, with a central twisted parallel β-sheet of five short strands (Fig. 1). The strand order is 32145, and the twist of the sheet is such that the outer strands, which involve just three residues each, are oriented at 160° with respect to each other. Between strand 2 and strand 3, helices α2 and α3 pack in an antiparallel manner to form a long protrusion that extends orthogonally from the α/β/α stack. Hydrophobic cores above and below the central β-sheet stabilize the α/β/α stack, and a third hydrophobic mini-core is formed below the stack by the orthogonal packing of helices α4 and α5. These hydrophobic cores are highly populated by Ala, Leu, and Val residues, which compose no less than 40% of the PAE2754 sequence. Dynamic light scattering measurements show that PAE2754 forms a tetramer in solution, and accordingly the 12 monomers in the crystal asymmetric unit are organized as three tetramers (eight monomers in the asymmetric unit are organized as two tetramers for PAE2754_MM). The tetramer is best described as a dimer of dimers (Fig. 2). An extensive dimer interface is formed by the 2-fold related packing of a nearly continuous region of sequence between residues 32 and 94, spanning helices α2, α3, and α4. This interface buries 1440 Å² of surface area (19% of the total monomer surface) and is dominated by hydrophobic interactions, marked by a striking interdigitation of many large hydrophobic side chains. There are just six hydrogen bonds between the two protein chains, all centered around the stacked histidine aromatic rings at the center of the interface.
The tetramer is formed by the association of two dimers via a relatively small interface, in which the C terminus of helix α2 from one monomer contacts the N terminus of helix α6 from another. Stabilizing interactions at this interface are modest and include a salt bridge between Arg-48 from one monomer and Asp-110 and Glu-112 from another. The backbone carbonyl group from Arg-48 also hydrogen-bonds across the interface to the amide group of Arg-111. Although only 450 Å² of surface area per monomer is buried on formation of this interface, the cooperative association of two dimers buries in total 4 × 450 = 1800 Å². There is also a diagonally stabilizing interaction across the tetramer whereby a chloride ion linearly coordinates two arginine residues. The chloride lies at the center of a sphere of charged side chains that include Arg-48 (chain A), Glu-112 and Lys-116 (chain B), Glu-112 and Lys-116 (chain C), and Arg-48 (chain D).
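The interface figures quoted above can be related to one another with a few lines of arithmetic. This is an illustrative sketch only: it assumes that the 1440 Å² dimer-interface value is a per-monomer quantity (as the "19% of the total monomer surface" phrasing suggests), and the derived monomer surface area is an inference from those two numbers, not a value reported in the text.

```python
# Values quoted in the text (areas in square angstroms).
DIMER_BURIED_PER_MONOMER = 1440.0     # buried at the dimer interface
DIMER_BURIED_FRACTION = 0.19          # stated fraction of the monomer surface
TETRAMER_BURIED_PER_MONOMER = 450.0   # buried at the dimer-dimer interface
MONOMERS_PER_TETRAMER = 4

# Total area buried when two dimers associate into the tetramer.
tetramer_total = MONOMERS_PER_TETRAMER * TETRAMER_BURIED_PER_MONOMER

# Monomer surface implied by "1440 A^2 is 19% of the monomer surface"
# (an inferred value, assuming the 1440 A^2 figure is per monomer).
implied_monomer_surface = DIMER_BURIED_PER_MONOMER / DIMER_BURIED_FRACTION

# Fraction of that surface buried per monomer at the dimer-dimer interface.
tetramer_fraction = TETRAMER_BURIED_PER_MONOMER / implied_monomer_surface

print(f"total buried on tetramer formation: {tetramer_total:.0f} A^2")
print(f"implied monomer surface: {implied_monomer_surface:.0f} A^2")
print(f"dimer-dimer interface per monomer: {100 * tetramer_fraction:.1f}% of surface")
```

On these assumptions the dimer-dimer contact accounts for roughly a third of the area buried at the dimer interface, consistent with the description of the tetramer as a dimer of dimers held by a comparatively small second interface.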
FIG. 1. Structure and topology of the PAE2754 monomer. A stereo ribbon diagram showing the organization of secondary structure elements for PAE2754. α-Helices are labeled sequentially (with the exception of helix 6 whose label is omitted for clarity), and the N and C termini are also labeled. The central, twisted, parallel β-sheet is shown in green, with the strand order 32145 proceeding perpendicular to the page from front to back. This figure along with Figs. 2 and 4 were drawn using PyMol (38).
Residues that are conserved across COG4113 (Figs. 2 and 3) are clustered in a pocket formed at the C-terminal end of the β-sheet and the N termini of helices α2 and α6. This arrangement brings together four conserved acidic residues (Asp-8, Glu-38, Asp-92, and Asp-110 in PAE2754) that point into the pocket and create a highly negatively charged hole. Two other conserved residues on either side of Asp-110, Thr-108, and Leu-112, also flank the acidic pocket with Thr-108 being hydrogen-bonded to Asp-8. In the PAE2754 dimer (Fig. 2), the two acidic pockets are ~20 Å apart and are separated by an intriguing structure formed by the one remaining fully conserved residue, Tyr-91; the two Tyr-91 side chains lie adjacent at the dimer interface, with their aromatic rings parallel and 6 Å apart. Upon formation of the tetramer, the four active site pockets lie in the interior of a tunnel with restricted access via two openings on opposite sides of the tetramer (Fig. 2). Adjacent lysine and glutamic acid residues (Lys-45 and Glu-46) from each monomer flank the entrances to the tunnel.
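The idea of picking out conserved positions from a multiple sequence alignment, before mapping them onto a structure, can be sketched as follows. The three aligned sequences are invented toy data, not the COG4113 alignment, and the function name and the `min_fraction` threshold are assumptions made for illustration.

```python
from collections import Counter


def conserved_columns(alignment, min_fraction=1.0):
    """Return (1-based column, residue) pairs where the most common
    non-gap residue occurs in at least min_fraction of the sequences."""
    n_seqs = len(alignment)
    hits = []
    for position, column in enumerate(zip(*alignment), start=1):
        counts = Counter(residue for residue in column if residue != "-")
        if not counts:
            continue  # column contains only gaps
        residue, count = counts.most_common(1)[0]
        if count / n_seqs >= min_fraction:
            hits.append((position, residue))
    return hits


# Hypothetical toy alignment (equal-length rows, '-' marks a gap).
toy_alignment = [
    "MD-LKTAD",
    "MDALRTSD",
    "VDGLKTGD",
]

print(conserved_columns(toy_alignment))
```

In practice the column numbers would then be translated to residue numbers of the protein of known structure so that the conserved side chains can be located in three dimensions.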
Structural Comparisons-A DALI search (22) using the monomer structure as a search model gave no significant matches to structures in the Protein Data Bank. The best structural match was to a porcine D-amino acid oxidase (DAO) (23) with a Z-score of 3.3, a root mean square difference (r.m.s.d.) in atomic positions of 3.9 Å for 97 Cα positions, and 8% sequence identity. The topological match between PAE2754 and DAO in the overlaid region was relatively good, with matches for all the major elements of secondary structure in PAE2754 except α4, although DAO does have two large insertions of 90 amino acids (between α2 and α3) and 110 amino acids (between β4 and β5). DAO has an FAD cofactor whose nucleotide component binds into the hole where the active site is hypothesized for PAE2754. This appeared to support the PIN domain annotation of "possible nucleotide-binding protein" from the major data bases and was also consistent with the RNase hypothesis of Clissold and Ponting (10). A second match to the ADP binding domain of trimethylamine dehydrogenase (24), where the nucleotide was similarly orientated, added weight to this hypothesis.
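The r.m.s.d. values quoted for these comparisons summarize how far apart equivalent Cα atoms lie after superposition. The sketch below shows only that final calculation, assuming the two coordinate lists are already paired and superposed; it does not reproduce the alignment and superposition that DALI performs, and the coordinates are invented toy values.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square difference between paired, superposed (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Hypothetical toy coordinates (in angstroms): each pair is 5 A apart.
toy_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
toy_b = [(3.0, 4.0, 0.0), (4.0, 4.0, 0.0)]

print(rmsd(toy_a, toy_b))
```

Because the squared deviations are averaged over the matched positions only, an r.m.s.d. is always reported together with the number of residues compared, as in the values above.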
Perhaps more significant, however, was the presence in the top 10 DALI structural matches of the T4 RNase H structure (25). This also has low DALI scores (Z = 2.8, r.m.s.d. = 3.6 Å over 84 amino acids and 10% sequence identity), and the topological matches between PAE2754 and T4 RNase H were significantly poorer than for DAO, with no matches for α1, α2, α4, or α7 of PAE2754. What was striking, however, was the observation that residues that are conserved across COG4113, including the acidic residues at the putative active site, aligned structurally with similar residues in T4 RNase H that are involved in Mg2+ binding and catalysis (Fig. 4). Furthermore, these residues are also conserved across a large family of related prokaryotic exonucleases. This led us to test Mg2+ binding and DNase activity in vitro. Thus, although the fold of PAE2754 is most closely related to domains that bind ADP or FAD, suggesting nucleotide binding as part of its core function, sequence conservation at the active site predicts that PAE2754 (along with the many other PIN domains) belongs to the T4 RNase H family of exonucleases (26). Furthermore, once the superposition of T4 RNase H on to the PAE2754 structure could be established, it became clear that other members of the exonuclease family, including the exonuclease domain of Taq DNA polymerase (27,28) and the Flap endonuclease-1 (FEN-1) (29,30), which has both endo- and exonuclease activities, are also structurally related to PAE2754 and, by extension, to other PIN domains.
In Vitro Tests for Exonuclease Activity-Exonuclease assays were carried out using synthetic DNA primers designed to give a long 5′-3′ single-stranded overhang. A time course incubation clearly shows PAE2754 has a Mg2+-dependent exonuclease activity (see Fig. 5). This experiment was repeated with Mn2+ in place of Mg2+ with equivalent results (data not shown), but no activity was seen in the absence of a suitable divalent cation. These initial tests show that the cleavage of single-stranded DNA by PAE2754 is slow and requires equimolar amounts of DNA, Mg2+ or Mn2+, and protein to provide catalysis. The sluggish reaction may be the result of the non-optimal substrate and/or the non-optimal temperature of the assay; we presume that the optimal temperature for this enzyme is 95-100°C. Assays to determine substrate specificity and the optimal temperature are the subject of ongoing work.
FIG. 2. Oligomeric state for PAE2754 showing conserved residues. A, view of the dimer showing the residues that are conserved across COG4113 (also shown in the alignment in Fig. 3). For chain A, three of the residues are labeled with their one-letter amino acid code. B, surface depiction (in the same orientation as A) of the PAE2754 dimer showing electrostatic charges at the surface using GRASP. C, orthogonal views of the tetramer surface showing the tunnel inside which the putative active site resides. This figure was drawn using PyMol (38) and GRASP (39).
Putative Active Site-The four conserved acidic residues in each monomer, Asp-8, Glu-38, Asp-92, and Asp-110, are clustered together in a surface pocket, facing into the tunnel through the center of the tetramer. Two of these residues, Asp-8 and Asp-110, are at the N termini of helices α1 and α6, respectively, and their carboxylate groups are fixed in place by typical helix N-cap hydrogen bonds (with Ala-11 NH and Tyr-113 NH). This is reminiscent of the first Mg2+ site in the Pyrococcus furiosus FEN-1 endo/exonuclease (29), where both the Asp residues that directly coordinate the Mg2+ ion are fixed at helix N termini. It seems likely, by analogy, that the Mg2+ site in PAE2754 may be similarly pre-organized, with Asp-8 and Asp-110 directly coordinating the metal. Asp-92 could also coordinate a metal ion bound in this way, either directly or indirectly via a water molecule, but Glu-38 is more remote (6 Å away). If two Mg2+ ions are bound, as is the case in T4 ribonuclease H, the flap exonucleases, and many other exo- and endonucleases and polymerases, it is likely that Glu-38 would participate in binding the second Mg2+ ion. If this is so, the distance between the two Mg2+ ions will be somewhere between the ~4 Å seen in the Klenow fragment of DNA polymerase (31) and the ~8 Å seen in the T5 5′-exonuclease (32).
The invariant threonine residue, Thr-108, is a candidate for involvement in catalysis but is fairly well buried, hydrogen-bonded to Asp-8 Oδ2 and Asp-110 NH, and it is difficult to see how it can play any direct role. Two other hydroxyl-containing residues, Ser-10 and Thr-89, which are almost fully conserved, are also adjacent to the metal site where they are fully exposed in the central tunnel and could play a role in catalysis or binding. Two other features of the active site region seem likely to be important. First, the pairwise, parallel stacking (Fig. 2) of the aromatic rings of the conserved Tyr-91 residues on the inner surface of the central tunnel suggests that these residues could participate in substrate binding by stacking on either side of a nucleotide base. Aromatic residues have been found previously (33,34) to perform such a function in, for example, single-stranded RNA and DNA binding domains. Second, side chains of Lys-45 from each monomer project into the tunnel and could be involved in binding to nucleic acid phosphate groups. This residue is almost fully conserved as Lys or Arg.
The active site is only accessible from inside the tunnel through the tetramer, implying that nucleic acid substrates must thread through it. The diameter of the opening to the tunnel is an oval with dimensions of ~10 × 14 Å, too small for double-stranded DNA or RNA, but consistent with the protein cleaving overhanging single-stranded nucleic acids or flap structures, as is the case for flap endonucleases.

DISCUSSION
PIN domains, of which PAE2754 is the first to be structurally characterized, appear to be highly abundant and to have a deep lineage through all three kingdoms of life. Thus, Clissold and Ponting (10) identified 95 PIN domain sequences from a 90% non-redundant data base using HMM sequence searching methods, Makarova and Koonin (35) find 122 PIN sequences, from all three kingdoms, and Pfam annotates 345 proteins as containing a PIN domain. No fewer than 10 related COGs bear the annotation "predicted nucleic acid-binding protein, contains PIN domain." It is a measure of the increasing power of bioinformatics approaches that Clissold and Ponting (10) were able to cluster the PIN domains with exonucleases and hypothesize their role in RNAi and nonsense-mediated RNA degradation when just 5 of ~135 sequence positions are conserved across the domain.
FIG. 3. Sequence and structural alignments for PIN domain proteins. A, selection of protein sequences from COG4113 aligned using the program ClustalX (40). Seven positions are highlighted in red identifying the near-strict conservation at these positions across this cluster. Sequences are named according to their genome and open reading frame numbers in accordance with Table I. B, sequences from five structures aligned based on structural homology using DALI. Secondary structure elements for PAE2754 are shown above this alignment, and the five conserved positions are highlighted in red. Numbers in parentheses within the sequences indicate insertions, the number of residues in a loop in each structure that does not structurally align among the group. Numbers in parentheses at the C termini of the sequences indicate the number of amino acids in the full-length proteins.
Our structure of PAE2754 places these conserved residues close together in three-dimensional space. Together with our experimental evidence for nuclease activity, and the demonstrated structural homology with known exonucleases such as T4 ribonuclease H and the flap exonucleases, it further provides compelling evidence that PIN domains do indeed play a role in DNA and/or RNA editing processes through Mg2+-dependent exonuclease activity.
Multiple sequence alignments of PIN domains based on COGs or Pfam show the following three principal features: (i) three conserved aspartic acid residues, one near the N terminus of the protein and two clustered near the C terminus; (ii) a conserved threonine or serine adjacent to the last conserved aspartic acid, forming a (T/S)XD motif in which the threonine or serine possibly plays a catalytic role; (iii) a well conserved acidic residue (either Asp or Glu) in the center of the sequence. The acidic residues are clustered in such a way that they could support the coordination of either one or two Mg2+ ions. If two Mg2+ ions are bound, as is commonly the case in polymerases and nucleases (25), and has been proposed as essential for catalysis (31), it is probable that one will be bound to Asp-8, Asp-92, and Asp-110, which are in close proximity, and the second to Glu-38. The former site may be of higher affinity given that the side chains of Asp-8 and Asp-110 are fixed at the N termini of α-helices. On the other hand, a direct and essential role for Thr-108 in catalysis seems less likely, given its relatively buried location.
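A sequence pattern such as the (T/S)XD motif described above can be located with a simple regular expression. The sketch below is illustrative only: the test sequence is invented, not a real PIN domain, and the function name is an assumption. A lookahead is used so that overlapping occurrences are all reported.

```python
import re

# (T/S)XD: Thr or Ser, any residue, then Asp. The lookahead lets
# overlapping matches be found as well.
TSXD_PATTERN = re.compile(r"(?=([TS].D))")


def find_tsxd(sequence):
    """Return (1-based start position, matched tripeptide) for each (T/S)XD site."""
    return [(m.start() + 1, m.group(1)) for m in TSXD_PATTERN.finditer(sequence)]


# Hypothetical toy sequence used only to exercise the pattern.
toy_sequence = "MAVLDASTGDKLSADW"

print(find_tsxd(toy_sequence))
```

A three-residue pattern like this occurs frequently by chance, so on its own it is not diagnostic; in the analysis above it is meaningful only in combination with the other conserved acidic positions.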
The ways in which PIN domains associate into oligomers, or are combined with other domains or other proteins, are likely to determine the types of editing in which they are involved and the types of substrates on which they act. Thus, the PAE2754 structure is tetrameric, both in solution (as shown by dynamic light scattering and gel filtration) and in the crystal, and the tunnel through the center of the tetramer provides quite restricted access to the active site. Our preliminary modeling suggested that only single-stranded nucleic acids, or single-stranded overhangs from duplex structures, could serve as substrates. This hypothesis is supported by our in vitro assays, using double-stranded DNA with a single-stranded overhang as a template, although we have not demonstrated any specificity in either substrate or the direction of DNA cleavage, and further functional studies clearly need to be carried out. We also note that the FEN1 endonuclease from Pyrococcus horikoshii forms a dimer that is topologically similar to the PAE2754 tetramer, with the two active sites inside the dimer and access via a hole to the exterior.
In contrast, a second example of what is clearly a PIN domain structure has recently been deposited in the Protein Data Bank (ID code 1O4W, deposited by the Joint Centre for Structural Genomics (San Diego)). This protein shares 23% sequence identity with PAE2754 and forms a very similar monomer in which 101 residues match with an r.m.s.d. of 3.0 Å. This Archaeoglobus fulgidus PIN domain forms a dimer in which the monomers are tethered by a stretch of 10 amino acids at the C terminus of the protein, thus separating the monomer active sites by 42 Å and creating a very different molecular environment from that of the four active sites in the PAE2754 tetramer.
An intriguing feature of the phylogenetic distribution of PIN domains is that they seem to be amplified in a number of species, sometimes to a remarkable extent. These include A. fulgidus, P. horikoshii, and Methanococcus jannaschii, all of which are thermophilic euryarchaeota. Among mesophilic bacteria, PIN domains also seem to be extraordinarily amplified in M. tuberculosis. COG1848, proteins from which have been predicted to be part of a new DNA repair system (11), includes no fewer than 14 PIN domain proteins from M. tuberculosis. Additionally, three other COGs are found to include a further 12 M. tuberculosis PIN domain proteins between them. This presents the intriguing question as to why M. tuberculosis would have 26 PIN domains as part of its DNA or RNA editing machinery.
This augmentation of the exonuclease family and the accumulation of PIN domain proteins in certain species raise intriguing questions as to the cellular function of the PIN domains. In thermophilic Archaea, it is reasonable to expect that there would be additional suites of DNA and RNA editing and repair mechanisms due to the elevated levels of oligonucleotide modification and damage caused by the high temperatures. Indeed, a predicted new DNA repair operon encompassing ~20 genes in Archaea contains a PIN domain protein (11,12). However, the augmentation of the exonucleases in mesophiles is unexpected. M. tuberculosis is the most extreme example of the currently sequenced organisms, with its 26 PIN domain proteins, all of which contain the acidic quartet of Mg2+-binding residues and the adjacent serine or threonine. This suggests that a large retinue of exo- or endonuclease enzymes are present in this pathogenic bacterium.
There has been speculation about the functional relevance of the presence or absence of DNA repair genes in M. tuberculosis. It has been suggested that an absence of repair enzymes might be beneficial under conditions of stress or therapeutic treatment during long periods of stationary phase, as is the case for M. tuberculosis (36,37). On the other hand, the correlation between those species lacking the mutS gene, and thus assumed to be mismatch repair-deficient, and the expansion in the same species of the PIN domain proteins may not be coincidental. We conclude that these PIN domain proteins are very likely substitutes for a number of exo- and endonucleases, with roles in DNA repair, that are apparently missing from the M. tuberculosis genome and from the genomes of various extremophile species (12).
The second possible role for these nucleases is as a defense arsenal designed to neutralize phage via DNA and/or RNA degradation. These functions have been adapted in eukaryotic PIN domains to degrade RNA via the RNAi and NMD pathways (10). Hence, in eukaryotes, it seems likely that the editing and repair functions of PIN domains have been adapted to degradation pathways.