Evolution of Peptidase Diversity*

A wide variety of peptidases associate with vital biological pathways, but the origin and evolution of their tremendous diversity are poorly defined. Application of the MEROPS classification to a comprehensive set of genomes yields a simple pattern of peptidase distribution and provides insight into the organization of proteolysis in all forms of life. Unexpectedly, a near ubiquitous core set of peptidases is shown to contain more types than those unique to higher multicellular organisms. From this core group, an array of eukaryote-specific peptidases evolved to yield well known intracellular and extracellular processes. The paucity of peptidase families unique to higher metazoa suggests gains in proteolytic network complexity required a limited number of biochemical inventions. These findings provide a framework for deeper investigation into the evolutionary forces that shaped each peptidase family and a roadmap to develop a timeline for their expansion as an interconnected system.

Proteolysis is a requirement for life. Some peptidases recycle polypeptides into their constituent amino acids, whereas others catalyze selective polypeptide cleavage for post-translational modification. Approximately 700 peptidases are present in the human genome, collectively termed the degradome (1), mediating diverse processes such as extracellular matrix remodeling, immunity, development, protein processing, cell signaling, and apoptosis (2). The MEROPS data base classifies peptidases into clans based on structural similarity or sequence features suggesting such homology (3,4). Families divide clans by common ancestry. Subfamilies have common structure yet unclear ancestry. Individual peptidases are termed species. For example, clan PA peptidases contain eight families, each bearing the classical two β-barrel architecture observed in chymotrypsin, with either cysteine or serine acting as the catalytic nucleophile (5). The blood clotting peptidase thrombin (6) is one species in the large S1A family of clan PA (7). Classification provides context to decipher peptidase function in the natural world.
Different classes of peptidases associate with vital biological pathways. Metallopeptidases dictate structural composition surrounding cells in multicellular organisms (8). Cysteine peptidases mediate regulated cell death (9). Serine peptidases enable our immune response (10). Threonine peptidases function as the central conduit for protein recycling in many organisms (11). Aspartic peptidases broker viral infection (12). However, each of these examples is a small subset within each catalytic type. To clarify peptidase diversity, we apply the MEROPS classification to a comprehensive set of genomes to visualize, compare, and contrast the proteolytic underpinnings of the biosphere. From this analysis, a map of peptidase distribution is drawn to provide a framework for decoding function, architecture, and evolution of proteolysis in vivo. Remarkably, a simple pattern is shown to underlie limited proteolysis in all forms of life.

EXPERIMENTAL PROCEDURES
Annotated protein sequences from whole genome sequence data were obtained from public ftp servers of the NCBI, DOE JGI, and select genome project homepages on the worldwide web. ClustalW was used to prepare sequence alignments from peptidase domains obtained from the MEROPS data base (3,13). The HMMER package by Eddy (14) was used to build hidden Markov models (HMMs) from the ClustalW-prepared alignments, and the models were used to search protein sequence data with default cutoff values. HMMs were constructed for 154 peptidase families. Families were excluded from the census if they contained only a few sequences, typically fewer than 20, or if they were of viral origin. All of the data in the MEROPS data base were queried with each peptidase HMM, which returned greater than 80% data base coverage, indicating utility of the models (supplemental Fig. S1). An iterated search was applied to 128 genomes from all three kingdoms of life. The complete list of genomes examined, their taxonomy, and peptidase estimates are provided in supplemental Table S1. The HMMs applied capture peptidase diversity in the biosphere, yet the numbers of peptidases noted in each organism are only estimates. Results were parsed with BioPerl (15). Census data were converted into a binary matrix and clustered with Cluster 3.0 using the complete linkage algorithm, and the result was visualized with Treeview as shown in Fig. 2 (16).
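The tallying step of such a census can be illustrated with a minimal sketch. It is not the BioPerl code used in the study. It assumes search reports in the whitespace-delimited tabular layout written by HMMER3 `hmmsearch --tblout` (target name in column 1, full-sequence E-value in column 5); the function names, the toy report (truncated to its first seven columns), and the `max_evalue` threshold are illustrative assumptions, since the study states only that default cutoffs were used.

```python
def count_family_hits(tblout_text, max_evalue=1e-5):
    """Return the set of target sequences hit at or below max_evalue.

    Assumes HMMER3 --tblout layout: column 1 is the target name and
    column 5 is the full-sequence E-value. max_evalue is an assumed
    threshold, not the cutoff used in the study.
    """
    hits = set()
    for line in tblout_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        target, evalue = fields[0], float(fields[4])
        if evalue <= max_evalue:
            hits.add(target)
    return hits


def build_census(reports):
    """reports maps (genome, family) -> report text.

    Returns {genome: {family: number of distinct sequences hit}}.
    """
    census = {}
    for (genome, family), text in reports.items():
        census.setdefault(genome, {})[family] = len(count_family_hits(text))
    return census


# Toy report with invented sequence names and scores (hypothetical data).
EXAMPLE = """\
# target name  accession  query name  accession  E-value  score  bias
prot_001  -  S1A  -  1.2e-40  140.1  0.1
prot_002  -  S1A  -  3.0e-12   45.3  0.0
prot_003  -  S1A  -  0.5        5.2  0.0
"""

census = build_census({("toy_genome", "S1A"): EXAMPLE})
print(census)
```

Counting distinct target names rather than report lines avoids double counting a sequence that appears more than once in a report.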

RESULTS
HMMs are useful indicators of peptidase content in a genome sequence (17,18). As many peptidase families lack a known fold, we rely upon MEROPS annotation for each domain. HMMs were constructed for 154 peptidase families using the HMMER software package to probe 128 eukaryotic, eubacterial, and archaeal genomes. Results are provided in supplemental Table S1. Genomes examined represent every order of organism for which complete genome sequence data are available. Due to the tremendous diversity of peptidase families, the resulting census captures trends in degradome composition and is not an exact measure. Variation in polypeptide sequence, distant phylogenetic relationships, gene loss, horizontal gene transfer, inactive peptidase sequences, and the quality of the genome annotation make precise assignment difficult (19). Hence, the tabulated numbers are only approximations of the true diversity. The HMMs distill over two million protein sequences into more than 38,000 peptidases. Plotting the number of each peptidase family estimated in each genome reveals the landscape of degradome composition in known forms of cellular life (Fig. 1). Select peptidase families underwent significant expansion in copy number per genome, and of these the S1A family of trypsins is the most prevalent.
The census yields a landscape of genome-wide peptidase diversity that provides a framework for understanding limited proteolysis in vivo. Serine peptidases are the most abundant class in the dataset, followed by metalloproteases. Bacterial genomes typically contain about 100 peptidases, whereas archaeal genomes contain about half that many. Explosive growth in multiple peptidase families is evident in eukaryotic organisms, and the difference in degradome composition between eubacteria and eukarya is striking. Considerable work remains to detail how changes in polypeptide sequence associate with the functional gains in each of these families and to further refine the accuracy of the census. To derive a concise picture of peptidase diversity, we reduced the dimensionality of the data.
Peptidase census results were converted into a binary matrix, obviating the need for weighting functions and other advanced statistical measures. Clustering peptidase families from the binary dataset by the organisms that contain them greatly clarifies peptidase diversity (Fig. 2). The clustering algorithm, commonly applied to microarray data, segregates organisms into the three kingdoms of life and agrees with accepted classification, yet is based entirely upon degradome content (20-22). Despite presenting few peptidases in their genomes, the archaea group distinctly apart from the eubacteria. Inclusion of multiple genomes from parasitic organisms complicates phylogenetic inference from the dataset due to significant gene loss (23,24). Nevertheless, a clear pattern of peptidase distribution emerges, accounting for all of the HMMs examined and nearly 90% of the families in the MEROPS data base; roughly 10% of the families in the data base are viral peptidases not examined in this study.
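The binary conversion and complete-linkage clustering described above can be sketched in a few lines. This is an illustration only, not the Cluster 3.0 procedure itself: the Hamming distance, the function names, and the toy counts are assumptions introduced here, and the study does not state which distance metric was selected.

```python
def binarize(census, families):
    """Presence/absence vector per genome: 1 if any copy of the family was counted."""
    return {
        genome: [1 if counts.get(fam, 0) > 0 else 0 for fam in families]
        for genome, counts in census.items()
    }


def hamming(a, b):
    """Number of families present in one genome but not the other (assumed metric)."""
    return sum(x != y for x, y in zip(a, b))


def complete_linkage(vectors):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [(name,) for name in sorted(vectors)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                d = max(
                    hamming(vectors[a], vectors[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Invented counts for four hypothetical genomes; not data from the study.
families = ["M24A", "S1A", "C14A", "S16"]
toy_census = {
    "bacterium_1": {"M24A": 2, "S16": 1},
    "bacterium_2": {"M24A": 1, "S16": 2},
    "animal_1": {"M24A": 3, "S1A": 90, "C14A": 10, "S16": 1},
    "animal_2": {"M24A": 2, "S1A": 40, "C14A": 5},
}

vectors = binarize(toy_census, families)
merges = complete_linkage(vectors)
for left, right, dist in merges:
    print(left, right, dist)
```

Because only presence or absence is retained, a family with 90 copies in one genome and a family with a single copy contribute equally to the distance, which is why no weighting function is needed.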
[Figure legend; opening text not recovered] Table S1. The 10 most abundant peptidase families in the biosphere are noted. Select families of peptidases invented by the eukarya underwent significant gene duplication and divergence to dominate the degradome of complex higher metazoa. Of all peptidase families, S1A trypsins underwent the most significant expansion to yield key biological processes such as blood coagulation and immunity.

A nearly ubiquitous core of 16 peptidase families encompasses proteolytic requirements of cellular life in a modern environment (Table 1). Given this extensive distribution, it is unlikely that gene transfer between organisms explains the observation. Highly conserved peptidase families play central metabolic and protein processing roles such as folate and glutathione metabolism, de novo purine biosynthesis, and signal peptide processing (25-28). The importance of intracellular protein homeostasis is evident from the presence of nine families with known digestive function (29). Rhomboid and FtsH peptidases cleave substrates on or in a phospholipid bilayer and are highly conserved (30,31). The prolyl aminopeptidase family is associated with a variety of chemistries outside of proteolysis, including epoxide hydrolase and monoglyceride lipase activities (32). The M22 family bears homology with the DnaK/Hsp70 family of chaperones (33). Hence, the 16 near ubiquitous families need not have been active peptidases at their origin nor throughout the dataset. The basal degradome was expanded upon through gains in intra- and extracellular peptidase activity.
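One way to make "nearly ubiquitous" operational is to keep families present in at least a chosen fraction of genomes. The sketch below is a hedged illustration: the threshold `min_fraction`, the function name, and the family labels are assumptions made here, as the text does not state a numerical criterion for the core set.

```python
def near_ubiquitous(presence, families, min_fraction=0.9):
    """Return (family, fraction) pairs for families found in at least
    min_fraction of genomes.

    presence maps genome -> set of families detected in that genome.
    min_fraction is an assumed threshold, not a value from the study.
    """
    n_genomes = len(presence)
    if n_genomes == 0:
        raise ValueError("no genomes supplied")
    core = []
    for fam in families:
        fraction = sum(fam in found for found in presence.values()) / n_genomes
        if fraction >= min_fraction:
            core.append((fam, fraction))
    return core


# Hypothetical presence data for four genomes and three placeholder families.
toy_presence = {
    "genome_1": {"famA", "famB"},
    "genome_2": {"famA", "famB", "famC"},
    "genome_3": {"famA", "famC"},
    "genome_4": {"famA", "famB"},
}

core = near_ubiquitous(toy_presence, ["famA", "famB", "famC"], min_fraction=0.75)
print(core)
```

Using a fraction below 1.0 tolerates the loss of a core family in a few organisms, such as parasites with reduced genomes.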
Intracellular peptidase inventions bring about four major cellular advances in eukaryotes (Table 2). First, ubiquitin- and SUMO-mediated protein turnover emerged through invention of six peptidase families (34). The autophagins are also critical for autophagy, a cellular process requiring a complex cytoskeleton. Second, calpains are unique to eukarya and, among known peptidases, unique in their dependence upon Ca2+ for activity (35). Calpains are largely cytosolic and critical for cytoskeletal reorganization, endosome recycling, and membrane fusion (36). Third, two peptidase families are involved in structure and cell division in the nucleus. Separases separate sister chromatids during mitosis or meiosis through cleavage of cohesins, and nucleoporins form an integral component of the pore separating the nucleus from the cytoplasm (37,38). Last, M76 peptidases are required for processing of mitochondrial proteins including respiratory chain complexes (39). On the basis of these intracellular advances, extracellular proteolysis expanded with the emergence of multicellularity.
A number of peptidase families involved in extracellular biology evolved in eukarya to broker key regulatory processes and set the stage for further expansion (Table 2). Invention of novel selectivity mechanisms fostered gains in the complexity of extracellular biology. Selectivity advances include the removal of a single basic amino acid from the C terminus of a substrate (M14B family) (40), cleavage of substrates following a penultimate proline residue (S28 family) (41), and steric hindrance to a dipeptidyl peptidase-selective active site by an associated protein domain (S9B family) (42). The unusual selectivity of these novel peptidases promoted expansion of cellular communication through processing of small bioactive peptides. Of the eukaryotic inventions, only a handful are specific to higher metazoa.
Remarkably, only 10 peptidase families are restricted to non-yeast metazoa, implying late evolutionary discovery (Table 3). Of these, the C14A caspases are notable for their central role in apoptosis (43). The C14 family is the clearest example of phylogenetic distinction in a peptidase clan. Peptidases in the C14A subfamily are found in eukaryotes, whereas C14B peptidases are not. Therefore, an ancestral C14B peptidase is the root of all C14A peptidases. Two additions to the ubiquitin-dependent protein recycling machinery are present in higher organisms, the Cezanne deubiquitinylating peptidases and the CylD protein (44,45). Diversity of function is embodied by the M12A family of astacins, which act in processes ranging from hormonal control to extracellular matrix remodeling (46). Early association with a disintegrin domain propelled massive and rapid expansion of the adamalysins to shape extracellular matrix composition and other signaling functions (47). Of late emerging peptidases, pappalysins are noteworthy given the origin of the name of this family, pregnancy-associated plasma

TABLE 1
The nearly ubiquitous core of peptidase families
Sixteen peptidase families can be identified in the genomes of all forms of cellular life. The known functions of these peptidases appear to encompass all the requirements for complex proteolysis capable of digestion and protein processing. Several organisms have lost one or more of these peptidase families.
[Table body not recovered; columns: Family, Clan]

Although most peptidase families can be grouped through organism-level description, several cannot due to sporadic distribution, poor identification by the HMMs applied, and rare occurrence. In certain instances, the MEROPS classification may require adjustment. For example, several larger peptidase clans with many families are likely distantly related. The common ancestor of these families lived over a billion years ago, and its traces have been obliterated by the passage of time (49). Hence, phylogenetic reconstruction is challenging and demands both additional sequence information and close attention to distinguish convergent from divergent evolution. It is reasonable that the majority of peptidase families arise from divergent evolution, as both the structure and the enzymatic mechanism are conserved. For example, the M16 family contains three subfamilies with structural homology and weak sequence similarity (50-52). When combined with organism-level distribution, M16B predates both M16C and M16A and is the ancestral subfamily. Such a simple pattern is not readily discerned with most peptidase families. Of these, the S1A peptidases have a contentious origin despite widespread abundance.
Peptidase families with mixed distribution are spread throughout the clustered data. Dispersal of these 32 peptidase families results from the diversity of both the peptidases and the genomes examined, gene transfer between organisms, and misidentification. Both free-living and parasitic lifestyles are also encompassed by the dataset and further complicate interpretation of the results. Of the mixed peptidase families, the tricorn peptidase, C-terminal processing peptidase, and HslV component of the HslUV peptidase are notable given their importance to intracellular protein homeostasis yet overlapping and redundant functionality (53,54). Two abundant peptidase families in higher organisms, the S1A family of trypsins and the S9A prolyl oligopeptidases, also show a widely dispersed pattern. These peptidase families require further attention to address the interplay between gene transfer and loss.
Analysis of peptidase diversity spanning the complete biosphere introduces limitations on the results. The HMMs used are derived from MEROPS annotation and return greater than 80% coverage of the data base (supplemental Fig. S1). Use of protein rather than nucleotide sequence data carries forward any bias in the genome annotation and in the quality of the underlying sequence data (55). Pseudogenes are not considered. Due to the broad nature of the HMMs applied, some peptidases identified may be non-peptidase homologs, particularly in the S33 family. Hence, the total numbers of peptidases counted are useful estimates that differ from previous studies (56-58). These issues encourage revisions to the framework. Despite these limitations, clear trends emerge in the evolution of peptidase diversity.
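The coverage figure quoted above can be read as the fraction of annotated family members that a model recovers. The sketch below shows that arithmetic under this assumed definition; the function names and sequence identifiers are hypothetical, and the study may have computed coverage differently.

```python
def coverage(known_members, recovered):
    """Fraction of annotated family members recovered by a model.

    Assumed definition: |known AND recovered| / |known|.
    """
    known = set(known_members)
    if not known:
        raise ValueError("family has no annotated members")
    return len(known & set(recovered)) / len(known)


def unannotated_hits(known_members, recovered):
    """Hits outside the annotation: candidates for non-peptidase homologs
    or for peptidases not yet annotated."""
    return set(recovered) - set(known_members)


# Hypothetical identifiers: five annotated members, four recovered, one extra hit.
known = ["seq1", "seq2", "seq3", "seq4", "seq5"]
hits = ["seq1", "seq2", "seq3", "seq4", "seq9"]

print(coverage(known, hits))
print(unannotated_hits(known, hits))
```

Tracking the unannotated hits separately is useful for broad models, where hits may include non-peptidase homologs of the kind noted for the S33 family.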

DISCUSSION
Identification of a near ubiquitous core of peptidase families provides a working definition of a basal degradome and a starting point to appreciate evolutionary expansion of proteolytic networks. Near ubiquitous peptidase families do not increase in copy number per genome over time and do not expand into different functional niches. Non-expanding, highly conserved peptidase families broker intracellular processes, and their absence in certain organisms suggests they can be replaced. Therefore, although the core set of peptidases fulfills vital roles, its members are neither malleable to the forces of molecular evolution nor location independent. Such observations are all the more remarkable given exposure to billions of years of natural selection. Degradome expansion demanded novel mechanisms of limited proteolysis, particularly in extracellular environments. Eukaryotes met this requirement through a variety of protein innovations, yet surprisingly few of these novel families are unique to higher metazoa. The limited number of peptidase inventions in higher organisms suggests that network organization, refinement of the properties of select peptidases, and enhanced spatial and temporal regulation were sufficient to generate the wide diversity of proteolysis in life. Trypsins, the S1A family, are the most abundant members of the human degradome, as they are in those of all non-yeast metazoa and many lower eukaryotes. Emergence of the S1A peptidase family predates major speciation events in multicellular eukaryotes, yet the origin of this family remains unclear, as with several other peptidase families, due to mixed distribution. Visualization of degradome diversity provides a roadmap to develop a timeline for the evolution of proteolysis at the systems level.