Evolutionary Divergence of Substrate Specificity within the Chymotrypsin-like Serine Protease Fold*

One of the enduring paradigms in enzymology is the theme of evolutionary divergence in substrate specificity from a parental enzyme possessing a prototypical fold. Within the fold of each divergent enzyme is embedded the catalytic machinery essential to providing rate enhancement for an identical reaction, and the particular amino acids directly involved in this function are expected to be highly conserved and invariant in their spatial positions. Among the remaining residues will be those required for the overall three-dimensional structure. Such residues are found largely within the hydrophobic core of the enzyme and, within a family, may show covariance according to the thermodynamic determinants specifying stability and the informational determinants specifying uniqueness of the fold. The remaining amino acids of the enzymes then constitute much of the raw material for evolutionary adaptation to novel selectivity. Many of these can be expected to interact with distal, variable, specificity-determining portions of the substrates in modes that are unique to each enzyme of the family. The core structural residues may also contribute to generating new specificity, by virtue of second or third shell interactions with the amino acids directly contacting substrate.
Here we describe structure-function relationships in one of the largest and most comprehensively studied of all enzyme families, the serine proteases with a chymotrypsin fold. Over 20 unique structures have been determined to date, and the number of available sequences exceeds 500 (for a comprehensive review, see Ref. 2). All of the enzymes possess an identical fold consisting of two β-barrels, with the catalytic Ser195, His57, and Asp102 amino acids found at the interface of the two domains. Common features present in all structures, including five enzyme-substrate hydrogen bonds at positions P1 and P3,¹ serve to properly juxtapose the scissile peptide bond adjacent to the Ser-His catalytic couple, such that the nucleophilic Ser195 O-γ is accurately positioned for attack. In contrast to these conserved interactions in the direct vicinity of the reacting groups, the more distal contacts diverge sharply. We describe the structural themes that embody this divergence together with some of the most important enzymological data necessary to the structure-function correlation. One important theme that emerges is that of catalytic register (3); the divergence in distal interactions in different enzymes must be such as to still permit accurate substrate alignment. Another observation is that many structural determinants controlling specificity reside on surface loops, and this allows for the possibility of rapid and varied evolutionary divergence with conservation of the overall tertiary fold.
Divergence of substrate specificity within the context of a common structural framework represents only one mechanism by which nature is able to evolve new activities. Another mechanism involves incorporation of new catalytic groups within an active site, such that the same scaffold can carry out a range of chemistries with just a single step common to different members of the family. This theme is elaborated in the review by Babbitt and Gerlt (4). It is also possible to obtain information on the requirements for a common catalytic function by studying examples of convergent evolution, where the same chemical reaction is carried out by very different scaffolds. Serine proteases represent a paradigm in this respect as well; besides the chymotrypsin-fold enzymes, there are now four other known natural folds that possess the requisite catalytic determinants in similar spatial positions (5-10). Study of these structures and redesigned versions of trypsin (11) shows that the catalytic Asp can adopt virtually any position with respect to the Ser-His dyad, suggesting that the classical "catalytic triad" of Ser, His, and Asp residues (Fig. 1) can in fact be better described as the juxtaposition of two dyads: Ser-His and His-Asp (6).

Structural and Functional Determinants of S1 Site Specificity
The S1 site specificity of enzymes with chymotrypsin folds has been subjected to considerable evolutionary refinement. The classical paradigm for understanding specificity at this position, derived from early crystallographic studies of chymotrypsin, elastase, and trypsin (12), is that the amino acids immediately contacting substrate fully determine the observed selectivities for large hydrophobic, small hydrophobic, and basic residues, respectively. This proposal arose from the observation that the S1 pockets of the enzymes are highly complementary to these amino acids. Thus, trypsin possesses a deep cylindrical pocket punctuated at its base by Asp189, chymotrypsin an analogously shaped cavity lacking the negative charge, and elastase a shallow hydrophobic depression. In each case the walls of the pocket are formed by three β-strands connected by two surface loops and the disulfide bond Cys191-Cys220 (Fig. 2). Amino acid side chains at positions 190 and 228 extend into the base of the pocket and modulate the specificity profile, whereas residues 216 and 226 are additional primary determinants located on one wall. These amino acids are Gly in enzymes exhibiting trypsin or chymotrypsin specificities. The glycines permit access of the large substrate side chains to the base of the pocket. In enzymes exhibiting elastase specificity, larger, usually nonpolar residues are present and provide a platform for interaction with small hydrophobic P1 substrate side chains.
* This minireview will be reprinted in the 1997 Minireview Compendium, which will be available in December, 1997. This is the first article of two in the "Minireview Series on Enzyme Superfamilies." This work was supported by National Science Foundation Grant MCB9604379 (to C. S. C.).
¹ See legend to Fig. 1 for nomenclature.
FIG. 1. Chemical mechanism of catalysis for serine proteases. Catalytic groups of chymotrypsin-fold enzymes are shown interacting with a peptide substrate binding to the P1 site. (Nomenclature for the substrate amino acid residues is Pn, ..., P2, P1, P1′, P2′, ..., Pn′, where P1-P1′ denotes the hydrolyzed bond; Sn, ..., S2, S1, S1′, S2′, ..., Sn′ denotes the corresponding enzyme binding sites (1).) Five enzyme-substrate hydrogen bonds at positions P1 and P3 are shown in addition to hydrogen bonds among the members of the catalytic amino acids.
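The Pn/Pn′ nomenclature defined in the legend to Fig. 1 can be made concrete with a short sketch. The function name `label_positions` and its arguments are hypothetical and introduced only for illustration; the example peptide is the (Asp)₄-Lys activation sequence of trypsinogen followed by Ile16, which is discussed later in the text.

```python
def label_positions(peptide, p1_index):
    """Label each residue relative to the scissile bond.

    peptide  : list of residue names, N terminus first
    p1_index : 0-based index of the P1 residue; the bond between
               peptide[p1_index] and peptide[p1_index + 1] is hydrolyzed
    (Both names are illustrative assumptions, not from the source.)
    """
    if not 0 <= p1_index < len(peptide) - 1:
        raise ValueError("the scissile bond must lie inside the peptide")
    labels = []
    for i, residue in enumerate(peptide):
        if i <= p1_index:
            # Residues on the N-terminal side count down to P1.
            labels.append((f"P{p1_index - i + 1}", residue))
        else:
            # Residues on the leaving-group side count up from P1'.
            labels.append((f"P{i - p1_index}'", residue))
    return labels


# (Asp)4-Lys followed by Ile16; cleavage occurs after the Lys.
activation_site = ["Asp", "Asp", "Asp", "Asp", "Lys", "Ile"]
for label, residue in label_positions(activation_site, p1_index=4):
    print(label, residue)
```

The corresponding enzyme subsites follow by replacing P with S, so the residue labeled P3 occupies S3 and the residue labeled P1' occupies S1'.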
This standard model is incomplete, as shown in part by recent studies of trypsin (3, 13-20). Amino acid substitutions in the primary binding pocket at positions 216 and 226 can have a dramatic effect on the specificity and catalytic efficiency of trypsin due to improper register of the substrate with the catalytic machinery (3). Although Asp189 at the base of the S1 pocket is a primary determinant of arginine and lysine specificity, substitution of this residue with other amino acids does not alter substrate specificity (13,14,20). Even substitution of all the amino acids in the S1 site of trypsin with their counterparts in chymotrypsin fails to transfer chymotryptic specificity to the variant enzyme (15). Transfer of specificity requires the additional exchange of amino acids in at least three distal segments of the enzyme, none of which directly contacts substrate. The exchange of surface loops 1 and 2 (Fig. 2) is sufficient to confer high acylation rates toward P1-Phe substrates (15), and the further mutation Y172W in the third surface loop (loop 3) improves binding by 50-fold. The overall catalytic efficiency of the mutated enzyme is roughly 10% of that exhibited by chymotrypsin (16).
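As a rough guide to the magnitudes involved, fold changes such as those quoted above can be converted to free-energy differences with the standard thermodynamic relation. The following is an illustrative sketch, not a calculation taken from the cited work: the temperature of 298 K is an assumption (assay conditions are not stated here), and the symbols ΔΔG, R, and T are introduced only for this example.

```latex
\begin{align*}
|\Delta\Delta G| &= RT \,\ln(\text{fold change}),
  \qquad RT \approx 0.592\ \text{kcal mol}^{-1}\ \text{at an assumed } T = 298\ \text{K} \\[4pt]
\text{50-fold improvement in binding:}\quad
  |\Delta\Delta G| &\approx 0.592 \times \ln 50 \approx 2.3\ \text{kcal mol}^{-1} \\[4pt]
\text{10\% of chymotrypsin efficiency (10-fold gap):}\quad
  |\Delta\Delta G^{\ddagger}| &\approx 0.592 \times \ln 10 \approx 1.4\ \text{kcal mol}^{-1}
\end{align*}
```

On this scale, the remaining difference between the loop-exchanged trypsin variant and chymotrypsin corresponds to an energy comparable to that of a single hydrogen bond.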
These mutational experiments on trypsin show that evolutionary refinement of S1 site specificity in this family can involve modulation by distal enzyme segments not contacting substrate. However, other members of the family respond quite differently to localized mutagenesis. Two other well studied examples are the invertebrate fiddler crab collagenolytic serine protease I (21) and the microbial α-lytic protease (22). Collagenase is unique in its known properties by virtue of an exceptionally broad specificity encompassing the recognition properties of trypsin, chymotrypsin, and elastase. This is a consequence of a uniquely shaped S1 site pocket, which is enlarged by the insertion of several amino acids between Gly216 and the disulfide bond. A co-crystal structure of the enzyme complexed to the dimeric protein inhibitor ecotin suggested multiple distinct binding modes for the different P1 substrate amino acids (23). By contrast, α-lytic protease possesses an elastase-like specificity profile favoring small hydrophobic residues, and its crystal structure shows that the S1 site is a shallow hydrophobic depression (24). However, in each case, directed mutagenesis of a small number of amino acids directly contacting substrate produced large changes in the P1 specificity profile, while maintaining high catalytic efficiency (22,23). This appears to demonstrate a less important role for distal residues than in trypsin and suggests that these two enzymes may more closely approach the original paradigm.
The rationale for this disparate behavior may have its origins in the evolution of bacteria and mammals to fill different biological niches. Whereas a comprehensive analysis must await work on a broader array of enzymes, it appears that organisms which diverged earlier, such as prokaryotes and invertebrates, possess broadly specific serine proteases capable of processing a wide variety of substrates. By contrast, mammals possess multiple distinct enzymes, which together fulfill the function of the single enzyme in earlier evolved species. Thus, the mammalian enzymes can diversify to fill very specific roles. One possible rationale for this observation could be that organisms which diverged earlier have been selected to possess a smaller genome, so that restrictions on size prevent initiating the process of gene amplification and subsequent divergence. Complex multicellular organisms have a greater requirement for diverse function in specific cellular contexts, leading to a selection toward very specific function.
In the course of divergence, mammalian serine proteases lose their broadly specific properties through the accumulation of mutations contributing to a much narrower selectivity. It is, therefore, not surprising that mammalian enzymes are often not readily engineered to new specificities; so many mutations have accumulated that the structural properties of the scaffold have been altered in a fundamental way, even though the overall fold is retained. Mutations introduced at the specificity-determining position 189 in other mammalian proteases of the family produce variant enzymes that behave similarly to trypsin.² However, it should be noted that there are also examples of functionally and structurally related mammalian enzymes that can be readily reengineered. The lactate and malate dehydrogenases, for example, can be interconverted with just a single amino acid substitution (32). The origins of this different behavior toward mutation must reflect different structural properties of the dehydrogenases relative to the mammalian serine proteases, and these are also subject to evolutionary optimization.
A structural rationale for the different properties of the scaffold in early and late evolved enzymes of the family has emerged from crystallographic study of variants in α-lytic protease and trypsin. Crystal structures of the trypsin variant into which was substituted the S1 site and loops 1 and 2 of chymotrypsin (trypsin → chymotrypsin (S1 + L1 + L2)) and of a similar enzyme possessing the further Y172W replacement show that the detailed conformation of the polypeptide backbone at the conserved Gly216 is correlated with high acylation rates of both variants toward P1-Phe substrates (19). Gly216 forms two hydrogen bonds with the substrate P3 residue in all enzymes of the family (Fig. 1), and these enzyme-substrate interactions are required to achieve full catalytic potency and to obtain a maximal level of P1 site discrimination among alternative amino acids (15,16). It appears that the S1 site specificity in trypsin must be viewed as a more distributed property of the fold and not as the province of a few amino acids that interact directly. In this enzyme, achieving catalytic register of enzyme and substrate requires both proximal and distal residues. This was further shown by engineering a metal-dependent substrate binding site in trypsin that delocalized specificity by increasing the relative substrate binding contributions from the alternate engineered site (20).
The crystallographic analysis of α-lytic protease variants complexed with peptidyl boronic acid transition state analogs varying at their P1 amino acids reveals that the S1 site of this homologous enzyme has quite different structural properties. Enlargement of the enzyme pocket via the substitutions M190A or M213A greatly broadens the specificity profile to include residues as large as P1-Phe, with kcat/Km increased by up to 15-fold relative to wild-type cleavage at P1-Ala. Crystal structures of the variants showed that the principal rationale for the broadened specificity is structural plasticity of the S1 site, which encompasses a combination of alternate side chain conformations as well as deformability of the main chain (22). In both cases, the broad specificities appear to depend on the ability of the main chain and side chain atoms, at positions 216 and the following inserted loop 217-220, to readjust positions. Thus, as in the case of trypsin, specificity is correlated with the structure of a portion of the enzyme that binds the distal P3 residue of substrates, where the crucial intermolecular hydrogen bonds are made. A key distinction, however, is that there is little dependence of activity on the precise interactions of the P1 substrate side chain in the enzyme S1 site. Thus, a compelling rationale for the high catalytic activity of α-lytic protease variants on novel substrates is that the backbone at position 216 can adjust to compensate for differences in the P1-S1 interactions, so that the substrate scissile bond is still in catalytic register with respect to Ser195 and His57. This is in sharp contrast to trypsin, where there is no movement at Gly216 in response to mutation in the S1 site (18,25).
² C. S. Craik, unpublished results.
Because the specificity profile of the α-lytic protease S1 site is easily modified by local mutation, it might be predicted that the distal elements play little or no role in modulating it. However, mutagenesis of loop 2 of the enzyme, which is greatly enlarged relative to loop 2 of trypsin (Fig. 3), shows that this segment does have significant effects on relative kcat/Km values among the different P1-Phe amino acids (26). The effects vary depending on the precise position of the mutation within the loop, with amino acids closer to the S1 site showing larger differences from wild type. Thus, substrate specificity in α-lytic protease as well as in trypsin is better viewed as a distributed property of the fold. Clearly, the ability to easily modify specificity by mutation of directly interacting residues does not exclude a role for distant amino acids. Other enzymes in which mutation of distal residues affects specificity are also known, for example, isocitrate dehydrogenase (33) and triosephosphate isomerase (34). Evolution can create peripheral, readily variable structures around a binding site to modulate dynamic as well as static aspects of the enzyme-substrate interaction.
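The reason relative kcat/Km values serve as the measure of specificity can be stated compactly. The expression below is the standard textbook result for two substrates competing for the same enzyme under Michaelis-Menten kinetics; it is included as a reminder rather than as a result of the studies cited, and the labels A and B for the two substrates are introduced here for illustration.

```latex
\[
\frac{v_{\mathrm{A}}}{v_{\mathrm{B}}}
  = \frac{(k_{\mathrm{cat}}/K_{\mathrm{m}})_{\mathrm{A}}\,[\mathrm{A}]}
         {(k_{\mathrm{cat}}/K_{\mathrm{m}})_{\mathrm{B}}\,[\mathrm{B}]}
\]
```

At equal substrate concentrations the ratio of cleavage rates therefore equals the ratio of the two kcat/Km values, so a mutation that changes this ratio between two P1 variants has changed the specificity profile by the same factor.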
The difference in inherent flexibility found in the S1 sites of α-lytic protease and trypsin allows us to infer how evolution optimizes this important specificity-determining property of an enzyme active site. It appears to do so by modulating how the S1 site is tethered to surrounding segments, which have the capacity to diverge substantially in structure because they form surface loops and are not key elements of the fold. Thus, the early evolved, broadly functional α-lytic protease possesses a scaffold constructed to allow flexibility during substrate binding, and this property leads to both broad specificity of the wild-type enzyme and ease of specificity modification by mutation. By contrast, the structurally homologous trypsin scaffold possesses the opposite properties, as required by its occupancy of a late evolved and highly specific evolutionary niche. The comparative analysis of these two enzymes has opened up the possibility of experiments aimed at achieving a much deeper understanding of the underlying structural origins of flexibility. In this regard it is noteworthy that the addition of several mutations at Gly216 of α-lytic protease (G216A and G216L) to the original M190A variant (22) virtually eliminated the active site plasticity of the latter (27). This shows that binding site plasticity can be modulated by just a single amino acid.

Role of Surface Loops in Determining Subsite Specificity
We consider now some examples of evolutionary divergence at positions more distal to the scissile bond. Crystal structures of various enzymes complexed to peptide and protein inhibitors have produced models of extended peptide binding at sites extending from as far as P7 to P4′. However, the detailed path of the substrate outside of the crucial P3-P1′ region diverges among different members of the family, reflecting specific biological roles. For example, serine proteases involved in fibrinolysis (28), in blood coagulation (29), and in the complement system (30) are all members of the family possessing the chymotrypsin fold. These enzymes recognize specific cleavage sites and are themselves the targets of protein inhibitors, which play key roles in regulating these processes. Substantial diversity in the details of these interactions is expected, but because of the structural similarities, an overall framework exists to understand each example.
Seven separate surface loops have been implicated in distal site substrate or inhibitor contacts with different enzymes of the family. Five of these are designated as loops A-E (Fig. 3). The remaining two (loops 2 and 3 (Figs. 2 and 3)) appear able to function in either subsite or S1 site interactions. As described above for the S1 site, the structural basis for evolutionary adaptation and creation of new specificities lies in the fact that the loops are not core elements of the fold. There are examples where adjacent loops interact closely with one another, apparently functioning together to determine substrate specificity. Therefore, whereas evolutionary adaptation provides loop structures capable of diverse binding interactions directly with substrates or inhibitors, it may simultaneously optimize loop-loop intramolecular contacts to produce a contiguous surface for presentation to targets.
One example of extended binding site specificity is provided by the enzyme enteropeptidase, which functions in vivo to cleave trypsinogen at position Ile16, generating the new N terminus required for trypsin activity (35). The enzyme is capable of cleaving the (Asp)₄-Lys sequence in trypsinogen with a catalytic efficiency roughly 10⁴-fold greater than trypsin. Whereas the structure of this enzyme has not yet been determined, mapping the sequence onto the common chymotrypsin-like fold indicates that the peptide Lys96-Arg97-Arg98-Lys99 (KRRK) in loop C is well positioned to play a direct role in interacting with the negatively charged aspartates occupying positions P2-P5 (36). Further inspection of the sequence alignments reveals differences at positions 215-219 at the lip of the S1 site (loop 2), as well as an inserted residue in loop 3 relative to trypsin (Fig. 3). Each of these features may be of importance to precise orientation of the (Asp)₄-Lys substrate, possibly by specifying the conformation of Gly216. Additionally, enteropeptidase possesses a striking 10-residue insertion between residues 58 and 59 in the surface loop B, which lies directly behind the KRRK sequence of loop C. Whereas loops B and C do not contact each other in many other enzymes of the family, the large loop B in enteropeptidase could make interactions that may be important for maintaining the correct orientation of the KRRK residues.
A second example of the role of surface loops in generating subsite specificity comes from the study of fiddler crab collagenolytic serine protease I. The ability of this enzyme to cleave native type I triple helical collagen is a property not exhibited by either the canonical mammalian or the microbially derived chymotrypsin-fold serine proteases. Incubation of the enzyme with type I collagen substrates produces signature cleavages at a number of sites located roughly three-quarters of the distance along the 1000-amino acid chains (21). This property identifies it as a true collagenase, as distinct from gelatinases that cleave only denatured collagen (gelatin). Since the enzyme possesses the usual 30-35% sequence similarity with other members of the family, it was immediately apparent that unique structural determinants must be present to confer cleavage specificity toward the large triple helical substrate.
The crystal structure of crab collagenase complexed with the protein inhibitor ecotin reveals a large, pronounced groove running across the enzyme surface and including the catalytic site (31). Molecular modeling was employed to show that the large groove is readily able to accommodate the three chains of collagen. A detailed energy-minimized model of the triple helix was then built into collagenase, which predicts key structural features of the interaction. Five surface loops contribute to the construction of the large binding groove on the enzyme. The polypeptide chain of collagen that is to be cleaved binds loops 2 and 3 at its N terminus, providing unique P5-P7 interactions. On the C-terminal side of the scissile bond, the cleaved strand interacts with surface loop A, which forms one wall of the groove at this position. These interactions help to constrain the rotational freedom of collagen about its long axis. The two noncleaved strands interact with both the pronounced groove on the C-terminal side and a shallow bowl bridging this groove with the N-terminal extended peptide binding site. The bowl is formed from the interactions of surface loop D with loop 2, the sides of the C-terminal groove are constructed from loops E and A, and the floor of this groove consists of amino acids in core β-strands of the structure.
Among enzymes of known structure that have been tested for the ability to specifically cleave type I collagen, which include trypsin, chymotrypsin, and human neutrophil elastase in addition to collagenase, only the invertebrate enzyme is active (21). This correlates with structure, in that none of the former enzymes possesses a marked binding groove. Human neutrophil elastase, which is able to cleave type III collagen, possesses a much less well defined groove. Possibly the type III collagen helix is deformed at a lower free energy cost than is type I and so requires fewer compensating interactions with an enzyme binding site. Further structures of collagenolytic enzymes will help in defining the particular features of the binding grooves and of the surface loops that are most important to collagenolytic specificity.

Summary
We have described a structural scaffold extant in over 500 homologous enzymes, where the catalytic mechanism is identical and the substrate specificity varied. The many known structures of chymotrypsin-fold serine proteases delineate a clear framework showing that this variability is a function of evolved diversity in the structures of surface loops, which surround the extended substrate binding site. Because the loops are conserved in their relative positions with respect to the N to C direction of the scissile peptide, general themes for their individual functions can be described. Thus, loop C is invariably positioned to directly contact the extended substrate on the N-terminal side of the scissile bond, whereas loops A, B, D, and E interact on the leaving group side (Fig. 3). Loops 2 and 3 can modulate either or both of the S1 sites or the subsite preferences on the N-terminal side of the scissile bond. Loop 1 is thus far known to determine only the S1 site specificity.
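The loop assignments summarized above can be collected into a small lookup table. This is only a sketch of the summary paragraph: the names `LOOP_ROLES` and `loops_for`, and the three region labels, are hypothetical conveniences and not terminology from the cited structures.

```python
# Region labels (assumed for illustration):
#   "S1"            : the primary specificity pocket
#   "N-terminal"    : subsites on the N-terminal side of the scissile bond
#   "leaving-group" : subsites on the leaving-group (C-terminal) side
LOOP_ROLES = {
    "1": {"S1"},                # loop 1: so far known to affect only S1
    "2": {"S1", "N-terminal"},  # loops 2 and 3: S1 and/or N-terminal subsites
    "3": {"S1", "N-terminal"},
    "A": {"leaving-group"},
    "B": {"leaving-group"},
    "C": {"N-terminal"},        # loop C contacts the N-terminal side
    "D": {"leaving-group"},
    "E": {"leaving-group"},
}


def loops_for(region):
    """Return the loops listed as able to influence the given region."""
    return sorted(loop for loop, roles in LOOP_ROLES.items() if region in roles)


for region in ("S1", "N-terminal", "leaving-group"):
    print(region, loops_for(region))
```

Laid out this way, the table makes the asymmetry visible: the S1 site and the N-terminal subsites share loops 2 and 3, whereas the leaving-group side is served by a separate set of four loops.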
The extensive investigations of two particular enzymes of the family have been rewarded by substantial insights into the structural rationale by which specificity at one of these sites may be either broad or narrow. From these studies we can postulate that enzymes which evolved early are less diversified and thus broadly specific in their P1 substrate preference, and this functional property is correlated with a flexibility in the protein structure. By contrast, late evolving enzymes occupy very specific evolutionary niches, and their narrow substrate specificities arise from a rigid S1 site structure. Further characterization of the various members of this family is necessary to test this hypothesis. Inherent flexibility of structure and its correlation with ease of specificity modification by mutagenesis provide a theoretical insight, which should be of value in designing mutational strategies for both serine proteases and other enzymes. Since the specificity-determining surface loops appear not to be key structural elements, the scaffold of chymotrypsin-like proteases may be an excellent choice for the engineering of novel site-specific proteases possessing a range of useful properties.