ACCELERATED PUBLICATION| Volume 277, ISSUE 22, P19243-19246, May 31, 2002

Substrate Recognition Drives the Evolution of Serine Proteases*

• Author Footnotes
* This work was supported in part by National Institutes of Health Research Grants HL49413 and HL58141. The costs of publication of this article were defrayed in part by the payment of page charges. The article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Open AccessPublished:March 29, 2002
A method is introduced to identify amino acid residues that dictate the functional diversity acquired during evolution in a protein family. Using over 80 enzymes of the chymotrypsin family, we demonstrate that the general organization of the phylogenetic tree and its functional branch points are fully accounted for by a limited number of residues that cluster around the active site of the protein and define the contact region with the P1–P4 residues of substrate.
Phylogenetic trees of proteins are useful to map evolutionary developments and functional diversity within a given family. Progress made in the cloning and expression of a large number of autologous proteins has generated large databases that can now be used to draw quantitative conclusions on how different specificities and functions emerged during evolution. Residues that are conserved throughout the family provide important determinants of function or structural stability, but cannot be linked to the onset of functional differences or specificities because they are evolutionarily non-informative. On the other hand, residues subject to a high rate of coding nucleotide mutation are often inconsequential on function or structural stability. However, some residues must have changed during evolution to provide the branching patterns observed in phylogenetic trees. Methods that identify these residues can potentially unravel how proteins evolved in general and assist in the identification of function in newly discovered genes.
Several methods have been proposed to identify residues potentially involved in function. Some of these methods explore the functional divergence from sequence multi-alignment and phylogenetic trees (
• Gu X.
). Others combine phylogenetic and structural data (
• Lichtarge O.
• Bourne H.
• Cohen F.E.
,
• Armon A.
• Graur D.
• Ben-Tal N.
,
• Landgraf R.
• Xenarios I.
• Eisenberg D.
). Evolutionary trace (
• Lichtarge O.
• Bourne H.
• Cohen F.E.
), ConSurf (
• Armon A.
• Graur D.
• Ben-Tal N.
), and three-dimensional cluster analysis (
• Landgraf R.
• Xenarios I.
• Eisenberg D.
) methods scan for residue conservation within sub-group of proteins, pooled together within the same tree branch or filtered over identity threshold, and define surface patterns representative of the subgroup. Here we introduce an alternative method that does not require subdivision of the protein family and that takes into account not only conserved residues, but also residue replacements fixed along the transition from one branch of the evolutionary tree to other daughter branches. The method evaluates how every position of the sequence multi-alignment contributes to the architecture of the tree and provides a three-dimensional map of the regions of the protein that played a determining role during evolution.

MATERIALS AND METHODS

700 protein sequences were pulled out from the NCBI non-redundant data base (National Library of Medicine, National Institutes of Health, Bethesda, MD) by several iterations of PSI-BLAST (
• Altschul S.F.
• Schaffer A.A.
• Zhang J.
• Zhang Z.
• Miller W.
• Lipman D.J.
) using trypsin homologues as queries. All sequences were aligned with Clustal W1.8 (
• Thompson J.D.
• Gibson T.J.
• Plewniak F.
• Jeanmougin F.
• Higgins D.G.
). The best alignment obtained after different circular permutations of sequences and incorporation of the alignment from three-dimensional structure superposition was retained. Sequences were clustered by branches into 80 groups using a maximum threshold of 50% for sequence pair identity from a neighbor-junction tree performed with Clustal and accounting for 500 bootstraps (
• Felsenstein J.
). The global alignment was checked manually first among groups, then group by group, to improve gap extensions. Eighty protein sequences, one per cluster, were selected keeping human sequences whenever possible (42 out of 115 human serine proteases) or sequences from the closest species. Four sequences were also retained to cover the selection from a previous study (
• Krem M.M.
• Rose T.
• Di Cera E.
). The minimum sequence pair identity was 6% with an average of 27% and a median of 28%. Only two residues (Cys182, Gly196) are absolutely conserved over all sequences.
Bootstrap sampling was carried out from sequence alignment using Seqboot. Pairwise sequence relationships were quantified on 200 bootstraps using a distance matrix calculated with Protdist and based on the PAM1 substitution matrix (
• Felsenstein J.
). The average pair distance was 1.82. Trees were built from Fitch on the 200 bootstraps using the global rearrangement option (
• Felsenstein J.
,
• Fitch W.M.
• Margoliash E.
). The final tree was drawn using Consense (
• Felsenstein J.
). The average root distance was 1.46, the average sum of squares between sequence pair distances calculated from a tree and distances from its corresponding matrix was 43.00 and the average percent S.D. was 7.88. Seqboot, Protdist, Fitch, and Consense are components of the Phylip package, Version 3.57c (
• Felsenstein J.
).
Impact values were calculated for 230 positions corresponding to residues 16–245 of chymotrypsin using Equations Equation 1, Equation 2, Equation 3 in the text. Three-dimensional models of protein structures not available in the RCSB Protein Data Bank (
• Berman H.M.
• Westbrook J.
• Feng Z.
• Gilliland G.
• Bhat T.N.
• Weissig H.
• Shindyalov I.N.
• Bourne P.E.
) were built by homology modeling with Modeler (
• Sali A.
• Blundell T.L.
) and optimized by minimization with Discover (Accelrys, San Diego, CA). Substrate-protease models were built from the x-ray structure 1PPB and 1NRS of thrombin bound at the active site with H-d-Phe-Pro-Arg-CH2Cl (
• Bode W.
• Turk D.
• Karshikov A.
) or Leu-Asp-Pro-Arg (
• Matthews I.I.
• Ganesh V.
• Tulinsky A.
• Ishii N.
• Chen J.
• Turck C.W.
• Coughlin S.R.
• Fenton J.W.
) extended to Leu-Asp-Pro-Arg-Ser-Phe-Asn. The active site region was partitioned into specificity sub-sites from S4 through S3′. Accessible surface areas were calculated as described previously (
• Richards F.M.
).

RESULTS AND DISCUSSION

A phylogenetic tree associated with serine proteases of the chymotrypsin family is shown in Fig. 1. Sequences are grouped according to main functional properties, and their separation into branches is independent of how the root of the tree is chosen (
• Krem M.M.
• Rose T.
• Di Cera E.
). The contribution of each residue in the sequence to the definition of the tree was quantified in terms of an “impact” value. A key parameter associated with the tree is the sum of squares of the deviations between the distances of two sequences a and b in the tree, dab, and in the matrix of aligned sequences, mab, normalized by the sum of squares of the dab values, i.e.
$ς2=∑a=1N∑b=1N (dab−mab )2∑a=1N∑b=1Ndab2,$
Equation 1

where N is the number of positions in the sequence. For each position k in the sequence, the residue at that position was replaced by all 20 amino acids and a gap for all sequences in the alignment, and the resulting value of ς2 derived from the new set of mab values was calculated in each case using a PAM matrix. The dab values from the original parent tree generated by the alignment of unperturbed sequences were kept constant. The average value of ς$k2$ was expressed as the linear combination,
$ςk2=fk,Aςk,A2+fk,Cςk,C2+…+fk,Yςk,Y2+fk,gapςk,gap2$
Equation 2

where ς$k,x2$ is the value of ς2 obtained when residue at position k is replaced with the amino acid x = Ala, Cys, … Tyr, or a gap, and fk,x is the correct weighting factor associated with the substitution (
• Felsenstein J.
). The impact value of residue k is defined as follows
$Ik=exp(−‖ς2−ςk2‖)∑k=1Nexp(−‖ς2−ςk2‖)$
Equation 3

and is bound from 0 to 1. The impact of a set of p positions in the sequence is calculated from the sum of p terms as in Equation 3, divided by p. The impact value Ik is maximum when ς$k2$≈ς2, i.e. when position k contains a residue or gap that is absolutely conserved among all sequences or when the distribution of amino acids or gap substitutions at position k reproduces closely the functional branching of the parent tree. Therefore, the comparison of the residue or gap conservation at position k with the impact Ik enables the identification of residues with high impact and relatively low conservation that have been replaced during evolution to generate the functional diversity in the parent tree.
Fig. 2a shows the profile of impact values of individual positions along the sequence of the catalytic domain of a consensus serine protease. High values of the impact parameter tend to concentrate at few positions. The impact value has practically no correlation (r = 0.0, Fig. 2e) with the water accessibility of the residue. Notably, all residues with impact values >0.95 are ≥95% buried. Fig. 2,b and c, show the solvent-accessible surface of chymotrypsin color-coded according to the impact value and the residue conservation. The difference in color patterns indicates the poor correlation between impact values and residue conservation (Fig. 2,b and c, see also d). Several residues on the surface are poorly conserved (Fig. 2c, blue), but are associated with high impact values (Fig. 2b, white or red). Invariant or quasi-invariant positions such as the catalytic His57 or Ser195 (Fig. 1) have impact values close to 1, as expected from Equations 2 and 3. Notably, residues with high impact tend to concentrate at the active site.
The distribution of impact values in Fig. 2 suggests that most of the information necessary to construct the phylogenetic tree in Fig. 1 is concentrated in a few sites of the protein. This prompted an investigation of the smallest set of residues capable of reproducing the phylogenetic tree in its entirety. Impact values were calculated for secondary structural elements of the protease domain. The impact value of individual β-strands ranged from 0.40 to 0.64 (Fig. 3, b1 to b12). Helices are not conserved among serine proteases and were grouped for the analysis with loops connecting β-strands. These elements show impact values ranging from 0.41 to 0.67 (Fig. 3, a1 to a10). In no case, however, could the tree in Fig. 1 be reconstructed using single secondary structural elements, either β-strands, helices, or loops.
Next we calculated the impact values of patches of accessible surface area ranging 1100–1300 Å2 in size and plotted them versus the average conservation of residues (Fig. 3). More than 100 patches were scanned around each of the water-accessible residues on the surface of the protein. The area with maximum impact clearly stood out from the rest and involved the specificity sub-sites S1–S4. This area encompasses the smallest set of residues capable of reproducing the phylogenetic tree in Fig. 1 in its entirety. Remarkably, the resulting mab values obtained from the S1–S4 sub-sites give a better approximation of the phylogenetic tree in Fig. 1 than the entire sequence of the enzyme or any combination of secondary structural elements. We have recently shown that the 50 residues of the C-terminal domain of serine proteases are necessary and sufficient to reproduce the architecture of the parent tree in Fig. 1 (
• Krem M.M.
• Rose T.
• Di Cera E.
). Twenty out of 30 residues defining the S1–S4 sub-sites are comprised within the 50 C-terminal residues. We conclude that substrate recognition at the S1–S4 sub-sites dominates the evolution and functional diversity of serine proteases.
The definition of the impact value in Equation 3 is a powerful tool to identify sets of residues specifying functional branching in a phylogenetic tree. The information derived from this approach should assist in the rational engineering of enzymes with altered specificity by pointing to regions that likely hold structural determinants of function. In addition, the profile of impact values in a newly discovered gene in a protein family reveals targets for mutagenesis from which the function of the protein may be identified.

REFERENCES

• Gu X.
Mol. Biol. Evol. 1999; 16: 1664-1674
• Lichtarge O.
• Bourne H.
• Cohen F.E.
J. Mol. Biol. 1996; 257: 342-358
• Armon A.
• Graur D.
• Ben-Tal N.
J. Mol. Biol. 2001; 307: 447-463
• Landgraf R.
• Xenarios I.
• Eisenberg D.
J. Mol. Biol. 2001; 307: 1487-1502
• Altschul S.F.
• Schaffer A.A.
• Zhang J.
• Zhang Z.
• Miller W.
• Lipman D.J.
Nucleic Acids Res. 1997; 25: 3389-3402
• Thompson J.D.
• Gibson T.J.
• Plewniak F.
• Jeanmougin F.
• Higgins D.G.
Nucleic Acids Res. 1997; 25: 4876-4882
• Felsenstein J.
Phylip package, Version 3.57c. University of Washington, Seattle, WA1994
• Krem M.M.
• Rose T.
• Di Cera E.
J. Biol. Chem. 1999; 274: 28063-28066
• Felsenstein J.
Methods Enzymol. 1993; 266: 418-427
• Fitch W.M.
• Margoliash E.
Science. 1967; 155: 279-284
• Berman H.M.
• Westbrook J.
• Feng Z.
• Gilliland G.
• Bhat T.N.
• Weissig H.
• Shindyalov I.N.
• Bourne P.E.
Nucleic Acids Res. 2000; 28: 235-242
• Sali A.
• Blundell T.L.
J. Mol. Biol. 1993; 234: 779-815
• Bode W.
• Turk D.
• Karshikov A.
Protein Sci. 1992; 1: 426-471
• Matthews I.I.