The Structure of Leishmania mexicana ICP Provides Evidence for Convergent Evolution of Cysteine Peptidase Inhibitors

Clan CA, family C1 cysteine peptidases (CPs) are important virulence factors and drug targets in parasites that cause neglected diseases. Natural CP inhibitors of the I42 family, known as ICP, occur in some protozoa and bacterial pathogens but are absent from metazoa. They are active against both parasite and mammalian CPs, despite having no sequence similarity with other classes of CP inhibitor. Recent data suggest that Leishmania mexicana ICP plays an important role in host-parasite interactions. We have now solved the structure of ICP from L. mexicana by NMR and shown that it adopts a type of immunoglobulin-like fold not previously reported in lower eukaryotes or bacteria. The structure places three loops containing highly conserved residues at one end of the molecule, one loop being highly mobile. Interaction studies with CPs confirm the importance of these loops for the interaction between ICP and CPs and suggest the mechanism of inhibition. Structure-guided mutagenesis of ICP has revealed that residues in the mobile loop are critical for CP inhibition. Data-driven docking models support the importance of the loops in the ICP-CP interaction. This study provides structural evidence for the convergent evolution from an immunoglobulin fold of CP inhibitors with a cystatin-like mechanism.

groups of CP inhibitors (7). This raises the intriguing question: have the different groups of CP inhibitors evolved related tertiary structures to interact with the same target CPs in a similar fashion, as predicted from in silico analysis by Rigden et al. (8), or does ICP act in a different way?
ICP family members appear to occur in species from a very limited phylogenetic range (some parasitic protozoa, bacteria (including Pseudomonas aeruginosa), and Archaea, but no metazoa), suggesting that the genes have been acquired by horizontal gene transfer or retained for some special function. Although there is good evidence that the in vivo targets of ICPs are clan CA, family C1 CPs, it remains uncertain whether the main function of the inhibitors is to modulate the activity of enzymes of the parasite itself (as is suggested for the protozoan parasite Trypanosoma cruzi (9)) or of the host (as suggested for the related parasite Leishmania (10)). Interestingly, no clan CA, family C1 CPs appear to be present in the P. aeruginosa genome, which provides further support for the suggestion that one role of ICPs in pathogens might be to regulate host CP activity and so facilitate infection.
This has been investigated with Leishmania by gene targeting to create parasite lines that either lack or overexpress the ICP gene (10). ICP null mutants grow normally axenically in vitro and are as infective to macrophages in vitro as wild type parasites. However, they have reduced infectivity to mice. Lines that overexpress ICP also show markedly reduced virulence in vivo. Thus, ICP may be important after it is released in the interaction of the parasite with its mammalian host or as a mechanism of preventing damage from host CPs taken up by the parasite through endocytosis (e.g. from the parasitophorous vacuole, which contains host lysosomal enzymes).
The ICP proteins are small (~13 kDa) and very divergent, with L. mexicana ICP having only 31% identity with ICP of T. cruzi and 24% identity with ICP of P. aeruginosa (11). Nevertheless, there are highly conserved motifs that suggest important functional regions. This has facilitated the identification of predicted ICPs from genome data, and recombinant ICPs have been produced from the L. mexicana, T. brucei, Entamoeba histolytica, and P. aeruginosa genes and confirmed to have potent inhibitory activity toward CPs, notably cathepsin L homologues (7,11,12). To date, the structural basis of the inhibitory activity of ICP is unknown. Previous threading studies have suggested that the binding site of T. cruzi ICP may be located on the loops between β-strands in a fold that resembles immunoglobulin light chain variable domains (8,13). Another study drew parallels between the sequence conservation in predicted loops of the ICP family and the peptidase-binding regions of the cystatin family (12). We have now determined the structure of L. mexicana ICP in solution by NMR spectroscopy, confirmed residues key for its inhibitory activity using site-directed mutagenesis, and investigated how the key residues may bind to the model clan CA, family C1 peptidase papain, and an important L. mexicana CP, known as CPB (14).

EXPERIMENTAL PROCEDURES
Protein Production-Recombinant L. mexicana ICP was expressed from a pET28 (Novagen)-derived plasmid in Escherichia coli BL21 (DE3) cells as described previously (11). ¹⁵N,¹³C-labeled protein was produced by growing the cells in M9 medium using ¹⁵NH₄Cl and [¹³C]glucose (Spectra Stable Isotopes) as the sole nitrogen and carbon sources. The fusion protein was purified by nickel chelate chromatography and digested with thrombin (Novagen). The cleaved histidine tag and thrombin were removed by nickel chelate and benzamidine-Sepharose (Sigma) affinity chromatography. The protein comprising the complete native sequence (Q868H1³; CAD68975) with the addition of three residues (GSH) at the N terminus (designated ICP-(−2-113)) was buffer-exchanged into 25 mM sodium phosphate, pH 4.5, 50 mM NaCl, 0.001% NaN₃ by extensive diafiltration using a 5,000 MWCO centrifugal concentrator (Vivascience) and concentrated to ~1 mM. D₂O was added to a final concentration of 10% (v/v).
Interaction studies were carried out using papain from Papaya latex (Sigma) and L. mexicana CPB2.8ΔCTE, produced as described previously (14). In each case, peptidase was mixed with an excess of ¹⁵N-labeled ICP in NMR sample buffer, and the complex was isolated by gel filtration on a Superdex 75 HR10/30 column (APBiotech) and then concentrated using a 10,000 MWCO centrifugal concentrator.
NMR Spectroscopy and Data Analysis-Resonance assignments were determined using standard triple resonance NMR techniques and have been deposited as described (15). Distance restraints for structure calculation were derived from three-dimensional ¹⁵N and ¹³C HSQC-NOESY spectra acquired with 100 ms mixing times on an 800 MHz Bruker Avance spectrometer. Slowly exchanging amide protons were identified by redissolving a lyophilized sample in D₂O and recording a series of ¹⁵N HSQC spectra. Spectra were processed with AZARA⁴ and analyzed using CCPN analysis (16).
Structure Calculation-Assigned, partially assigned, and ambiguous NOESY cross-peaks were used to generate distance constraints within CCPN analysis that were exported directly to CNS/XPLOR format and used as input for structure calculations in CNS v1.1 (17) with a modified version of the PARALLHDG 5.3 forcefield (18) and IUPAC-recommended nomenclature (19). Structures were generated from random atomic coordinates following the scheme of the rand, dgsa, and refine scripts from the XPLOR 3.1 manual (20) reimplemented in CNS and modified to incorporate floating chirality at prochiral centers (21) using a Metropolis acceptance criterion. The tools provided by the ARIA (22) module within CNS were used to identify consistently violated restraints for checking, to reduce the ambiguity of ambiguous distance restraints based on the ensemble of structures calculated, to remove duplicate restraints, and to recalibrate the distance-intensity mapping. Distance restraints representing hydrogen bonds were incorporated for slowly exchanging amides where corroborating NOEs existed and an acceptor atom could be unambiguously identified. Atomic coordinates have been deposited at the PDB (PDB code 2c34), and structural statistics are summarized in Table 1.
¹⁵N Relaxation Measurements-Protein dynamics were probed through ¹⁵N relaxation measurements of the backbone amide groups. ¹⁵N T₁, T₂, and ¹H,¹⁵N-heteronuclear NOE experiments were recorded at 600.13 MHz (¹H), and the data were analyzed according to the Lipari-Szabo model-free formalism using the programs curvefit and modelfree (23).
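As an illustration of the quantity fitted in a model-free analysis, the following minimal sketch evaluates the simple Lipari-Szabo spectral density for a single order parameter and a single internal correlation time. It is not the curvefit or modelfree code used in this study; the function name and the example values are hypothetical, and the two-timescale extension needed for the most mobile residues is not covered.

```python
def spectral_density(omega, s2, tau_m, tau_e):
    """Simple Lipari-Szabo spectral density J(omega), in seconds.

    J(w) = (2/5) * [ S2*tau_m / (1 + (w*tau_m)^2)
                     + (1 - S2)*tau / (1 + (w*tau)^2) ]
    with 1/tau = 1/tau_m + 1/tau_e.

    omega : angular frequency in rad/s
    s2    : generalized order parameter S^2 (0 = unrestricted, 1 = rigid)
    tau_m : overall rotational correlation time in s
    tau_e : internal (effective) correlation time in s; 0 means
            internal motion too fast to contribute
    """
    # Effective correlation time of the internal-motion term.
    tau = 1.0 / (1.0 / tau_m + 1.0 / tau_e) if tau_e > 0 else 0.0
    overall = s2 * tau_m / (1.0 + (omega * tau_m) ** 2)
    internal = (1.0 - s2) * tau / (1.0 + (omega * tau) ** 2)
    return 0.4 * (overall + internal)


# Hypothetical values: a rigid strand residue versus a mobile loop residue,
# both in a protein tumbling with tau_m = 5 ns (assumed, not from the paper).
rigid = spectral_density(0.0, s2=0.85, tau_m=5e-9, tau_e=2e-11)
mobile = spectral_density(0.0, s2=0.40, tau_m=5e-9, tau_e=1e-9)
```

A lower order parameter reduces J(0), which is one way the lengthened T₂ values of flexible regions can be rationalized within this framework.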
Docking-Data-driven molecular docking was carried out with HADDOCK v1.3 (24) using the default parameters and chemical shift perturbation data as input. In the HADDOCK terminology, the "active" ICP residues are those whose backbone or side-chain amide chemical shifts are significantly perturbed in the complex with papain and which are surface-exposed. The "passive" ICP residues are their immediate, surface-exposed neighbors. In the absence of chemical shift perturbation data for the peptidase, and given the superficial similarity between the ICP and stefin B structures, the "active" papain residues were defined as those residues in close contact with stefin B in the crystal structure of the stefin B·papain complex. The passive papain residues are their immediate, surface-exposed neighbors. "Semiflexible" regions are defined to be two residues on either side of the active and passive residues in the primary sequences of both proteins. The ICP D-E loop was made "fully flexible" to reflect its high mobility in unbound ICP.
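The semiflexible-region rule described above can be sketched as a small helper. This is an illustrative reading of the definition, not HADDOCK code: the function name, its parameters, and the choice to include the active and passive residues themselves in the returned set are assumptions.

```python
def semiflexible_residues(active, passive, first, last, flank=2):
    """Return sorted residue numbers treated as semiflexible.

    Every active or passive residue is expanded by `flank` residues on
    either side along the primary sequence, clipped to the sequence
    range [first, last]. The active/passive residues themselves are
    included (an assumption of this sketch).
    """
    selected = set()
    for res in set(active) | set(passive):
        for offset in range(-flank, flank + 1):
            neighbour = res + offset
            if first <= neighbour <= last:
                selected.add(neighbour)
    return sorted(selected)


# Hypothetical example: one active and one passive residue in a
# 20-residue peptide; the two windows overlap and are merged.
example = semiflexible_residues(active=[10], passive=[13], first=1, last=20)
```

Passing `first` and `last` explicitly allows for constructs whose numbering does not start at 1, such as ICP-(−2-113).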
Mutations-Mutations of L. mexicana ICP were incorporated into the expression vector using the QuikChange site-directed mutagenesis kit (Stratagene) and the following pairs of complementary primers (mutated sites in lowercase): NT300, CCCGACCACTGGAgcCATGTGGACGCGC and NT301, GCGCGTCCACATGgcTCCAGTGGTCGGG to generate pBP191 (encoding Y34A); NT282, CATCCTCGACTCCTgacGacGGAGaTGGTGGCATCTAC and NT283, GTAGATGCCACCAtCTCCgtCgtcAGGAGTCGAGGATG to generate pBP193 (encoding M67D, V68D, V70D); NT298, CCTCGACTCCTATGGTGccAGTTccTcccATCTACGTTGTGCTCG and NT299, CGAGCACAACGTAGATgggAggAACTggCACCATAGGAGTCGAGG to generate pBP194 (encoding G69P, G71P, G72P); NT284, CTGGTCTACACGgcCCCCgcCGAGGGC-
³ UniProt number. ⁴ Available at www.bio.cam.ac.uk/azara.
Kᵢ Determinations-Kᵢ values for L. mexicana ICP and its mutants against L. mexicana CPB2.8ΔCTE were determined essentially as described previously (11). In summary, concentrations of ICP stock solutions were determined by titration against a known concentration of CPB2.8ΔCTE in a colorimetric assay using N-benzoyl-PFR-p-nitroanilide hydrochloride (Sigma) as a substrate under pseudoirreversible conditions. For mutants with a detectable inhibitor activity, IC₅₀ values were determined in a fluorometric assay using 10 μM benzyloxycarbonyl-FR-7-amino-4-methylcoumarin hydrochloride (ZFR-AMC, Sigma) as a substrate, and Kᵢ values were calculated using the relationship Kᵢ = IC₅₀ (
Circular Dichroism-Near (250-320 nm) and far (190-240 nm) UV CD analyses were performed with 0.6 mg/ml protein in 0.5 and 0.02 cm path length quartz cells, respectively, using a JASCO J-810 spectropolarimeter. Eight scans were collected for each protein using a bandwidth of 1.0 nm, a scanning speed of 50 nm min⁻¹, and a response time of 0.25 s. Data were corrected for cell path length and protein concentration and smoothed using the Savitzky-Golay algorithm (convolution width 9.0).
The secondary structure analyses of wild type and mutant proteins were obtained using the CDSSTR method (26) available from the Dichroweb website, at the University of Birkbeck. All three proteins gave similar secondary structure estimates with low NRMSD values (~0.025).
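The mutagenic primer pairs listed under Mutations are described as complementary, which can be checked mechanically. The sketch below is a minimal illustration with hypothetical function names; it takes the three complete primer pairs as printed (ignoring any line-break hyphens and treating the lowercase mutated bases like any other base) and tests whether each primer is the reverse complement of its partner. The truncated NT284 entry is left out.

```python
# Translation table for Watson-Crick complements of DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def normalise(seq):
    """Upper-case a primer and drop hyphens left by line breaks."""
    return seq.replace("-", "").upper()


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (5'->3')."""
    return normalise(seq).translate(_COMPLEMENT)[::-1]


def is_complementary_pair(forward, reverse):
    """True if `reverse` is exactly the reverse complement of `forward`."""
    return reverse_complement(forward) == normalise(reverse)


# Primer pairs as given in the text (plasmid name -> (forward, reverse)).
PRIMER_PAIRS = {
    "pBP191": ("CCCGACCACTGGAgcCATGTGGACGCGC",
               "GCGCGTCCACATGgcTCCAGTGGTCGGG"),
    "pBP193": ("CATCCTCGACTCCTgacGacGGAGaTGGTGGCATCTAC",
               "GTAGATGCCACCAtCTCCgtCgtcAGGAGTCGAGGATG"),
    "pBP194": ("CCTCGACTCCTATGGTGccAGTTccTcccATCTACGTTGTGCTCG",
               "CGAGCACAACGTAGATgggAggAACTggCACCATAGGAGTCGAGG"),
}

results = {name: is_complementary_pair(fwd, rev)
           for name, (fwd, rev) in PRIMER_PAIRS.items()}
```

A check of this kind is a quick way to catch transcription errors in primer lists before ordering oligonucleotides.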

RESULTS
ICP Adopts an Immunoglobulin Fold-In solution, ICP adopts an immunoglobulin-like (Ig) fold with seven β-strands (Fig. 1). One β-sheet is formed by anti-parallel strands B, E, and D, whereas the other is formed by anti-parallel strands G, F, and C with strand A parallel to strand G. A search for structural homologues using the DALI server (27) identified the N-terminal Ig domain from α-dystroglycan (PDB:1u2c (28)) as the closest match with a Z-score of 5.1 and an RMSD of 3.2 Å over 82 residues. The SCOP database (29) classifies this as a cadherin-like superfamily fold, which also closely resembles the I-set immunoglobulin-like fold (30,31).
The β-sheets enclose a well packed core of principally hydrophobic residues that correspond to the alternating pattern of conserved residues along the strands (see Fig. 2). The C-D loop of ICP folds back on strand C with a number of hydrophobic side-chains making contacts that are not part of the main core of the molecule. These interactions involving residues Met-35, Thr-37, Val-39, Val-42, Val-91, Thr-93, Pro-95, and Ile-99 serve to pin the F-G loop back to the outer face of one β-sheet. The prediction of the ICP family structure based on threading (13) correctly identified the Ig-like fold, but the details of the predicted strands and loops were out of register after the C-D loop, possibly because of the misidentification of the hydrophobic residues in the C-D loop as representative of an extended C strand.
The three highly conserved groups of residues previously noted (8,11,12) are all located in loops at one end of the molecule (Fig. 1B). The GNPTTGY motif lies in the B-C loop, the GXGG motif lies in the highly mobile D-E loop and the RPW/F motif in the F-G loop. The co-location of all three motifs, and the finding that none of the residues appear to have roles that are key to the integrity of the fold of the protein, suggests that they together form the CP-binding site.
Residues 30-32 in the B-C loop form a turn of 3₁₀-helix projecting the side chains of Asn-29, Thr-31, and Thr-32 toward the solvent. Gly-28 and Gly-33 allow the loop to be accommodated between the B and C strands by adopting conformations with positive dihedral angles. The conserved Tyr-34, which caps the main hydrophobic core of the protein, interacts with residues in the F-G loop, and contributes to the solvent-exposed surface at this end of the protein.

[Figure caption, beginning lost: ...-(6-113). Lipari-Szabo order parameters and internal correlation times plotted as a function of residue number. Two order parameters are shown for the residues that were best fit with two timescales of internal motion. The locations of the β-strands are indicated by arrows.]
compactly folded monomeric protein of this size, and their behavior could be modeled using either a single order parameter (S²) or S² and an internal correlation time. The exceptions are the N terminus up to residue 13 and residues 61-72 in the D-E loop, which display depressed NOE values and lengthened T₂ values. These residues were best modeled with two order parameters representing internal motion on distinct timescales and an internal correlation time of the order of a nanosecond. In addition, residue Asn-29, which could not be well fitted by any model, has a significantly shortened T₂ value indicative of millisecond timescale chemical exchange, which may reflect backbone flexibility or, given the proximity of Tyr-92, may simply be because of the combination of local motion and the ring current shift from this aromatic side chain.
The Conserved Motifs in the Loops Are Involved in CP Binding-Chemical shift perturbation mapping was used to identify the CP-binding site on ICP. Complexes of ¹⁵N-labeled ICP with L. mexicana CPB and with P. latex papain were purified, and their ¹⁵N HSQC spectra were compared with that of free ICP. A similar subset of backbone HN and side chain cross-peaks is perturbed from the free ICP chemical shifts in both complexes, and the most distinctive are shown in Fig. 3 for the papain complex. It is clear that all the most perturbed residues fall in the ranges 28-37 (covering the B-C loop), 59-74 (covering the D-E loop), and 94-104 (covering the F-G loop). Chemical shifts are sensitive to changes in chemical environment caused either by proximity to a ligand or by propagated structural changes. The co-location of all the significant changes in the vicinity of the three conserved loops strongly suggests that this is because of their involvement in binding to target CPs.
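A perturbation mapping of this kind can be sketched in a few lines. The residue ranges below are the ones quoted in the text; everything else is an assumption for illustration: the function names are hypothetical, and the combined ¹H/¹⁵N shift metric with a nitrogen scaling factor of 0.2 is one common convention, not necessarily the one used in this study.

```python
import math

# Residue ranges of the most perturbed regions, as quoted in the text.
LOOPS = {
    "B-C": (28, 37),
    "D-E": (59, 74),
    "F-G": (94, 104),
}


def combined_shift(delta_h, delta_n, n_weight=0.2):
    """Combined amide shift change in ppm.

    sqrt(dH^2 + (n_weight * dN)^2); the weighting of the 15N shift
    (n_weight = 0.2) is an assumed convention for this sketch.
    """
    return math.sqrt(delta_h ** 2 + (n_weight * delta_n) ** 2)


def loop_of(residue):
    """Return the loop label whose range contains `residue`, else None."""
    for name, (start, end) in LOOPS.items():
        if start <= residue <= end:
            return name
    return None


def group_by_loop(residues):
    """Group residue numbers by loop; residues outside all ranges go to None."""
    groups = {}
    for res in residues:
        groups.setdefault(loop_of(res), []).append(res)
    return groups
```

For example, `group_by_loop([31, 69, 96, 50])` places residues 31, 69, and 96 in the B-C, D-E, and F-G loops, respectively, and leaves residue 50 unassigned.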
Residues in the D-E Loop Are Critical for CP Binding-Based on our L. mexicana ICP structure and on sequence similarity between ICP family members, we made a limited number of mutant ICPs to test their effect on the inhibitory activity of ICP. In the highly conserved B-C loop, we mutated the pair of threonines to glycine (T31G,T32G) and mutated the conserved tyrosine at the base of the loop to alanine (Y34A). In the flexible D-E loop, we made two triple mutants, one to replace the semiconserved hydrophobic residues with charged residues (M67D,V68D,V70D) and the other to replace the glycines by more conformationally restricted residues (G69P,G71P,G72P). In the F-G loop, we made a double mutant of the conserved arginine and aromatic residues (R94A,F96A).
The D-E loop mutants lacked inhibitory activity against L. mexicana CPB. To ensure that these mutants had retained their three-dimensional fold, we analyzed them using CD spectroscopy. Wild type ICP and mutant M67D,V68D,V70D possess superimposable CD spectra in both the near and far UV regions, indicating very similar, if not identical, secondary and tertiary structures. Mutant G69P,G71P,G72P gave rise to a slightly different far UV spectrum, while retaining similar spectral features in the near UV region, and CDSSTR analysis placed the regular secondary structure estimates within a few percent of the wild type ICP. Of the other mutants, the B-C loop mutations resulted in somewhat lower Kᵢ values, whereas the Kᵢ of the F-G loop mutant was ~3-fold higher (see Table 2).
A Model for the Interaction between ICP and CPs-The identification of the three conserved loops as the CP-binding site prompted a comparison of the ICP structure with other CP inhibitors. A slight, but suggestive, structural similarity was detected with the cystatin, stefin B (see below), where the CP-binding site is formed by a short central loop flanked by a flexible N-terminal peptide and a longer flattened loop that incorporates a proline (see Fig. 4). We speculated that ICP might bind target CPs in a similar fashion with the flexible D-E loop binding in place of the N-terminal peptide of stefin B. To produce a model for the interaction between ICP and a CP, we used the chemical shift perturbation data for ICP in complex with papain, together with information from the crystal structure of the complex between papain and stefin B, to drive a docking simulation using HADDOCK (24). The input data dictated that ICP should bind into the active site of papain but did not, a priori, dictate that ICP should bind the peptidase in a stefin-like orientation. Of the 200 lowest energy structures from the HADDOCK run, all adopted a stefin-like mode of binding (Fig. 4) with residues 67-70 from the D-E loop filling the "unprimed" sites of the active site cleft, whereas the B-C loop approaches the active site cysteine-histidine pair, and side chains from residues in the F-G loop fill the "primed" end of the active site cleft. The preferred orientation for the interaction additionally brings the C-D loop into contact with the short helix (residues 139-143) on the top of the R domain of papain. The feasibility of this contact is borne out by the observation that the chemical shifts of residues 43 and 46 in the C-D loop can clearly be seen to change on complex formation (Fig. 3).

DISCUSSION
This study has shown that ICP has an Ig fold that acts as a scaffold for three interstrand loops carrying the most highly conserved residues in the chagasin family of ICP proteins. These three loops are located adjacent to one another at one end of the protein with the most highly conserved B-C loop lying between the structured F-G loop and the flexible D-E loop. Together, the three loops form a ridge that is of the correct dimensions to be complementary to the active sites of clan CA, family C1 CPs. Chemical shift perturbation studies identify these loops as forming the binding interface with the two representative CPs, papain and L. mexicana CPB.

FIGURE 4. Model for the interaction of ICP and cysteine peptidases (coloring as in Fig. 1). A, structural similarity of the ICP CP-binding loops to stefin B (green). B, the interface between ICP and papain modeled using HADDOCK. Peptidase substrate recognition sites and active site residues are colored on the surface of the peptidase. S2, light blue; S1, dark blue; cysteine, yellow; histidine, cyan; S1′, pink. ICP loops are labeled and colored. C and D, two views of the predicted ICP-papain interaction in comparison with the stefin-papain complex (4). The complexes are shown in the standard view (5) (C) and rotated 90° (D) such that the peptidase substrate would run from left to right.
Comparison of the ICP structure with CP inhibitors of other families reveals no similarity at the level of the overall fold. However, there are suggestive similarities between the CP-interacting regions of ICP and the cystatin family of inhibitors. The ICP F-G loop (residues 93-103) bears a structural resemblance to the second binding loop of the tripartite wedge of stefins A and B (4,32), and superposing the structures on these features brings the B-C loop and C-terminal end of ICP strand B into the vicinity of the first interacting loop and strand B of the stefins (Fig. 4A). Superposition of both these structural features places the flexible D-E loop of ICP in the vicinity of the stefin N-terminal trunk (Fig. 4A). This suggests that the three conserved ICP loops may form a similar tripartite wedge to that of the cystatin family, both adapted to fit the active site cleft of clan CA, family C1 CPs. The implications of this similarity would be that ICP recognizes target CPs with the flexible D-E loop adapting to the recognition sites for the substrate residues N-terminal to the cleavage site (thought to be a main determinant of specificity for this group of CPs) and the F-G loop binding to the distal end of the substrate recognition groove. The importance of these interactions for the binding of ICP is demonstrated by our data for the ICP mutants (Table 2). In particular, the changes to the D-E loop abrogated inhibitory activity. It should be highly informative to express this "dead" mutant in L. mexicana so that the role of ICP in the parasite and its interaction with its host can be studied. Additional support for the importance of the F-G loop is provided by the observation that ICP inhibits cathepsin B considerably less well than cathepsin L (7). This is also a feature of stefins (33) and can be explained by the steric hindrance afforded by the occluding loop, which is characteristic of cathepsin B-like CPs and is partially responsible for their substrate specificity.
To investigate the feasibility of this model, we carried out data-driven docking between ICP and papain with the assumption that ICP would contact the same region of the active site as the cystatins, because the flexibility of the D-E loop makes the results of any less biased approach, such as rigid body docking, difficult to evaluate. The HADDOCK modeling surprisingly revealed just one clearly favored orientation for the interaction, in which the fold of ICP places the highly conserved NPTT sequence motif in the B-C loop close to catalytic residues of the target CP. The ICP B-C loop is longer and bulkier than the first interacting loop of the cystatin family and, given its disposition relative to the loops on either side, may interact intimately with the CP active site residues. Thus it is surprising that the two B-C loop mutations had no deleterious effect on inhibitory activity despite the high level of sequence conservation between bacterial and protozoan ICPs in this region. However, the changes introduced into the mutants both decreased the bulk of the B-C loop. It is known that CP active sites are expanded when in complex with stefins (4,5), and so, taking into account the bulkiness of the ICP B-C loop, wild type ICP may be larger than is optimal for the tightest interaction with this peptidase active site. The most likely explanation for this is that the loop is optimized for inhibition of an as yet untested CP other than those investigated, such as a host CP that has a wider active site. This suggestion is consistent with previous findings derived by generating mutants of Leishmania that lack or overexpress ICP (10).
Proteins with Ig folds are rare in species other than higher eukaryotes and the viruses that infect them. Hitherto they seemed to be entirely absent from protozoa. ICP is the first protein with a cadherin-like Ig domain to be discovered in a non-metazoan. This raises the possibility that ICPs were acquired by the pathogens from their hosts, although for this to be the case the event would have had to happen on multiple occasions, as gene transfer between the parasites seems unlikely. A more likely explanation is that a crucial role in the host-pathogen interaction has ensured retention of an otherwise unimportant gene by the pathogens.
Most cadherin-like Ig domains in cell-adhesion molecules mediate protein-protein interactions. However, for the cadherin domains themselves, these principally involve the faces of the β-sheets (34). This differs from the binding mode of ICP, which is more akin to antigen recognition by Ig variable domains. The finding that ICP has an inhibitory mechanism potentially rather similar to the cystatins, despite the unrelated primary structures and folds of the proteins, clearly demonstrates the likelihood that there has been convergent evolution. This apparent similarity also suggests that this type of protein-protein interaction is efficient for the regulatory functions of these inhibitors and, moreover, perhaps that such a tripartite binding mechanism is required.
The discovery of the binding mode of ICP may also be informative for the design of chemical inhibitors of CPs that could have therapeutic value, both against pathogens and also in the treatment of CP-related mammalian disorders. Perhaps a compound that mimics the binding of the D-E loop and also has an active site-directed nucleophile would show good activity against the CPs. Indeed the peptidyl vinylsulphone compound currently in development for use against T. cruzi infections (35,36) presumably is acting in just this way. The discovery of the physiological targets of each ICP should allow design of inhibitors that have optimal specificity for these target CPs and therefore potentially have value in therapies directed against them.