Integrating Structure, Bioinformatics, and Enzymology to Discover Function

The protein complement of both prokaryotes and eukaryotes remains largely uncharacterized. At least 30% of all proteins have no known biochemical function, and a larger percentage have sequence similarity to proteins of known biochemical activity (e.g. most predicted protein kinases) but for which the physiological role is unknown. The challenge in the postgenomic era is to define both the biochemical and physiological functions of all proteins as rapidly as possible.
Structural proteomics, the large scale determination of protein structure, is expected to provide insight into the fundamental mechanisms by which a protein sequence adopts a defined three-dimensional structure. Most of the organized efforts in structural proteomics (Ref. 1; rcsb.org/pdb/strucgen.html) specifically target protein sequences for which there is no known structural homologue in the public data bases at a level of 30% sequence identity. One aim of this effort is to more fully define the universe of protein folds. Importantly, because protein structure is often conserved in the absence of detectable sequence homology, the comparison of new protein structures with those of known proteins will likely provide clues to biochemical function.
The discovery of biochemical function from a new protein structure begins with automated searches for structural homologues of known function. The results of these comparisons are provided as lists with significance scores. The methods of comparison are now used routinely in the structural community and have proved invaluable for detecting structural conservation and for providing the basis for hypotheses (2). However, the interpretation of the results from structural comparisons often consumes a significant amount of time and is influenced by the extent to which the investigator is able to scour the literature.
In an effort to improve the process by which function is derived from structure, we have combined two methods to facilitate functional studies. First, we have employed a data base of structural templates derived from the active sites of 189 different classes of enzymes. This exploits the fact that the chemistry of the reaction restricts the types and the topological arrangement of the catalytic amino acids and hence results in strong conservation of their spatial arrangement, even where the protein folds are very different (3). By focusing on the catalytic moieties, functional similarities can be detected in cases where there is no similarity in sequence, fold, or secondary structure. Second, we have created and used a panel of generic biochemical assays to test the functional hypotheses raised by the structural comparisons. These assays are based on simple, often nonphysiological, substrates; the experiment is designed to reveal the chemistry of the active site and not the cellular substrate.
Here we present the results of the combined structural, bioinformatic, and enzymatic analysis of Escherichia coli BioH, a target within the Midwest Center for Structural Genomics (www.mcsg.anl.gov). By comparing the crystal structure of BioH with other known enzymes, we found that BioH is a member of the protein hydrolase superfamily and contains a classical Ser-His-Asp catalytic triad. A screen with different hydrolase substrates revealed that BioH has significant carboxylesterase activity, with a preference for short acyl chain substrates, and weak thioesterase activity. The strategy used for BioH might facilitate analysis of novel, uncharacterized proteins and structures arising from structural proteomics projects.

BioH Expression and Purification-The open reading frame of bioH was amplified by PCR from E. coli DH5α genomic DNA. The gene was cloned as previously described (4) into the NdeI and BamHI sites of a modified form of pET15b (Novagen) in which a TEV protease cleavage site replaced the thrombin cleavage site and a double stop codon was introduced downstream from the BamHI site. The fusion protein was overexpressed and purified using nickel affinity chromatography as previously described (4).
For the preparation of the selenomethionine enriched protein, BioH was expressed in the E. coli methionine auxotroph strain B834 (DE3) (Novagen) in supplemented M9 medium. The sample was prepared under the same conditions as the native protein except for the addition of 5 mM 2-mercaptoethanol to the purification buffers.
Crystallization-BioH was crystallized by vapor diffusion in hanging drops (ratio of 2 μl of protein to 2 μl of precipitant) equilibrated against a reservoir containing 1.2 M sodium citrate trihydrate and 0.1 M Tris-HCl (pH 8.0). X-ray quality crystals grew at 21°C in 2-5 days. For diffraction studies, the crystals were stabilized with the crystallization buffer supplemented with 15% ethylene glycol as a cryoprotectant and flash frozen in liquid nitrogen.
Mass Spectrometry-All of the mass spectrometry data were acquired and analyzed using MassLynx 3.5 (Micromass, Manchester, UK). Electrospray ionization mass spectrometry (ESI-MS) was performed on a Micromass Q-Tof2 mass spectrometer. Positive ion mode ESI-MS of the whole protein was achieved in 50:50 acetonitrile:water with 0.1% formic acid. Exact mass MS was performed in negative ion mode regular ESI-MS using 10% aqueous methanol containing 1% ammonia as a carrier solvent. Tryptic digestions were performed overnight in 100 mM ammonium bicarbonate (pH 7.8) or in 100 mM ammonium bicarbonate buffer (pH 6.4) for 1.5 h followed by MALDI-MS analysis. MALDI-MS was performed on a Micromass MALDI-R mass spectrometer (Micromass) using an m/z range of 500-4000. ESI-MS and MS/MS analysis of the low pH tryptic digest were performed on a Micromass Q-Tof2 mass spectrometer using nano-LC with a C18 column (0.3 × 5 mm; LC Packings). Data-dependent acquisition parameters were set to select the doubly and triply charged unmodified and modified precursor ions corresponding to residues 78-100 of the protein. MS/MS spectra were processed by base-line subtraction and deconvoluted using the MaxEnt3 module of MassLynx 3.5. The peptide sequences were determined semi-automatically from the resulting singly charged, deisotoped spectra using PepSeq, version 3.3, supplied with MassLynx 3.5.
Crystallographic Data Collection-A two-wavelength multiple-wavelength anomalous dispersion experiment was carried out on the 19ID line of the Structural Biology Center at Advanced Photon Source (Argonne, IL). All of the crystallographic data were collected at 110 K on one crystal containing selenomethionine-substituted protein. The crystal belongs to the tetragonal space group P4₃ with unit cell dimensions a = b = 75.2 Å, c = 49.3 Å, α = β = γ = 90°. The multiple-wavelength anomalous dispersion data set was collected using an inverse beam strategy at the selenium absorption peak energy (0.97947 Å) and at a remote wavelength (0.95373 Å). The absorption edge was determined from the x-ray fluorescence spectrum and the f′ and f″ plots versus energy obtained with the program CHOOCH (12). High resolution data were collected from the unexposed part of the same crystal, which had been stored in liquid nitrogen. All of the data were measured with a CCD detector (13) with a 210 × 210-mm² sensitive area and fast duty cycle. Control of the experiment, data collection, and visualization was done with d*TREK (14), and all of the data were integrated and scaled with the program package HKL2000 (15). Some of the basic statistics of data collection and processing are given in Table I.
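As a rough plausibility check on the crystal parameters above, the Matthews coefficient can be estimated from the unit cell and the protein mass. The sketch below is illustrative only: it assumes one BioH molecule per asymmetric unit and four symmetry operators for P4₃, uses the 29,152-Da full-length mass reported from mass spectrometry, and applies the common approximation that the solvent fraction is 1 - 1.23/Vm. The function and variable names are hypothetical and are not part of the original analysis.

```python
# Unit cell dimensions from the text (tetragonal, a = b).
CELL_A = 75.2   # Angstrom
CELL_C = 49.3   # Angstrom

# Full-length protein mass reported by mass spectrometry.
MASS_DA = 29152.0

# Assumptions (not stated in the text): P4(3) has 4 symmetry operators,
# and the asymmetric unit holds a single protein molecule.
N_SYMOPS = 4
MOLECULES_PER_ASU = 1


def cell_volume_tetragonal(a, c):
    """Volume of a tetragonal cell (all angles 90 degrees), in cubic Angstrom."""
    return a * a * c


def matthews_coefficient(volume, mass_da, n_symops, n_per_asu):
    """Crystal volume per unit of protein mass, in cubic Angstrom per Da."""
    return volume / (mass_da * n_symops * n_per_asu)


def solvent_fraction(vm):
    """Approximate solvent fraction from the Matthews coefficient."""
    return 1.0 - 1.23 / vm


volume = cell_volume_tetragonal(CELL_A, CELL_C)
vm = matthews_coefficient(volume, MASS_DA, N_SYMOPS, MOLECULES_PER_ASU)
solvent = solvent_fraction(vm)

print(f"cell volume      : {volume:.0f} A^3")
print(f"Matthews coeff.  : {vm:.2f} A^3/Da")
print(f"solvent fraction : {solvent:.2f}")
```

Under these assumptions the value falls in the range typically seen for protein crystals, which is consistent with a single molecule in the asymmetric unit.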
Structure Determination-Multiple-wavelength anomalous dispersion phasing of BioH data was carried out with the program CNS (16). Experimental phases were extended from 2.5 to 2.0 Å resolution with density modification, using data collected at the f″ peak wavelength. With these improved phases, the initial model was built with the program ARP/wARP (17). The high quality of the phases allowed 94% of the main chain to be built automatically and most of the side chains to be placed with a confidence level of 79%. The remainder of the model was built, and all of the side chains were corrected manually using the program O (18). This model was then refined against the 1.7 Å resolution data with several macro cycles of CNS, including simulated annealing, B-factor, and positional refinements. After each macro cycle, the model was inspected, and corrections and/or additions were made  (20) suite of programs. The phasing and refinement parameters are shown in Table II.
Coordinates-The coordinates have been deposited in the Protein Data Bank under accession code 1M33.

RESULTS AND DISCUSSION
The BioH Crystal Structure-The final model of the BioH crystal structure consists of 256 residues, two molecules of ethylene glycol, and 240 water molecules. The last two residues of the model, Gly 257 and Ser 258 , were appended to the protein as a result of the cloning strategy. The first two residues of the native sequence, Met 1 and Asn 2 , were not included because of the absence of the corresponding electron density. Two molecules of ethylene glycol from the cryoprotectant were bound to the protein molecule, mostly through hydrophobic interactions. Residue 100 was unambiguously identified as Arg, instead of Gln, in electron density maps and is likely a PCR-induced mutation. The side chains of residues Glu 116 , Lys 121 , Asp 123 , Phe 136 , Glu 152 , and Lys 213 have incomplete electron density.
A small auxiliary domain is formed by the C-terminal segment of the polypeptide chain (Cys 110 -Asp 187 ) and is inserted into the catalytic domain. The auxiliary domain contains four α-helices, residues 122-134 (α4), 136-145 (α5), 155-166 (α6), and 173-185 (α7), that create a bundle of two V-shaped bends (Fig. 1B). The two domains are connected by a hinge region near Cys 110 and Asp 187 . The interface between domains is stabilized by multiple hydrophobic interactions, including helices α6 and α7 that run across the surface of the catalytic domain, and by intramolecular hydrogen bonds between the carbonyl of Pro 109 and the nitrogen of Leu 188 and two hydrogen bonds between Asp 187 and Arg 189 .
Automated Structural Bioinformatics Reveals a Ser-His-Asp Catalytic Triad-One of the aims of structural proteomics is to perform more comprehensive automated analysis of protein structures to reduce the level of time-intensive human intervention. To screen new structures for potential catalytic function, we have created a data base of ~189 three-dimensional enzyme active site structural templates. The BioH structure was scanned against this data base using the TESS program (3). This automated search gave a close match of BioH to the Ser-His-Asp catalytic triad of lipases (21) (EC 3.1.1.3). The BioH residues involved (Ser 82 , His 235 , and Asp 207 ) matched the template with a root mean square deviation of 0.28 Å for the overlaid side chains (Fig. 2). This is well within the cut-off of 1.2 Å used for discriminating true from false matches for this template. The presence of the catalytic triad suggested that BioH might possess lipase, protease, or esterase activity. Furthermore, the serine nucleophile (Ser 82 ) is located within one of the two earlier identified Gly-Xaa-Ser-Xaa-Gly motifs (22), which is typical for acyltransferases and thioesterases.
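The acceptance criterion described above (a root mean square deviation below a template-specific cut-off) can be illustrated with a minimal sketch. This is not the TESS algorithm; it only shows the final scoring step and assumes the matched atoms have already been superposed onto the template. The coordinates are hypothetical placeholders, and only the 1.2 Å cut-off and the 0.28 Å reported value come from the text.

```python
import math

# Cut-off quoted in the text for the Ser-His-Asp template.
RMSD_CUTOFF = 1.2  # Angstrom


def rmsd(coords_a, coords_b):
    """RMSD between two equally sized lists of (x, y, z) tuples.

    Assumes the two sets are already superposed and listed in the
    same atom order.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def is_template_match(value, cutoff=RMSD_CUTOFF):
    """True when the deviation falls below the template cut-off."""
    return value < cutoff


# Hypothetical template and query atoms, for illustration only.
template_atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
query_atoms = [(0.1, 0.0, 0.0), (1.5, 0.2, 0.0), (1.4, 1.5, 0.1)]

deviation = rmsd(template_atoms, query_atoms)
print(f"RMSD = {deviation:.2f} A, match = {is_template_match(deviation)}")
print(f"reported BioH value 0.28 A, match = {is_template_match(0.28)}")
```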
The structure of BioH was also compared with all other known structures using conventional methods such as the DALI algorithm (23). The results from the DALI search revealed structural homology to a large number of proteins with a broad range of enzymatic functions. The closest matches with strong structural similarities include a bromoperoxidase (EC 1.11.1.10; Z score, 22  A comparison of BioH with a chloroperoxidase (EC 1.11.1.10) is shown in Fig. 3. The sequence identities between BioH and these proteins range from 15 to 25% and therefore do not suggest a specific catalytic function for BioH. Further manual analysis of these enzymes and literature review would have revealed to the expert that each contains a Ser-His-Asp catalytic triad in its active site.
Ser 82 Is Covalently Modified by a Hydrolase Inhibitor-The structural informatics provided initial evidence for the location of the BioH catalytic site. The experimental density maps also showed an unusual feature that extended from the side chain of Ser 82 (Fig. 4). The shape of the density and its environment suggested that the corresponding compound was covalently
FIG. 2. Superposition of the Ser 82 , His 235 , and Asp 207 residues onto the catalytic triad template. The template side chains are depicted by the thicker, transparent bonds, whereas the BioH residues are represented by the thinner, solid bonds and include the main chain atoms. The root mean square deviation between equivalent atoms in the template and matched side chains is 0.28 Å.

FIG. 3. Superposition of the E. coli BioH (magenta) and Streptomyces aureofaciens chloroperoxidase (yellow) structures; superposition of the catalytic domains, including the loops connecting secondary structure elements. The overall architecture of the auxiliary domains is the same for both proteins, although the placement of the helices, especially of α5 and α6, relative to the catalytic domain differs.
To investigate the properties of the Ser 82 modification, we analyzed the full-length and trypsinized BioH with mass spectrometry. Under denaturing conditions, two major peaks of similar intensity were observed, with molecular masses of 29,152 Da (corresponding to the full-length protein) and 29,306 Da. Treatment of the protein with mild base caused the peak at 29,306 Da to disappear over time and the peak at 29,152 Da to increase in relative intensity. In addition, a new peak was detected with a mass of 172 Da, interpreted as a singly hydrated 154-Da molecule (see below). We also examined the mass of the tryptic fragment of BioH that contains Ser 82 . When the tryptic digestion was done under slightly acidic conditions and examined by both MALDI and ESI-MS, only the Ser 82 -containing fragment showed a 154-Da adduct. Therefore the catalytic potential of Ser 82 seems responsible for the observed additional mass attached to Ser 82 .
For crystallographic experiments and initial MALDI and ESI-MS, the BioH protein was purified in the presence of the protease inhibitor phenylmethylsulfonyl fluoride (PMSF), which is known to react with the catalytic serine in hydrolases (24) and form a stable covalent adduct. Therefore it appears that BioH was modified during purification. The protein purified in the absence of PMSF did not reveal this modification. These results strongly suggest that the modification corresponds to the addition of PMSF (expected Δm = 154) at Ser 82 and that the serine possesses nucleophilic properties.
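The mass arithmetic behind this assignment is simple enough to check directly. The sketch below uses only the masses reported in the text plus the nominal mass of water (18 Da); the variable names are illustrative.

```python
# Whole-protein masses observed under denaturing conditions (Da).
UNMODIFIED_DA = 29152
MODIFIED_DA = 29306

# Expected mass shift for the PMSF adduct, as quoted in the text (Da).
EXPECTED_ADDUCT_DA = 154

# Nominal mass of one water molecule (Da).
WATER_NOMINAL_DA = 18

# Mass difference between the two whole-protein peaks.
observed_adduct = MODIFIED_DA - UNMODIFIED_DA

# Species released by mild base, interpreted as the adduct plus one water.
released_species = observed_adduct + WATER_NOMINAL_DA

print(f"observed mass shift : {observed_adduct} Da")
print(f"matches expected    : {observed_adduct == EXPECTED_ADDUCT_DA}")
print(f"hydrated adduct     : {released_species} Da")
```

The difference between the two protein peaks equals the expected adduct mass, and adding one water reproduces the 172-Da peak seen after base treatment.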
BioH Is a New Carboxylesterase in E. coli-BioH purified in the absence of PMSF was subjected to several enzymatic assays that focused on hydrolase function, including carboxylesterase, lipase, thioesterase, phosphatase, endopeptidase, aminopeptidase, and bromoperoxidase. BioH demonstrated significant carboxylesterase activity (Table III) (EC 3.1.1.1) and hydrolyzed p-nitrophenyl esters of fatty acids. The enzyme showed a rather narrow pH optimum (8.0-8.5) and broad substrate specificity with a preference for short chain substrates (Fig. 5). The kinetic parameters of BioH were determined for several substrates (Table III). These results demonstrate that although BioH was most active with pNP-acetate, the Km for all C-2-C-6 substrates was essentially the same. In agreement with the results of the mass spectrometry and crystallography, BioH was strongly inhibited by PMSF (10.5% of residual activity after 10 min of incubation with 2 mM PMSF). Purified BioH showed classical Michaelis-Menten kinetics, and linear double reciprocal plots were obtained for all of the pNP substrates tested (data not shown). Purified BioH showed low enzymatic activities for thioesterase (using palmitoyl-CoA as a substrate; 186.5 ± 18.6 nmol/min/mg protein), lipase (using olive oil; 18.5 ± 1.3 nmol/min/mg protein), and aminopeptidase (using leucine-p-anilide as a substrate; 3.8 nmol/min/mg protein) and showed no detectable enzymatic activity for phosphatase (using p-nitrophenyl phosphate as a substrate), trypsin-like endopeptidase (using benzoyl-arginine-p-nitroanilide as a substrate), or bromoperoxidase (phenol red and monochlorodimedone as potential substrates).
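For readers unfamiliar with double reciprocal analysis, the sketch below shows how Km and Vmax can be recovered from a linear fit of 1/v against 1/[S]. It is a generic illustration, not the authors' fitting procedure: the kinetic constants and substrate concentrations are hypothetical, and the rates are generated from the Michaelis-Menten equation rather than taken from Table III.

```python
def michaelis_menten(s, vmax, km):
    """Initial rate predicted by the Michaelis-Menten equation."""
    return vmax * s / (km + s)


def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def lineweaver_burk(substrate, rates):
    """Estimate (vmax, km) from a double reciprocal plot.

    On this plot the slope equals Km/Vmax and the intercept equals 1/Vmax.
    """
    inv_s = [1.0 / s for s in substrate]
    inv_v = [1.0 / v for v in rates]
    slope, intercept = linear_fit(inv_s, inv_v)
    vmax = 1.0 / intercept
    return vmax, slope * vmax


# Hypothetical constants and concentrations, for illustration only.
TRUE_VMAX = 10.0   # arbitrary rate units
TRUE_KM = 0.5      # mM
substrate_mM = [0.1, 0.25, 0.5, 1.0, 2.0, 4.0]
rates = [michaelis_menten(s, TRUE_VMAX, TRUE_KM) for s in substrate_mM]

fit_vmax, fit_km = lineweaver_burk(substrate_mM, rates)
print(f"fitted Vmax = {fit_vmax:.3f}, fitted Km = {fit_km:.3f} mM")
```

With noise-free data the fit returns the input constants. With real measurements, the reciprocal transform amplifies error at low substrate concentrations, so a direct nonlinear fit is often preferred for the final parameter estimates.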
Our data combined with results reported in the literature suggest that BioH represents a novel carboxylesterase in E. coli. E. coli is known to express at least three other proteins with carboxylesterase activity: carboxylesterase YbaC (25), thioesterase TesA (26,27), and thioesterase TesB (28). BioH shows no significant sequence similarity with these enzymes (data not shown). BioH also possesses different enzymological properties compared with the other enzymes. Compared with BioH, YbaC and TesA exhibit higher affinities for the long chain fatty acids, pNP-octanoate (C8) and pNP-decanoate (C10). The specific activity of BioH for short C2 or C3 substrates was in the same range as for YbaC and at least 10-30 times lower than for TesA. Both BioH and TesA also displayed thioesterase activity with palmitoyl-CoA (although ~13 times lower for BioH) but showed no activity with acetyl-CoA as a substrate. The ratio of carboxylesterase/thioesterase activities (with pNP-palmitate/palmitoyl-CoA) was 0.3 for TesA and 1.3 for BioH.
The specificity for the short chain fatty acid esters likely arises from the fact that the catalytic site of BioH is buried between two domains (Fig. 1) and is not readily accessible for bulkier compounds. Substrates with acyl chain length of up to 6 carbons (C-2-C-6) could be accommodated within the hydrophobic crevice in the V-shaped cap domain of BioH where the invariant Phe 143 (Fig. 1B) can act as a facilitator of binding. In fact, the walls of the active site are quite hydrophobic; therefore binding of acyl substrates to BioH is likely to be mediated mostly by hydrophobic interactions, and the active site is sufficiently large to accommodate short chain substrates with very similar affinities for the C-2-C-6 range. This is consistent with the observation that BioH shows essentially the same Km for C-2-C-6 substrates (Table III).
A Possible Role for BioH in Biotin Biosynthesis-In microorganisms and plants, biotin is synthesized from pimeloyl-CoA by the enzymes BioF, BioA, BioD, and BioB in a conserved four-step reaction (29-31). In the Gram-negative bacteria, such as E. coli, pimeloyl-CoA is produced from L-alanine and/or acetate (32) using the BioC and BioH proteins (33), whose exact biochemical roles have not been elucidated. The bioC gene is widely distributed in bacteria, whereas bioH is not found in many bioC-containing bacterial genomes; in these organisms, bioH appears to be complemented by other genes (bioG and bioK) (34). In some Gram-positive bacteria, such as Bacillus sphaericus and Bacillus subtilis, pimeloyl-CoA is produced from pimelic acid by pimeloyl-CoA synthetase (BioW) (35,36). Efforts to identify the precursors of pimeloyl-CoA in E. coli using ¹³C NMR labeling studies have been inconclusive (32,37) but preclude a pimelic acid intermediate. Most studies support a mechanism based on the condensation of acetyl-CoA or malonyl-CoA moieties into pimeloyl-CoA (38). Consistent with this model, Lemoine et al. (22) identified two Gly-Xaa-Ser-Xaa-Gly motifs in BioH that are characteristic of acyltransferase and thioesterase proteins. BioH was suggested to transfer pimeloyl units from BioC directly to CoA, and the E. coli BioC protein may function as an acyl-carrier protein involved in pimeloyl-CoA synthesis. The discovery of a BioH-CoA complex (by liquid chromatography-mass spectrometry) (39) supports a role for BioH as a CoA donor to a pimeloyl-acyl-carrier protein (or pimeloyl-BioC), releasing pimeloyl-CoA.
Our biochemical and structural data are consistent with the current model of the BioH reaction, which proposes that BioH transfers pimeloyl units from pimeloyl-BioC to CoA (22) and therefore should possess both esterase (carboxylesterase or thioesterase) and acyltransferase activities. We demonstrated that purified BioH shows carboxylesterase and low thioesterase activities and that BioH cannot use free pimelic acid for pimeloyl-CoA synthesis. Therefore, we propose that the function of BioH is to condense CoA and pimelic acid into pimeloyl-CoA. Several surface residues, Arg 138 , Arg 142 , Arg 155 , Arg 159 , and Lys 162 , which are disordered in the crystal structure but are nevertheless conserved throughout many bacteria, could potentially mediate CoA binding. It is also possible that BioC, which is proposed to function as a specific pimeloyl-acyl-carrier protein in the synthesis of pimeloyl-CoA (22), may interact with BioH and facilitate the delivery of a pimeloyl unit to the BioH catalytic site.
Perspective-Three-dimensional structures are now being generated for many proteins of unknown function. In many cases, such as for BioH, the structural data combined with existing clues in the literature, or even the intuition of an experienced investigator, can point the experimentalist in the right direction to identify and confirm biochemical function. However, as structural proteomics efforts gain momentum, there will be an increase in the number of protein structures for which there is no existing body of literature. The annotation of these proteins will demand methods that do not depend on specialists who are experts in a specific area of biology. The three-dimensional structure of BioH was analyzed using several automated methods for structural comparison and also with a series of generic enzymatic assays. This approach enabled us to rapidly characterize BioH enzymatic activity and reveal a new enzyme in E. coli. The development and refinement of these combined methods will significantly increase the value of structural genomics/proteomics results in the future by assigning biochemical or enzymatic functions to complement the structural information.