Sequence, Structure, and Evolution of Cellulases in Glycoside Hydrolase Family 48*

Background: Cellulases are non-homologous isofunctional enzymes, which prevents their unambiguous identification in genomic data sets. Results: Cellulases from glycoside hydrolase family 48 have distinct evolutionarily conserved sequence and structural features. Conclusion: Conserved sequence/structure features can be used to differentiate cellulases from non-cellulases in genomic data sets. Significance: Unambiguous identification of cellulases in genomic data is critical in searching for novel cellulolytic activities needed for bioenergy research. Currently, the cost of cellulase enzymes remains a key economic impediment to commercialization of biofuels (1). Enzymes from glycoside hydrolase family 48 (GH48) are a critical component of numerous natural lignocellulose-degrading systems. Although computational mining of large genomic data sets is a promising new approach for identifying novel cellulolytic activities, current computational methods are unable to distinguish between cellulases and enzymes with different substrate specificities that belong to the same protein family. We show that by using a robust computational approach supported by experimental studies, cellulases and non-cellulases can be effectively identified within a given protein family. Phylogenetic analysis of GH48 showed non-monophyletic distribution, an indication of horizontal gene transfer. Enzymatic function of GH48 proteins coded by horizontally transferred genes was verified experimentally, which confirmed that these proteins are cellulases. Computational and structural studies of GH48 enzymes identified structural elements that define cellulases and can be used to computationally distinguish them from non-cellulases. We propose that the structural element that can be used for in silico discrimination between cellulases and non-cellulases belonging to GH48 is an ω-loop located on the surface of the molecule and characterized by highly conserved rare amino acids. These markers were used to screen metagenomics data for “true” cellulases.

The recent exponential growth of genomic data presents a unique opportunity to search for novel cellulolytic activities. However, the absence of a clear understanding of structural and functional features that are critical for decisive computational identification of cellulases prevents their identification in these data sets. True cellulases are defined as enzymes that show biochemical activity on cellulose substrates (i.e. crystalline or amorphous cellulose). Strikingly, all known cellulases have homologs that have similar protein folds and even amino acid sequences but do not show biochemical activity on cellulosic substrates (2), which makes computational-only identification of true cellulases error-prone. Glycoside hydrolase family 48 (GH48) 4 is one of the many families defined in the CAZy (Carbohydrate-Active EnZymes) database (3) that contains biochemically confirmed cellulases. Furthermore, GH48 cellulases are considered the key component of various cellulolytic systems (4 -6). They are highly expressed in cellulolytic bacteria, such as Clostridium cellulolyticum, Clostridium cellulovorans, Clostridium josui, Clostridium thermocellum, and many others (4). In C. thermocellum, a bacterium that exhibits one of the highest rates of cellulose degradation among all known cellulolytic bacteria, GH48 cellulases are up-regulated during growth on crystalline cellulose (4). Hence, these enzymes become the most abundant subunits in the C. thermocellum cellulosome, a complex of enzymes highly efficient in cellulose degradation (4,7). Notably, complete knockout of both GH48 enzymes in C. thermocellum leads to a significant decrease in performance but does not completely abolish cellulolytic activity (4), whereas knockout of the GH48 gene in Ruminococcus albus (5) leads to nearly complete loss of cellulase activity.
Usually, only one (or rarely two or three) gene(s) encoding GH48 enzymes can be found in the genomes of cellulose-degrading bacteria (6), whereas genes for GH5 and GH9 cellulases are present in much higher numbers (8,9). Interestingly, GH48 cellulases often act in synergy with GH9 cellulases, which increases their catalytic activity dramatically (10), a feature that may be utilized for industrial application of these enzymes (e.g. "designer cellulosomes") (11).
Experimental studies revealed that some GH48 cellulases have only cellulolytic activity and thus cannot hydrolyze other substrates (i.e. xylan and mannan) (12). A few GH48 cellulases have mixed substrate specificity (e.g. they are capable of degradation of xylan (13) or ␤-glucan (14) in addition to cellulose). There are two GH48 enzymes from the beetle Gastrophysa atrocyanea that are unable to hydrolyze cellulose-containing substrates (e.g. Avicel, carboxymethylcellulose, acid-swollen cellulose, etc.), whereas they showed distinct enzymatic activity toward chitin (15) (supplemental Table S1).
Previous genomic studies have shown that GH48 enzymes are found in fungi as well as in bacteria, including Clostridia, Bacilli (both Firmicutes), and Actinobacteria. However, the presence of the GH48 cellulase (16) in the evolutionarily distant deltaproteobacterium, Myxobacter sp. AL-1, was never explained.
Here we report evolutionary studies of GH48 enzymes, present a crystal structure of the GH48 enzyme encoded by a horizontally transferred gene, and characterize structural and functional differences between cellulases and chitinases in this group of enzymes. We also show that our computational approach can be used to search for true GH48 cellulases in metagenomic databases.

EXPERIMENTAL PROCEDURES
Bioinformatics Software and Computer Programming Environment-The following software packages were used in this study: HMMER version 3.0 (17), MAFFT version 6.0 (18), MEGA version 5.0 (19), Jalview version 2.7.0 (20), and BLAST version 2.2.17 (21). All multiple-sequence alignments were built in MAFFT with its L-INS-i algorithm. All maximum likelihood phylogenetic trees were built in PhyML (22) with LG ϩ 4 ϩ F parameters. Symmetrical best hits (SymBets) were assigned using the BLAST algorithm.
All computational analyses were performed in a local computing environment, and custom scripts for data analysis were written in BioPerl. A remote version of the NCBI non-redundant database was used for direct queries using BioPerl scripts. A local version of the same database was used for querying with the hmmsearch algorithm of the HMMER package.
Data Sources and Literature Analysis-National Center for Biotechnology Information (NCBI) non-redundant database (nr) in FASTA format as of April 2011 was retrieved. A hidden Markov model of glycol_hydro_48 (PF02011) was retrieved from the Pfam database vPfam26 (23). Structures of Cel48S from C. thermocellum (24) and Cel48F from C. cellulolyticum (25) were retrieved from the Protein Data Bank (26) .
Glycoside hydrolases of family 48 (classification according to the CAZy database (3)) with known activity were identified from the literature (supplemental Table S1) and then mapped on the phylogenetic tree of GH48 enzymes in order to place the functional knowledge into the taxonomic context. Enzymes were considered to be of demonstrated cellulolytic function if their activity had been analyzed by in vitro biochemical studies.
Multiple Sequence Alignment and Construction of Phylogenetic Tree-183 GH48 protein sequences were retrieved from the NCBI nr database using hmmsearch of the HMMER package (17) with Pfam gathering threshold and Pfam domain model glycol_hydro_48 (Ͼ600 amino acid residues). Then GH48 enzymatic domains corresponding to the Pfam model were excised from the protein sequences using BioPerl scripts and further analyzed. 69 domain sequences were found to be too short (Ͻ300 amino acid residues) and thus were discarded to improve the quality of the subsequent studies. 114 GH48 sequences were taken into further analysis.
A multiple sequence alignment (MSA) of 114 GH48 domains was constructed in MAFFT. The resulting alignment was used to build a maximum likelihood tree in PhyML. The conservation pattern in the MSA was analyzed in Jalview (20) with underlying tools, and the phylogenetic tree was analyzed using the MEGA5 package. Taxonomy assignments for the proteins on the tree were taken from GenPept records from the NCBI protein database.
Identification of Orthologs, Paralogs, and Horizontally Transferred Genes-Because GH48 is typically present as one copy per genome, we assigned as orthologs all GH48 protein sequences that (i) form a monophyletic clade on the tree; (ii) were present as a single copy per genome; (iii) come from phyla with the same common ancestor after species-proteins tree topology reconciliation (Firmicutes, Actinobacteria, and Chloroflexi); and (iv) were characterized by symmetrical best matches (SymBets). Similar GH48 sequences that were present in two or more copies per genome were assigned as paralogs.
Horizontally transferred genes were defined in two ways: (i) by means of phylogenetic studies, where horizontally transferred genes were assigned based on phyletic distribution on the tree (27), and (ii) by means of a probabilistic approach (27), where the probability of occurrence of GH48 genes in prokaryotic genomes was calculated as the percentage of genomes containing GH48 genes divided by the total number of the available genomes, assuming that each genome contains only one GH48 gene (Table 1).
Metagenomic Data Analysis-We analyzed a publicly available data set of protein sequences from microbial communities in 295 metagenome samples retrieved from JGI/M (28) as of October, 2011 and the cow rumen data set from Ref. 29. Sequences encoding glycoside hydrolase family 48 proteins (glycol_hydro_48 (PF02011) Pfam domain model) were collected from metagenome data sets with hmmsearch. Duplicate sequences were removed. Cloning, Expression, and Purification of Hahella chejuensis GH48-A codon-optimized pMAL expression plasmid obtained from DNA2.0 (Menlo Park, CA) containing the H. chejuensis catalytic domain was transformed into Escherichia coli (BL21) (Agilent, Santa Clara, CA) and overexpressed at 37°C in the presence of 0.3 mM IPTG. The recombinant fusion protein contained a C-terminal maltose-binding domain and was purified using an amylose high flow resin (New England Biolabs, Ipswich, MA). The eluted fusion protein was then cleaved using a Genenase protease site incorporated into the sequence (New England Biolabs). The H. chejuensis GH48 module was further purified by anion chromatography on a source 15Q column (GE Healthcare), using buffers A (20 mM Tris, pH 6.8) and B (20 mM Tris, pH 6.8, 2 M NaCl). Minor impurities were removed by size exclusion chromatography using HiLoad Superdex 75 (26/60) (GE Healthcare) in 20 mM acetate buffer, pH 5.0, containing 100 mM NaCl and 1 mM sodium azide. The purified protein solution was concentrated with a Vivaspin 5K concentrator (Vivaproducts, Littleton, MA), and its concentration was measured using the BCA assay (Pierce).
Model Substrate and Pretreated Biomass-Avicel (PH101), and phosphoric acid-swollen cellulose generated from Avicel, were used to evaluate the cellulolytic efficiency of H. chejuensis GH48. To provide a basis for the maximum theoretical sugar yield achievable from each substrate during enzymatic hydrolysis, portions of each of the pretreated solid samples were dried and subjected to the standard two-stage sulfuric acid hydrolysis method for determining structural carbohydrates in lignocelluloses, as described by Sluiter et al. (30). In this method, the carbohydrate content of each pretreated sample is calculated from the carbohydrates released. In both cases, it is ϳ95% glucan.
Enzymatic Digestion Assays-GH48 activity was determined at 45°C, at an enzyme loading of 15 mg/g glucan Avicel or 80 mg/g glucan phosphoric acid-swollen cellulose in 20 mM acetate buffer, pH 5.5, containing 10 mM CaCl 2 and 100 mM NaCl. The assay slurry was mixed by inversion. Digestions were run continuously for 7 days, and sugar release was monitored by removing aliquots. Samples taken at various time points and the enzymes were inactivated by boiling for 15 min. Samples were then filtered through 0.45-m Acrodisc syringe filters and analyzed for glucose and cellobiose by HPLC. Samples were injected at 20 l and run on an Agilent 1100 HPLC system equipped with a Bio-Rad Aminex HPX-87H 300 ϫ 7.8-mm column heated to 55°C. A constant flow of 0.6 ml/min was used with 0.1 M H 2 SO 4 in water as the mobile phase to give separation of the analytes. Glucose and cellobiose were quantified against independent standard curves. All experiments were performed in triplicate, and the resulting extents of conversion are shown as percentage of glucan converted.
CD Methods-CD measurements were carried out using a Jasco J-715 spectropolarimeter with a jacketed quartz cell with a 1.0-mm path length. The cell temperature was controlled to within Ϯ0.1°C by circulating 90% ethylene glycol using a Neslab R-111m water bath (Neslab Instruments, Portsmouth, NH) through the CD cell jacket. The results were expressed as mean residue ellipticity, [e] mrw . The spectra obtained were averages of five scans. The spectra were smoothed using an internal algorithm in the Jasco software package, J-715 for Windows. Protein samples were studied in 20 mM sodium acetate buffer, pH 5.0, with 100 mM NaCl at a protein concentration of 0.25 mg/ml for the near-UV CD. Thermal denaturation of different constructs was monitored by CD in the near-UV (190 -260 nm) region. For the analysis of thermal stability, the temperature was increased from 25 to 60°C with a step size of 0.2°C and monitored at a wavelength of 222 nm.
Crystallization-H. chejuensis GH48 (YP_433697) crystals were obtained with sitting drop vapor diffusion using a 96-well plate with Crystal Screen HT from Hampton Research (Aliso Viejo, CA). 50 l of well solution was added to the reservoir, and drops were made with 0.2 l of well solution and 0.2 l of protein solution using a Phoenix crystallization robot (Art Robbins Instruments, Sunnyvale, CA). The crystals were grown at 20°C with 0.05 M potassium phosphate monobasic and 20% (w/v) polyethylene glycol 8000 as the well solution. The protein solution contained 15 mg/ml protein, 20 mM acetic acid, pH 5, 100 mM NaCl, and 10 mM CaCl 2 .
Data Collection and Processing-The H. chejuensis crystal was flash-frozen in a nitrogen gas stream at 100 K before home source data collection using a Bruker X8 MicroStar x-ray generator with Helios mirrors and a Bruker Platinum 135 CCD detector. Data were indexed and processed with the Bruker Suite of programs version 2011.2-0 (Bruker AXS, Madison, WI).
Structure Solution and Refinement-Intensities were converted into structure factors, and 5% of the reflections were flagged for R free calculations using the programs F2MTZ, Truncate, CAD, and Unique from the CCP4 package of programs (31). The GH48 structure was solved using molecular replacement with the program Molrep (32) with Protein Data Bank entry 1G9G as a model. ARP/wARP (33) version 7.0 and Coot (34) version 0.6.2 were used for multiple cycles of automatic and manual model building. Further refinement and manual correction was performed using REFMAC5 (35) version 5.6.0117 and Coot. The MOLPROBITY method (36) was used to analyze the Ramachandran plot, and root mean square deviations of bond lengths and angles were calculated from ideal values of Engh and Huber stereochemical parameters (37). Wilson B-factor was calculated using CTRUNCATE (31) Table S2.

Phyletic Distribution of GH48 Sequences and Horizontal
Gene Transfer-GH48 enzymes that were retrieved from databases belong to only four prokaryotic phyla (Actinobacteria, Firmicutes, Chloroflexi, and Proteobacteria) and only two eukaryotic phyla (Fungi and Arthropoda), indicating a rather unusual evolutionary history. Taking into account that Firmicutes, Actinobacteria, and Chloroflexi (i) probably shared a common ancestor (38), (ii) showed GH48 enrichment compared with other phyla (Table 1), and (iii) contained a significant number of biochemically confirmed GH48 cellulases while lacking any confirmed non-cellulases, we hypothesize that the GH48 cellulase originated in the last common ancestor of Firmicutes, Actinobacteria, and Chloroflexi. Therefore, we have first analyzed sequences only from these three phyla that satisfied two additional criteria: (i) they were present as the only GH48 gene in a genome, and (ii) they showed many-to-many symmetrical best hits (SymBets) relationships (39). As a result, 65 sequences, which included 12 biochemically confirmed cellulases, were taken into further analysis and aligned. The maximum likelihood tree constructed from this alignment was monophyletic (i.e. sequences from the same phylum were found in a single clade). In the next step, we determined the conserved residues in the alignment and found that all functionally important sites (including folding and substrate binding) were invariably conserved (supplemental Table S3).
Because paralogs typically have a similar but not identical function, we asked whether paralogous GH48 sequences may represent enzymes with different substrate specificity. If so, they should show differences in some of the highly conserved sites, especially those implicated in substrate binding. Surprisingly, we found that paralogous GH48 sequences in genomes of Firmicutes and Actinobacteria were nearly identical (90 -98% identity) and retained all conserved residues that were identified in the set of orthologous sequences. It appears that the functional innovation in paralogs resides not in the catalytic domain but in the repertoire of their auxiliary domains (Fig. 1).
The evidence of horizontal gene transfer emerges when a protein sequence from a particular organism shows high similarity to a homolog from a distant taxon (27). In the case of GH48, all sequences from Fungi were found in the middle of the Firmicutes clade, whereas all sequences from Insecta were found in the middle of the Actinobacterial clade (Fig. 2). This non-monophyletic distribution clearly suggests horizontal gene transfer into eukaryotes from the two prokaryotic phyla.
Thus, a total of 23 horizontally transferred genes were identified through phylogenomic analysis, where an implicitly defined (see above) set of orthologs showed the presence of non-monophyletic clades with representatives of Proteobacteria, Fungi, and Insecta (Fig. 2). Additionally, in prokaryotes, they were also identified by a probabilistic approach (27), where relative increases in abundance of GH48 genes in the genomes of Actinobacteria, Firmicutes, and Chloroflexi were compared with that of Proteobacteria, as described under "Experimental Procedures" (Table 1). Notably, Actinobacteria, Firmicutes, and Chloroflexi genomes had much higher probability of occurrence of GH48 genes compared with Proteobacteria, Fungi, and insects, which along with their distribution on the phylogenetic tree presents additional evidence for horizontal gene transfer into the latter. In summary, here we define all GH48 orthologs and paralogs from Actinobacteria, Firmicutes, and Chloroflexi as true cellulases based on phylogenomic analysis, which correlates with their experimentally confirmed enzymatic activities (Fig. 2 and supplemental Table S1).  NOVEMBER 30, 2012 • VOLUME 287 • NUMBER 49

JOURNAL OF BIOLOGICAL CHEMISTRY 41071
Cellulose Digestion by the Horizontally Transferred GH48-A comprehensive list of all biochemically studied GH48 cellulases is presented in supplemental Table S1). This list shows that previously studied cellulases are mostly present in Firmicutes and Actinobacteria, with a single representative of Proteobacteria (Myxobacter sp. Al-1). We determined the activity of the GH48 enzyme from a proteobacterium H. chejuensis, which was a subject of horizontal gene transfer, on both crystalline and amorphous substrates (supplemental Fig. S1). These results showed that H. chejuensis is a cellulase because it shows activity on the phosphoric acid-swollen cellulose substrate. The poor performance on the more crystalline substrate is probably due to the lack of the carbohydrate-binding module domains in our construct, which is critical for optimal performance on a crystalline substrate, such as Avicel.
Crystal Structure of H. chejuensis GH48-The structure of HcheGH48 was refined to a resolution of 1.75 Å with R and R free of 0.154 and 0.205, respectively. There is one molecule in the asymmetric unit in complex with a cellobiose molecule bound at the product position. It has an (␣/␣) 6 barrel fold with one calcium and two sodium atoms and multiple ethylene glycol, glycerol, acetate, and phosphate molecules. Due to the long crystallization time (more than 1 year), two residue modifica-tions were observed: a 2-oxohistidine at position 352 and polyethylene glycol modification of Tyr-439. There were two outliers in the Ramachandran plot, Glu-72 and Ala-73, both of them well defined by the density and close to the allowed region.
Structural Comparison with Other Known GH48s-Pairwise secondary structure matching of structures with at least 70% secondary structure similarity by PDBefold (40) found 22 unique structural matches for HcheGH48 from the Protein Data Bank. All similar structures were CelF, CelS, or CelA GH48 variants with secondary structure similarity between 79 and 88%. Closer inspection of the structure shows that the overall fold (Fig. 3)   Closer inspection of the -loop shows that it is defined by two anchor residues, Trp-508 and Asn-516 (Fig. 3). Comparison with C. cellulolyticum CelF, C. thermocellum CelS, and C. bescii CelA GH48 structures shows that these residues are conserved and have identical conformation in all four structures. The -loop of HcheGH48 differs from the others by having a proline at position 523, causing a local conformational change, where the other structures have a tyrosine, which further anchors the loop. This, however, does not change the overall conformation or position of the loop but does suggest that variability in the loop is possible without affecting activity.
Conserved Amino Acid Positions in the GH48 Family in the Context of Structure-We used sequence numbering of Cel48F from C. cellulolyticum H10 to designate amino acids in all multiple-sequence alignment studies because it is the most extensively studied GH48 structure currently available (25,41,42). Literature and MSA analysis showed that all GH48 enzymes have 100% conserved catalytic acid and base positions (Glu-55 and Asp-230 in Cel48F); thus, these residues are not discussed.
There are three major types of amino acids that participate in substrate recognition and correct folding of the GH48 enzymes: hydrophobic stacking interactions, hydrogen bonding, and calcium coordination residues (supplemental Table S3) (24,25,41,42). All of these residues are highly conserved in orthologs from Actinobacteria, Firmicutes, and Chloroflexi as well as in Proteobacteria, which indicates that genes horizontally transferred to Proteobacteria code for cellulases, a statement confirmed biochemically (this work and see Ref. 16). Our results also revealed that the GH48 enzyme from H. chejuensis does not possess any additional elements that would differentiate it from other cellulases.
Consequently, we hypothesize that fungal GH48s are also cellulases due to their high sequence similarities with cellulolytic orthologs and the fact that almost all residues important for catalysis (supplemental Table S3) are highly conserved in fungi (supplemental Table S4) with only one exception, the Ca 2ϩ coordination residues, which were considered to play a role in the thermal stability of GH48 enzymes (24) but not in substrate specificity. In contrast, GH48 enzymes from all insects are represented by non-cellulases because of the large number of amino acid substitutions in positions that are conserved among cellulases, one -loop deletion, and the lack of cellulolytic activity confirmed biochemically (15).
Mutations in critical positions were not found in all sequences from insects (supplemental Table S4). Thus, MSA and structural analyses suggested that the major difference between cellulases and non-cellulases (i.e. chitinases) from insects is the additional -loop located between Pro-469 and Ala-482 (as in Cel48F) in all cellulases. This -loop includes two residues highly conserved in all cellulolytic orthologs (99 -100% conservation): Trp-472 and Asn-481. Residue Leu-484 (as in Cel48F), located adjacent to the loop and strictly conserved in cellulolytic orthologs, is also mutated in all insects. This -loop is located on the surface of the GH48 molecule and connects two ␤-strands that form one side of the catalytic tunnel near the exit of the product (Fig. 3). Thus, here we report structural differences that occurred after an event of horizontal gene transfer from Actinobacteria to Insecta that caused mutation of cellulases to chitinases.
Screening Metagenomic Data Sets for GH48 Cellulases-Sequences of 211 GH48 proteins were retrieved from the combined metagenome data set (Ͼ79 million sequences) with hmmsearch of HMMER (17) and glycol_hydro_48 Pfam domain model with the Pfam gathering threshold. Then 36 duplicates were removed, and the remaining 175 sequences were used in protein BLAST queries. BLAST results showed that these sequences belong to the same major phyla as sequences belonging to well defined genomes that were retrieved from the NCBI nr database (Actinobacteria, Firmicutes, Chloroflexi, Proteobacteria, and insects (Arthropoda)) except for Fungi (Fig. 4). These results indicate that fungal species are either absent from the metagenomes used in this study  Defining Cellulase in the Glycoside Hydrolase Family 48 NOVEMBER 30, 2012 • VOLUME 287 • NUMBER 49 or significantly underrepresented. In summary, nine sequences from metagenomics samples belonged to insects and were classified as non-cellulases, and the other 166 sequences were classified as cellulases, based on the phylogenomic and structural evidence presented above.
To confirm the validity of this classification, 166 metagenomic GH48 sequences classified as cellulases were aligned by hmmalign of HMMER (17) with default Pfam parameters. MSA analysis (supplemental Table S5) showed that 93% of sequences have all of the residues important for protein folding and catalysis with very few conservative substitutions that were also found in some of the cellulolytic orthologs from complete genomes. A few non-conservative substitutions that were found in a small set of the sequences (ϳ7% of all) could indicate potential differences in function or could simply be sequencing/assembly errors, a rather common problem in metagenom-ics (43,44) Therefore, experimental evidence must be obtained to clarify this point.
Because metagenomic samples show a large variation in the total number of genes sequenced (e.g. a wastewater treatment plant metagenome has 30,169 genes, whereas a biofuel metagenome has 2,706,009 genes), the percentage of GH48 domains in each metagenome was calculated (Fig. 5). These metagenomes were also grouped together according to their habitats, and the percentage abundance of GH48 in each habitat was also calculated (Fig. 5).

DISCUSSION
Using a phylogenomic approach, we have determined that the GH48-type enzymes might have originated in a common ancestor of three closely related phyla: Firmicutes, Actinobacteria, and Chloroflexi (38). We have determined a number of FIGURE 5. Abundance of GH48 cellulases in metagenomes. A, percentage of GH48 sequences in each metagenome (abundance) was calculated by dividing the number of GH48 hits by the total number of genes in each metagenome. B, the abundance of GH48 sequences in different habitats. The normalized percentage of GH48 genes was calculated as the percentage of GH48 sequences in a given metagenome divided by the sum of the percentage of GH48 for all metagenomes.
gene duplication events in representatives of these phyla and several cases of horizontal gene transfer. For example, fungi received these genes horizontally from a representative of Firmicutes, whereas insects received these genes from a representative of Actinobacteria. Similarly, representatives of Proteobacteria also received their GH48 genes horizontally. By comparing orthologous sequences from Firmicutes, Actinobacteria, and Chloflexi, we identified a number of amino acid positions that are uniquely conserved in this group of organisms. Satisfactorily, the only activity that was previously found in this group is that of a cellulase. Thus, we suggest that conserved positions in the catalytic domains from Firmicutes, Actinobacteria, and Chloflexi can be used as a genomic signature for a GH48 cellulase.
We then wondered if this genomic signature for a cellulase remains intact in paralogs and horizontally transferred genes, because these types of genes often assume a slightly different function. For example, just one or a few mutations in a catalytic domain may lead to different substrate specificity. Notably, screening and study of paralogous sequences of GH48 proteins showed no significant differences in their catalytic domains but rather noticeable differences in their auxiliary domains (i.e. cellulose-binding domain, fibronectin type III-like domain, etc.). On the contrary, genes that were horizontally transferred from Actinobacteria to insects (Metazoa) acquired a new activity to hydrolyze chitin but lost the ability to degrade cellulose.
Following this initial evolutionary analysis, we extended our findings to structural analysis of GH48 enzymes. We found that all orthologs and paralogs have a 10 -14 residue -loop (Pro-469 to Ala-482 as in Cel48F) that has no counterpart in enzymes from insects. Moreover, this -loop is constituted by highly conserved amino acids (Trp-472 and Asn-481 as in Cel48F) and located on the surface of the molecule. Thus, in accord with the classical definition of -loops (45), it may play the following roles in this enzyme structure: folding, stability, or contribution to the dynamics of the enzyme during catalysis.
High conservation of the -loop residues in cellulases suggests its importance for the computational identification of cellulases, and the complete absence of the loop in all non-cellulases indicates that GH48 chitinases lost this structural element. We hypothesize that the absence of the loop in chitinases allows more conformational degrees of freedom in the active site tunnel upon binding of the substrate, which permits a bulkier chitin to "slide" freely. In contrast, cellulases may have more rigid structures "reinforced" by the -loop. Regardless of the exact role of the -loop, which can be determined only experimentally, we have suggested that it is important for cellulolytic activity, which has allowed us to design a strategy to identify new cellulases in metagenomic data.
Thus, phylogenomic and structural analyses of GH48 suggest that proteins from Actinobacteria, Firmicutes, Chloflexi, and Proteobacteria are indeed cellulases. Biochemical activities of GH48 proteins from two Pyromyces species have never been studied; thus, it is unknown whether they are cellulases. However, because these proteins are not only homologous to known cellulases but also contain all conserved amino acids identified in our analysis, it is very likely that they also possess cellulolytic activities. On the other hand, GH48s from insects, where only chitinolytic activities were detected experimentally, are noncellulases. Consequently, the existing Pfam model for GH48 can be used to retrieve true cellulases; however, there is one exception. GH48 proteins from insects should be annotated as non-cellulases. This approach allowed us to identify 166 true cellulases in the combined metagenomic data set of hundreds of environmental samples. The largest number of cellulases came from the metagenomes of "engineered" microbial communities, such as enriched samples or bioreactors (e.g. the "mixed alcohol bioreactor" and the "cellulolytic enrichment from sediment of Great Boiling Springs"). Most of the environmental cellulases come from communities that typically include saprophytes (46), such as soil, wastewater, ant fungal gardens, and the rhizosphere (Fig. 5), which is in agreement with previously published research (47,48). Interestingly, very few GH48 cellulases were identified in cow rumen microbial communities, which also correlates with previous extensive biochemical analysis of this classical cellulolytic community (29). Moreover, all of the GH48s from cow rumen, found in this study, belong to Ruminococcus flavefaciens, a highly specialized cellulose degrader. We hypothesize that because, collectively, major ruminal cellulolytic specialists are found to represent as little as 0.3% of the total bacterial population (49), and R. flavefaciens is typically one of the three most abundant cellulolytic bacteria in cow rumen (50), its GH48 gene was more selective for sequencing (51) when compared with the genes of other "rare" members of the community.

CONCLUSIONS
High-throughput computational screening for cellulases from genomic and metagenomic data sets is a challenge due to the absence of a clear understanding of structural and functional features that distinguish them from closely related enzymes with other substrate specificities (2). Here, we present a combined sequence-structure approach leading to the identification of clear markers that can be used to distinguish between cellulases and non-cellulases within the GH48 family. This approach was applied to identify "true" GH48 cellulases in large metagenomic data sets, illustrating its feasibility in the search for novel cellulolytic capabilities.
Finally, we propose that this approach can be generalized to define genomic signatures for identifying cellulases in other CAZy families (2), such as GH5, GH9, GH12, GH45, and GH61, that are known to contain biochemically confirmed cellulases.