Originally published In Press as doi:10.1074/jbc.M402925200 on May 17, 2004

J. Biol. Chem., Vol. 279, Issue 37, 38683-38692, September 10, 2004

Recognition of Fold and Sugar Linkage for Glycosyltransferases by Multivariate Sequence Analysis*

Maria L. Rosén‡, Maria Edman‡§, Michael Sjöström§, and Åke Wieslander‡

From the ‡Department of Biochemistry & Biophysics, Stockholm University, SE 106 91 Stockholm and the §Department of Chemistry, Organic Chemistry, Research Group for Chemometrics, Umeå University, SE 901 87 Umeå, Sweden

Received for publication, March 16, 2004, and in revised form, May 17, 2004.


    ABSTRACT
Glycosyltransferases (GTs) are among the largest groups of enzymes found and are usually classified on the basis of sequence comparisons into many families of varying similarity (CAZy systematics). Only two different Rossmann-like folds have been detected (GT-A and GT-B) within the small number of established crystal structures. A third uncharacterized fold has been indicated with transmembrane organization (GT-C). We here use a method based on multivariate data analyses (MVDAs) of property patterns in amino acid sequences and can with high accuracy recognize the correct fold in a large data set of GTs. Likewise, a retaining or inverting enzymatic mechanism for attachment of the donor sugar could be properly revealed in the GT-A and GT-B fold group sequences by such analyses. Sequence alignments could be correlated to important variables in MVDA, and the separating amino acid positions could be mapped over the active sites. These seem to be localized to similar positions in space for the α/β/α binding motifs in the GT-B fold group structures. Analogous active-site sequence positions were found for the GT-A fold group. Multivariate property patterns could also easily group most GTs annotated in the genomes of Escherichia coli and Synechocystis to the proper fold or organization group, according to benchmarking comparisons at the MetaServer. We conclude that the sequence property patterns revealed by the multivariate analyses seem more conserved than amino acid types for these GT groups, and these patterns are also conserved in the structures. Such patterns may also potentially define substrate preferences.


    INTRODUCTION
Glycosyltransferases (GTs)1 are one of the largest and most diverse enzyme groups in all living cells. This enzyme group performs many critical functions: GTs synthesize glycogen and other carbohydrate polymers, act on proteins that mediate cell-cell interactions, and glycosylate transcription regulators (1). Hence, the acceptors that GTs act on are highly diverse, with saccharides, lipids, proteins, and nucleic acids being the most common. Given the structural variation of the acceptors and donors that glycosyltransferases can use, and the low sequence similarity within the glycosyltransferases, a large diversity of folds has been expected for these enzymes (2). Glycosylhydrolases, enzymes performing the reverse reaction, have been found to have many different fold types (2), but so far only two different folds, named GT-A and GT-B, have been discovered among the solved crystal structures of the GTs (3). A third glycosyltransferase group (GT-C) has been discovered by iterative BLAST searches and by structural comparisons (4). These proteins are integral membrane proteins with the active site in a long loop and with between 8 and 13 transmembrane helices. The GT-C family can also be found by hidden Markov model searches within the GT families. This method has also identified a fourth family, unique to eukaryotes, named GT-D (5).

The GT-A fold consists of two tightly associated β/α/β domains, of varying sizes, with separated nucleotide (SGC) and acceptor binding domains (6). The majority of the proteins in this fold group have a short N-terminal cytoplasmic domain followed by a transmembrane (TM) segment, a stem region to reach out from the membrane, and finally the large globular enzyme part (3). The GT-B group has two similar, but less tightly associated, Rossmann-like β/α/β fold domains, and its members are frequently membrane-associated (7). However, only a very limited number of proteins from the GT-B group seem to have TM segments. The GT-A group presently consists of nine solved crystal structures from sequence families GT-2, GT-6, GT-7, GT-8, GT-13, and GT-43, and the GT-B group of seven structures from GT-1, GT-4, GT-20, GT-28, GT-35, GT-63, and GT-64, respectively, in the CAZy systematics sequence data base (available at afmb.cnrs-mrs.fr/~cazy/CAZY/index.html). Reaction mechanisms of both retaining and inverting types are included in both A and B fold groups (denoted clans). Within the GT-A fold family with the retaining mechanism, the stereochemistry of the C1 position of the donor sugar substrate is conserved and well studied (2, 8). The retaining mechanism is poorly understood and may consist of a few steps involving stable intermediates. Hence, there is no coupling between reaction mechanism and fold.

Most comparative studies of glycosyltransferases have been based on amino acid sequence comparisons using BLAST and similar methods, which mainly account for amino acid similarities and identities at specific positions. In the CAZy data base, glycosyltransferases have been divided into about 70 families based on such sequence similarities (9), over a comparatively short stretch of the protein. To be classified within the same family an E value of less than -3 over at least 100 amino acids is needed (9). The data base consists of both predicted ORFs and fully functionally determined proteins. GTs within the same sequence family can have highly diverse substrate specificities, e.g. members of family 2, but still share substantial sequence similarities (10). The opposite has also been recorded: enzymes having very low sequence similarity can utilize the same substrates (7) and have the same fold (11). There is often a low sequence similarity between different families, and the fold within a family is expected to be conserved for its members (9, 11, 12). The catalytic mechanisms are also expected to be conserved in each CAZy family (13). However, in general very few amino acid positions are conserved among the GTs, and the sequence similarities for the structures within the two established fold groups are surprisingly low. Furthermore, the number of predicted glycosyltransferases (from translated gene sequences) within an organism does not reflect the size of the genome. For example, Escherichia coli (K12) with a genome size of 4639 kb has 34 predicted glycosyltransferases according to CAZy, Bacillus subtilis (4215 kb) has 28 GTs, Synechocystis (3573 kb) has 61 GTs, and finally Mycoplasma pneumoniae (816 kb) has three predicted GTs. To predict the function of new or encoded glycosyltransferases, the ORF of interest needs to have a close sequence similarity with an enzyme of known function. Likewise, predicting the function of new enzyme groups using sequence similarity methods is almost impossible.

A new approach is needed. One possible way, used here, to search for new GTs is to look for physicochemical property patterns along the whole sequence. Protein sequences behave in manners far from random, and the amino acid sequence is organized to reflect the structure (14). Hence proteins with the same (or similar) structure or function are expected to have similar property patterns in the sequence, even at low sequence similarities, and over their full lengths or only partially (e.g. for certain domains). This is very evident for proteins with repeated motifs, e.g. TIM barrels (14, 15), and must be valid for others as well. Furthermore, for proteins with the same function but different folds, certain local patterns in the sequence determining the function can still be the same. GTs of a given family performing the same reaction would be expected to have similar properties. With the multivariate sequence analysis methods, based on amino acid properties, proteins with no sequence similarity can be grouped together, visualizing conserved property patterns within the proteins. The sequence is translated into values describing different amino acid properties, i.e. hydrophobicity, bulk volume, polarizability, and charge. The periodicity of properties along a sequence is calculated, followed by a multivariate analysis (16). Furthermore, sequence length variations are of less importance. This method has been used to predict the location of cellular proteins in Synechocystis,2 to characterize E. coli compartment proteins (17), and to classify signal peptides (18).

In the present study, as a first step, a reference set of glycosyltransferases from families with known structure members was compiled from enzymes classified by CAZy. The multivariate analysis method could conveniently separate the three different fold types from each other and furthermore divide the two different reaction mechanisms within the A and B structural fold groups using sequence information. The ability to predict the fold and the sugar orientation properly was also established. Furthermore, ORFs of unknown function that share no sequence homology with known glycosyltransferases could potentially be identified as glycosyltransferases with this method.


    EXPERIMENTAL PROCEDURES
Data Set—In this study 141 selected glycosyltransferases from 13 different CAZy families (see Supplementary Material) were used as reference sets. The data set mainly includes sequences with known function or with high sequence similarity with proteins with known function, where the probability for the same function is high. The reference sets include both prokaryotic and eukaryotic enzymes, using various substrates and acceptors. Full-length amino acid sequences were used, i.e. no signal peptide or anchor domains found especially in the GT-A group have been removed. The number of amino acids in the protein sequences varies between 229 and 994. This data set is non-redundant and contains no pair of sequences that share more than 55% identity.

Multivariate Sequence Analysis—The aim of this study is to search for and use periodic physical features in the proteins that separate GTs according to structure and reaction mechanism. Amino acids can be described and characterized in a number of ways. Parameters such as retention times in different chromatographic systems, electric properties, molecular mass, etc. are commonly used, as described by Wold et al. (16). To decrease the number of parameters that describe each amino acid, and at the same time keep the information in a few variables, so-called z scales were used. These z scales are derived from 29 physico-chemical experimental parameters for the amino acids with principal component analysis (19). They can approximately be translated as z1 for "hydrophobicity/hydrophilicity," z2 for "bulk of side-chain," and z3 for "polarizability/charge" (see Table I). To describe the periodicities in a protein, auto- and cross-covariances (ACC) of the z values were used (16). The ACC program multiplies the first z value for the first amino acid with that of the second, then the first with the third, and so on up to the highest lag. The same procedure was performed with the second and the third z scales and all combinations of them (see Fig. 1). The ACC terms are the average values for every interaction term at different lags, hence z(1)1 x z(1)2, z(1)1 x z(2)2, z(1)1 x z(3)2, etc. For short peptides such as signal peptides, the size of the maximal lag is dependent on the shortest peptide in a set, but for large polypeptides such as glycosyltransferases, the optimal lag seems to be between 15 and 25 amino acids. After optimization, a window size of 19 was chosen in this study, which gives 171 variables (19 x 3 x 3) for each sequence. Autocovariances with lag = 1, 2, 3... L were calculated by Equation 1,

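$$\mathrm{ACC}_{jj}(\mathrm{lag}) = \sum_{i=1}^{n-\mathrm{lag}} \frac{z_{j,i} \times z_{j,i+\mathrm{lag}}}{n-\mathrm{lag}}$$ (Eq. 1)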


TABLE I
Descriptor scales for the coded amino acids

Derived by a principal component analysis of 29 physico-chemical properties for the amino acids (19). z1 reflects "hydrophobicity/hydrophilicity," z2 side-chain "bulk volume," and z3 "polarizability and charge," respectively.

 



FIG. 1.
The steps used in analyzing protein sequences with an ACC description of the periodicity of the protein sequence followed by PLS-DA analysis. A, proteins are collected from CAZy with a known fold type; B, the amino acids are translated into the three z scales (from Table I) and are used for calculation of autocovariance and cross-covariance variables (C), and the data are collected in a data matrix (D). A discriminant vector for each class is constructed and used as Y in the PLS-DA (e.g. y (GT-A) and y (GT-B)).

 
where index j is used for the z scales (j = 1, 2, 3), index i is the amino acid position (i = 1, 2,... n), and n is the number of amino acids in the sequence (cf. Fig. 1). The cross-covariances between two different scales j and k (note the difference between ACCjk and ACCkj) are given by Equation 2.

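$$\mathrm{ACC}_{jk}(\mathrm{lag}) = \sum_{i=1}^{n-\mathrm{lag}} \frac{z_{j,i} \times z_{k,i+\mathrm{lag}}}{n-\mathrm{lag}}$$ (Eq. 2)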
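The ACC description above can be sketched in a few lines of Python. This is an illustrative sketch only, not the ACC program used in the study: it assumes that each term is the plain average of the products over the n - lag available residue pairs, and the descriptor values in `TOY_Z` are hypothetical placeholders for a four-letter toy alphabet, not the z scales of Table I. The names `TOY_Z` and `acc_descriptor` are invented for this example.

```python
from itertools import product

# Hypothetical three-component descriptors for a toy four-letter alphabet.
# These numbers are placeholders, NOT the z-scale values of Table I.
TOY_Z = {
    "A": (0.1, -1.7, 0.1),
    "L": (-4.2, -1.0, -1.0),
    "K": (2.8, 1.4, -3.1),
    "D": (3.6, 1.1, 2.4),
}

def acc_descriptor(sequence, scales=TOY_Z, max_lag=19):
    """Return a dict mapping (j, k, lag) to the averaged product
    sum_i z[j][i] * z[k][i + lag] / (n - lag), with j, k in 1..3."""
    z = [scales[aa] for aa in sequence]      # n rows of (z1, z2, z3)
    n = len(z)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    n_scales = len(z[0])
    acc = {}
    for lag in range(1, max_lag + 1):
        for j, k in product(range(n_scales), repeat=2):
            # Scale j is read at position i, scale k at position i + lag,
            # so the (j, k) and (k, j) terms generally differ.
            total = sum(z[i][j] * z[i + lag][k] for i in range(n - lag))
            acc[(j + 1, k + 1, lag)] = total / (n - lag)
    return acc

terms = acc_descriptor("ALKDALKDLLKDAAKDLKDALKDDA")
print(len(terms))  # 19 lags x 3 x 3 scale pairs
```

Because every sequence is reduced to the same fixed set of terms regardless of its length, sequences of different lengths can be placed as rows of one data matrix.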

Partial least squares projections to latent structures discriminant analysis (PLS-DA) finds the relationships between an X matrix and a Y matrix, i.e. it finds the relationship between sequence properties of the X matrix and a Y matrix defining the group membership (20). In this study the Y matrix was composed of dummy variables, hence a value of 1 was given to members of the same group and 0 to non-members. Whereas the Y matrix in ordinary PLS is composed of responses to the variables in the X matrix, PLS-DA uses class membership as Y, here fold group or reaction mechanism. PLS-DA uses the assumption that sequences belonging to the same class have common features and therefore will behave similarly in the analysis, as visualized in the score plot. This method can also be used to predict relationships for new unclassified sequences. In PLS-DA a multidimensional space is formed where every variable (e.g. z(1)1z(1)2, z(1)1z(2)2, z(1)1z(3)2) represents one dimension and every object (data from one sequence) is a point in this space. For the reference set, this means 141 points in a 171-dimensional space. To get the first PLS component, a line that best approximates the data (best separates the objects according to group classification) is fitted to the multidimensional space, and the data points are projected onto this line to get coordinate values or scores. To get the second component, a new line is fitted to the data space that describes the second largest part of the variation. Only two components are plotted at one time, in this work the first, best separating dimension and the second. The PLS model contains information regarding both the relationship among the objects (scores) and the contribution of the variables to the model (PLS weights). A weight plot shows the contribution of the variables to the separation, i.e. which periodic features are responsible for the separation between the fold structures or reaction mechanisms. These features can then be searched for on the sequence level. The objects best separated, hence localized far away from the center of the plot, are the ones best described by the variables with the largest weight. Such variables are therefore most likely to be identified at the sequence level in these proteins.
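The projection idea behind the first PLS component can be illustrated with a minimal sketch. It assumes a single dummy response column, for which the first weight vector is proportional to X'y on mean-centred data; further components, scaling of the variables, and prediction of new objects are omitted, so this is not the full PLS-DA procedure used in the study. The toy matrix and the function names are hypothetical.

```python
import math

def dummy_y(labels, target):
    """PLS-DA response: 1 for members of the target class, 0 otherwise."""
    return [1.0 if lab == target else 0.0 for lab in labels]

def center_columns(rows):
    """Subtract the column mean from every column of a row-major matrix."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def first_pls_component(X, y):
    """Return (weights, scores) of the first PLS component for one response.

    With a single y column the weight vector is proportional to X'y,
    computed on mean-centred data."""
    Xc = center_columns(X)
    y_mean = sum(y) / len(y)
    yc = [v - y_mean for v in y]
    w = [sum(x * v for x, v in zip(col, yc)) for col in zip(*Xc)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores: projection of every object onto the weight direction.
    t = [sum(x * wi for x, wi in zip(row, w)) for row in Xc]
    return w, t

# Toy data: four "sequences" described by three made-up ACC variables.
X = [[1.0, 0.2, 5.0], [0.9, 0.1, 4.0], [-1.1, 0.3, 5.5], [-0.8, 0.2, 4.5]]
labels = ["GT-A", "GT-A", "GT-B", "GT-B"]
w, t = first_pls_component(X, dummy_y(labels, "GT-A"))
print([round(v, 2) for v in t])   # scores, one per object
print([round(v, 2) for v in w])   # weights, one per variable
```

In this toy example the two classes receive scores of opposite sign, and the variable with the largest absolute weight is the one that differs most between the classes, which mirrors how the score and weight plots are read.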

To evaluate the complexity, i.e. the number of PLS components to use in the model, cross-validation was performed, in which all objects are withdrawn in turn, here 1/10 at a time, and their y values are predicted from an updated model based on the remaining 9/10 of the objects. This procedure was repeated ten times. A Q(cum)2 value was calculated that describes how much of the variance in the Y matrix can be predicted by the model. To obtain a perfect score of 1, all objects should be predicted back to the exact position given by the Y matrix. A Q(cum)2 larger than 0.1 corresponds to a 95% significance of the model (21).
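The cross-validation scheme can be sketched as follows. The sketch assumes the common definition Q2 = 1 - PRESS/SS, where PRESS is the sum of squared cross-validated prediction errors and SS the sum of squared deviations of y from its mean; the exact formula and fold assignment used by the authors' software are not given in the text. A one-variable least-squares line stands in for the PLS model so that the example stays self-contained, and all function names are hypothetical.

```python
def fold_indices(n_objects, n_folds=10):
    """Split object indices 0..n-1 into n_folds interleaved groups."""
    return [list(range(f, n_objects, n_folds)) for f in range(n_folds)]

def q2(y_true, y_cv_pred):
    """Q2 = 1 - PRESS / SS, where PRESS sums squared cross-validated
    prediction errors and SS sums squared deviations of y from its mean."""
    mean_y = sum(y_true) / len(y_true)
    press = sum((y - p) ** 2 for y, p in zip(y_true, y_cv_pred))
    ss = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - press / ss

def cross_validated_q2(x, y, fit, predict, n_folds=10):
    """Leave out one fold at a time, refit on the rest, predict the left-out y."""
    y_pred = [0.0] * len(y)
    for held_out in fold_indices(len(y), n_folds):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            y_pred[i] = predict(model, x[i])
    return q2(y, y_pred)

# Stand-in model (NOT PLS): a one-variable least-squares line.
def fit_line(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def predict_line(model, x_value):
    slope, intercept = model
    return slope * x_value + intercept

x = [float(i) for i in range(20)]
y = [0.0] * 10 + [1.0] * 10          # dummy class variable
print(round(cross_validated_q2(x, y, fit_line, predict_line), 3))
```

Predicting every left-out object exactly gives a value of 1, and predicting only the overall mean gives 0, which is why values well above 0 indicate a model with predictive ability.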

Sequence Alignment Methods—Lalign was used to make pair-wise sequence alignments (www.ch.embnet.org/software/LALIGN_form-.html). ClustalW was used to combine the pair-wise alignments (22). The ACC variables were searched for in the sequences with the use of the established alignments from the above method. The sequences were translated to the z values found to be important, followed by sliding along the translated sequences according to the ACC variable, e.g. aligning the first amino acid with the 18th according to variable z(1)1z(1)18 for the GT-A group (see "Results" and Fig. 4 below). The products (according to e.g. z(1)1z(1)18) between the two aligned amino acids at the positions indicated were then calculated, and the values were compared for the aligned positions. Product values with equal signs (cf. z values in Table I), hence positive or negative values, were marked. These positions were indicated on the sequence alignment (cf. Fig. 4 below), and in selected protein structures using Swiss-PdbViewer (23). The prediction of transmembrane segments was done using TMHMM at www.cbs.dtu.dk/services/TMHMM-2.0/ (24).
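The sign-marking step can be illustrated with a small sketch. It is one possible reading of the procedure, under the assumptions that a variable such as z(1)1z(1)18 pairs the residue at a position with the residue 17 positions further along, and that an alignment column is marked when all aligned sequences give a product of the same sign. The numeric rows are hypothetical z1 values, with `None` standing for an alignment gap, and the function names are invented for this example.

```python
def mark_product_signs(z_values, first_pos=1, second_pos=18):
    """For every start position, multiply the descriptor at first_pos by the one
    at second_pos (1-based, relative to the start) and record the sign."""
    offset = second_pos - first_pos
    marks = []
    for i in range(len(z_values) - offset):
        a, b = z_values[i], z_values[i + offset]
        if a is None or b is None:        # alignment gap: nothing to compare
            marks.append(None)
        else:
            prod = a * b
            marks.append("+" if prod > 0 else "-" if prod < 0 else "0")
    return marks

def shared_sign_columns(mark_rows):
    """Columns where every aligned sequence gives the same non-zero sign."""
    shared = {}
    for col, column in enumerate(zip(*mark_rows)):
        if column[0] in ("+", "-") and all(m == column[0] for m in column):
            shared[col] = column[0]
    return shared

# Toy z1 rows for two aligned sequences (None = gap); hypothetical numbers.
row_a = [2.0, -1.0, None, 3.0, -2.0, 1.5]
row_b = [1.0, -3.0, 0.5, 2.0, 2.5, -1.0]
marks = [mark_product_signs(r, first_pos=1, second_pos=4) for r in (row_a, row_b)]
print(shared_sign_columns(marks))
```

Counting the "+" and "-" entries per sequence group gives tallies of positive and negative pairs of the kind reported for the alignments in "Results".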



FIG. 4.
Sequence alignment of selected GTs within the reference sets. The variable pairs indicated were aligned with ClustalW. p, first position in positive variable pairs, e.g. z(1)1z(1)18 (indicated from the PLS weight plots), (+) the second position (e.g. 18); n, first position in a negative variable pair, (-) the second position. Conserved amino acids are marked with (*), (:) amino acids with the same properties, and (.) similar properties as marked in ClustalW.

 

    RESULTS
Fold and Reaction Mechanism Revealed by Multivariate Analysis—Do glycosyltransferases contain sequence property periodicities related to protein architecture and enzyme function? GTs from prokaryotes and eukaryotes using many different substrates were used in the reference sets, selected from the CAZy classification and mainly from families where members have at least one established crystal structure. A few related families were also included when needed. Proteins belonging to the same CAZy family are believed to have the same fold. Multivariate data analysis, here PLS-DA, could divide the glycosyltransferase sequences according to the established (or predicted) fold (Fig. 2). In the structure division between GT-A, GT-B, and the third newly discovered GT-C group, with transmembrane topology, a Q(cum)2 value of 0.730 was obtained (Fig. 2A). This is a high Q(cum)2 value ("prediction ability" of the model, maximum is 1.0; cf. "Experimental Procedures"), but the separation between GT-A and GT-B was less pronounced and partially overlapping. Note that the number of transmembrane segments seems to have less impact on the distribution of the GT-C sequences, i.e. no grouping. To further investigate the differences between the different fold groups, GT-A and GT-B were each analyzed separately with GT-C, yielding Q(cum)2 values of 0.868 (Fig. 2B) and 0.91 (data not shown), respectively. A division between GT-A and GT-B yielded a Q(cum)2 of 0.655 (Fig. 2C). A further separation within the GT-A and GT-B groups was achieved on the basis of reaction mechanism (Fig. 3, A and B), and Q(cum)2 values of 0.702 and 0.562, respectively, were obtained for the inverting and retaining clans within these two fold groups. Hence, multivariate ACC analyses of potential sequence property profiles for glycosyltransferases revealed clear groupings for fold and reaction mechanism by robust PLS models with high cross-validation values.



FIG. 2.
Fold division of the glycosyltransferase reference sets based on sequence property patterns. Q(cum)2 values ("prediction ability" of the model) for each comparison are given in the plots, showing the two first score vectors t1 and t2 from the PLS-DA. The predicted transmembrane segment content within each enzyme is indicated with numbers and the fold group with colors. GT-A, red; GT-B, blue; GT-C, green; and predicted E. coli GTs in black. A, division between the GT-A, GT-B, and GT-C fold types; B, division between GT-A and GT-C; C, GT-A and GT-B division; D, GT-A, GT-B, and GT-C fold division with the E. coli GTs predicted. PS stands for prediction set.

 



FIG. 3.
Division between the reaction stereochemistry within fold types. A and B, the two first score vectors t1 and t2 from the PLS-DA based on sequence property patterns, with retaining (□) or inverting (▼) sugar linkage within the GT-A (A) and GT-B (B) fold types. The Q(cum)2 values are given for each division. C, the corresponding PLS-weight plot (w1*c1/w2*c2) for the GT-B analysis in B, showing the x variable and y variable weights (w* and c, respectively), where variables to the far left and right in the plot have the most separating information. A variable found on the same side as the inversion group is positively correlated to this group and negatively correlated to the retaining group. A weight plot for GT-A is not shown.

 
Separating Sequence Features—A PLS weight plot showing the distribution of each variable for the reaction mechanism division can be drawn for each score plot (cf. "Experimental Procedures"). Here only the weight plot for the division between retaining and inverting GTs in the GT-B group is shown (Fig. 3C). Variables that are on the same side as the sequence class of interest in the score plot are positively correlated, and the variables on the diagonally opposite side of the plot are negatively or oppositely correlated. The variable information was used to investigate if properties responsible for the separation of reaction mechanism could be outlined in the sequences. Sequences from proteins with known structure and the same reaction mechanism and fold were aligned to search for the variables in the sequences. The proteins used to represent the retaining enzymes of the GT-A fold group were α3GalT (PDB, 1FG5) (25), GTA (PDB, 1LZ0) (26), and GTB (PDB, 1LZ7) (26) from CAZy family GT6 and LgtC (PDB, 1G9R) (27) from family GT8; the inverting enzymes were β4Gal-T1 (PDB, 1FGX) (28) from family GT7, GlcaT-I (PDB, 1FGG) (29) from family GT43, SpsA (PDB, 1QGQ) (30) from family GT2, and GnT1 (PDB, 1FO8) (8) from family GT13. The chosen sequences were aligned by various methods, both multiple (ClustalW) and pairwise (LALIGN) alignments, and for both full-length and partial protein sequences. This was performed to establish if there are any conserved regions between the different GT families achieving the same stereochemistry for the sugar linkage. A stretch of 84 amino acids in the retaining group (cf. above) could successfully be aligned. The members of families 6 and 8 have no significant sequence similarity when comparing the whole proteins. However, no long gap-free sequence alignment between all the different proteins belonging to the inverting group (cf. sequence Table in Supplementary Material) could be found using the above methods. 
A selection of variables with high weights found to be important for the reaction mechanism separation within fold group GT-A (as illustrated in Fig. 3) were (in rank) z(2)1z(3)13, z(2)1z(2)5, z(1)1z(1)18, z(2)1z(2)10, z(2)1z(1)4, and z(1)1z(1)14. Important variables at shorter distances were size (z(2)) and hydrophilicity/hydrophobicity (z(1)) patterns, e.g. positions 1 to 4 and 1 to 5, corresponding to 3 and 4 amino acids apart, such as amino acids on the same side of a helix. Autocorrelation analyses have shown that proteins with several helices have a strong periodicity of hydrophobicity, of ~3.7 (31). This is close to the 1 to 5 and 1 to 4 positions in this work (cf. above). The longer correlation distances (1 to 13, 1 to 18 above) are longer than the average lengths of α helices and β strands in proteins, but in Rossmann-like folds important functional residues are frequently found in the connecting loops (32, 33). The established alignments between the selected retaining enzymes were used to search for the latter variables at the sequence level, but shorter distances, hence z(2)1z(2)5 and z(2)1z(1)4, may be more difficult to visualize due to the number of helices in the structures. The variable z(2)1z(3)13 was also used in the search for property patterns in the alignment, but the pattern was not as clear as for the z(1)1z(1)18 variable and could be of importance in other regions of the proteins. Parts of this alignment coincidentally overlap with the UDP binding site of the donor substrates (34). Using the z values from Table I, the correlation for the product of the variable z(1)1z(1)18 along the sequences in the alignment was tentatively identified (Fig. 4). Within the established alignment, 12 variables (i.e. products) were negative and 5 positive for the retaining GT-A group (Fig. 4, top section). This variable should, according to the PLS weight plot, be negatively correlated to this group. 
Positions in the alignment that are members of such variable pairs can be superimposed onto each other in the corresponding crystal structures indicating similar position in space, as illustrated for the GT-B fold group below (Fig. 5). However, no comparison could be made between the retaining and inverting GT-A groups (clans), because a useable comparative alignment for the inverting mechanism sequences could not be established for the families involved.



FIG. 5.
Important sequence pattern positions in active site structures. A, ribbon structure of MurG (PDB code) from E. coli with the UDP-GlcNAc donor substrate shown in yellow. Amino acids conserved in the variable sequence pairs and found around the active site are marked in green. B, the UDP-GlcNAc binding region with the variable sequence pairs indicated: Ala-253, Asp-256, Ser-262, Ser-268, Ala-271, and Pro-276. C, MurG and GtfB (PDB) ribbon structures of the UDP-binding area with the amino acids Gly-300, Ala-303, Gly-309, His-315, Ala-318, and Pro-323 in GtfB superimposed onto the MurG structure. D, the corresponding region in alDGS lipid GT from Acholeplasma laidlawii. The positions marked, Asp-233, Val-235, Glu-242, Val-248, Glu-250, and Pro-257, are members of negative variable pairs.

 
In the GT-B fold group the most important variables were (in rank) z(2)1z(2)5, z(1)1z(1)16, z(2)1z(2)8, z(1)1z(3)14, z(3)1z(1)11, and z(2)1z(2)2 for the separation of the retaining and inverting sequences (Fig. 3B), according to the PLS weight plot (Fig. 3C). More of the major variables here seemed to involve correlations over shorter distances than for the GT-A group; e.g. z(2)1z(2)2 indicates side-chain volumes for residues on opposite sides of β-strands, and z(2)1z(2)5 volumes for residues next to each other on the face of an α helix, respectively. The alignment method described above was also applied here, and an alignment was established between sequences from the GT1 and GT28 families, with no or low sequence homology over the whole proteins. GtfB (PDB, 1IIR) (35) and MurG (PDB, 1NLM) (36) were chosen to correlate the alignment to the structures. Two more sequences from each family were chosen to get a more stable alignment. Furthermore, to compare the inverting and retaining mechanisms, four sequences were chosen from family GT4, a retaining family, but without solved crystal structures. Two of these have a well established (validated) fold model, the alDGS and spMGS lipid glycosyltransferases (37). An alignment could be made for both reaction mechanisms for these proteins over the same region, which contains the UDP-binding sites (Fig. 4, middle/bottom sections). The variable z(1)1z(1)16 (cf. above) was traced at the sequence level by its z products and should be positively correlated to the inverting enzyme group according to Fig. 3. In the alignment 18 positive and 7 negative pairs were found within the inverting group, and 3 positive and 7 negative pairs were found for the retaining families (Fig. 4), confirming the importance of the z(1)1z(1)16 variable. A visualization of the donor substrate binding regions in the MurG and GtfB structures (both inverting) showed that these positions seem to occupy similar positions in the structure space (Fig. 
5, A-C), potentially so also in the model for a retaining enzyme (Fig. 5D). The difference between inverting and retaining reaction mechanism within the GT-B fold group sequences was the amino acid hydrophobicity. In the inverting group, where positive values are important, the property pattern pairs consist of one hydrophobic and one hydrophilic amino acid (see Table I). The retaining group should have equal sign hydrophobicity values, hence two hydrophobic or two hydrophilic amino acids at positions 1 and 16. The pattern was similar for the GT-A fold group, where the variable is z(1)1z(1)18; however, the pattern has not yet been identified in the inverting enzyme clan.

Predictions from Genomes—The three GT fold groups could be well separated from each other, where the division between GT-C and each of the two others separately was stronger (Fig. 2). All proposed GTs from E. coli annotated in the CAZy data base were used to evaluate the prediction power of the ACC/PLS model. The results from the fold predictions are shown in Fig. 2D and Table II. Many family members of the GT-A fold group are known to have an N-terminal helix that anchors the protein to the membrane; TM segments in the GT-B group are, however, rare. A pair-wise division between the GT-A and GT-B groups individually with the GT-C group was also made, and the E. coli proteins were predicted into these models (data not shown), revealing the same results as with the three fold groups together. Evaluation of Table II shows that seven E. coli glycosyltransferases were predicted to belong to the GT-C fold group. All these have one or more hydrophobic TM segments each. No proteins without TM segments were incorrectly grouped to this family for the E. coli set. E. coli has four other proteins that have one proposed TM segment each, but they were not grouped to the GT-C fold. The latter ones only come from two families, GT8 and GT51. GT8 has a member with a solved crystal structure, LgtC of the GT-A fold group (27). The C-terminal part of LgtC consists of a domain very rich in basic residues and several hydrophobic and aromatic residues in an amphipathic organization, but no TM segments. Hence, presence/absence of amphipathic segments seemed of minor importance (data not shown). This domain is believed to bind to the membrane and is relatively conserved within the family (38). The E. coli proteins described above from GT8 have the conserved amino acids (data not shown), indicating that these proteins are correctly grouped; hence they do not have any TM segments. 
Multiple fold predictions, made by the superior MetaServer (39) (available at bioinfo.pl/meta), were used to evaluate the ACC/PLS predictions here. The family GT51 had none of the established fold types according to the MetaServer, supporting the ACC/PLS results. The E. coli glycosyltransferases with predicted transmembrane helices from GT2 all have the GT-A fold as a separate domain in between or after the integral membrane domains. Generally, the results from the MetaServer coincided with the results obtained with the multivariate method used here (see Table III).


TABLE II
Predictions from the E. coli GT set

The scores describe the probability of belonging to a group, with 1 being an absolute score. The highest scores within each group are highlighted. The MetaServer results are shown to the right. GT26, GT51, and GT66 have none of the three known fold types according to Ref. 5, but GT26 has been proposed to have the GT-B fold type by Liu and Mushegian (4).

a EC201; E. coli GT sequence 1 of CAZy family 2. EC6601; sequence 1 of family 66.

b Score in the three-dimensional Jury compilation of the MetaServer (39); a score of 50 corresponds to a prediction accuracy of above 90%.

 


TABLE III
Prediction yields from different fold and reaction mechanisms for the E. coli and Synechocystis genome sets

Intermediate prediction results, i.e. results close to 0.5, have been removed (see Table II).

 
The 61 annotated, potential glycosyltransferases from Synechocystis found in the CAZy data base (data not shown) were also analyzed by the same method. Here, ten GTs were predicted to belong to the GT-C group. The separation between the GT-A and GT-B folds, and of each versus the GT-C group (as in Fig. 2, C and D), was also performed for the Synechocystis GTs. The results were again very similar between the different classification methods (Table III). The proteins grouped to the GT-C fold type all have a high content of hydrophobic TM segments. There were, however, a few exceptions: one protein from family 19 was grouped to the GT-C group but had no predicted hydrophobic TM segment, and there were eight GTs that had predicted TM segments but were not recognized by the ACC method. Within the latter, three were from the CAZy GT51 family. This group was not grouped to the GT-C fold type (cf. above), even though its members all had one predicted (but not experimentally verified) TM segment. A total evaluation of the prediction results using the MetaServer for "benchmarking" revealed that only 10 out of 54, or 19%, were "incorrectly" grouped (including TM ones). These results include classification of proteins with TM segments as a correct result. Within these ten, two contain both the GT-A and GT-B fold types (gene numbers sll1528 and slr1063) and had intermediate PLS results, grouping to either of the groups with a prediction value close to 0.5. These enzyme sequences were spliced (in silico) according to fold type. This division distributed the new sequence segments to the "correct" fold types (data not shown). Two other proteins (slr1816 and slr0626) also consisted of multiple domains with different functions, such as a known GT fold linked to a TPR domain involved in protein interactions (40). The domain with a known GT fold fell into the correct fold type group when spliced (data not shown).
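
The handling of intermediate prediction values described above can be sketched as follows. The width of the band around 0.5 (`margin`), the function name, and the example scores are assumptions made for illustration; no numerical cutoff is stated in the text.

```python
def call_membership(score, margin=0.1):
    """Turn a two-class PLS-DA prediction value into a call.

    Values near 1 indicate membership, values near 0 the other class.
    The margin around 0.5 is a hypothetical choice, not taken from the paper.
    """
    if abs(score - 0.5) < margin:
        return "intermediate"
    return "member" if score > 0.5 else "non-member"


# Hypothetical prediction values keyed by made-up sequence labels.
scores = {"seqA": 0.92, "seqB": 0.48, "seqC": 0.07}
calls = {name: call_membership(s) for name, s in scores.items()}
# Intermediate results are set aside before prediction yields are computed.
kept = {name: c for name, c in calls.items() if c != "intermediate"}
print(kept)
```

A sequence flagged as intermediate in this way would be a natural candidate for the in silico splicing by fold type described above.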

In summary, glycosyltransferases can be classified and analyzed by multivariate analysis methods on the basis of sequence property patterns. The method could successfully predict the fold of GTs and the orientation of the catalyzed bond between the donor sugar and the acceptor molecule, i.e. retaining or inverting mechanisms. Potentially important amino acid positions in the donor sites were also suggested. From genomic analyses, new fold types, where the two major folds GT-A and GT-B were found within the same protein, could also be recognized.


    DISCUSSION
 
Fold predictions of glycosyltransferases have recently been discussed by many (4, 41, 42). This is an important enzyme group, because the enzymes utilize many different sugar substrates and act on an extensive number of acceptors, with only a few fold groups described. The sequence similarity varies greatly between enzymes performing the same enzymatic reaction, but the fold can still be the same. GTs have also been used to evaluate different fold recognition methods, and the fold knowledge has been used in attempts to identify new GTs in organisms, e.g. Mycobacterium tuberculosis (42). In the latter study, multivariate methods were used to analyze a large data set, including fold prediction results, as well as molecular mass and theoretical isoelectric points (42). In the present study a different approach was developed. Glycosyltransferases of known structure were analyzed by translating the amino acid sequence into z-scores describing their physicochemical properties (19), followed by comparing property patterns between the sequences. Of course, some other supervised classification methods that are applicable to problems with more variables than objects could have been used here. For example, support vector machines have successfully been applied by Chou and co-workers (43, 44) to predict structural domains in whole proteins. However, because we are at present unaware of a more general comparison between PLS-DA and the support vector machine classifier, we will not speculate on the outcome of a change of classification method; most advanced classification methods, if correctly trained, usually give rather similar results.
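
The translation of a sequence into property values followed by an auto- and cross-covariance (ACC) comparison can be sketched as below. This is a minimal illustration, not the authors' implementation: the normalization by the number of residue pairs, the absence of centering or scaling, the function name, and the two-property descriptor table (placeholder numbers, not the published z-scales) are all assumptions.

```python
def acc_terms(seq, descriptors, max_lag):
    """Compute auto- and cross-covariance terms for one sequence.

    descriptors maps each residue letter to a tuple of property values.
    Returns {(j, k, lag): value} where j and k index the properties.
    """
    z = [descriptors[res] for res in seq]
    n = len(z)
    n_props = len(z[0])
    terms = {}
    for lag in range(1, max_lag + 1):
        if lag >= n:
            break  # no residue pairs exist at this separation
        for j in range(n_props):
            for k in range(n_props):
                total = sum(z[i][j] * z[i + lag][k] for i in range(n - lag))
                terms[(j, k, lag)] = total / (n - lag)
    return terms


# Placeholder two-property descriptors; these are NOT the published z-scales.
TOY = {"A": (1.0, -1.0), "K": (-2.0, 0.5), "L": (2.0, 1.0)}
terms = acc_terms("ALKA", TOY, max_lag=2)
print(len(terms))  # 8 terms: 2 properties x 2 properties x 2 lags
```

The number of terms depends only on the number of properties and lags, not on sequence length, so sequences of different lengths yield vectors of equal size that a classifier such as PLS-DA can compare.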

Conservation of Properties—The main idea in this study was that properties are more conserved than amino acid species, leading to conservation of structures (cf. Mirny and Shakhnovich (33)). Describing and comparing the GTs based on property patterns worked well both for fold type classification and for the stereochemistry of the enzyme product sugar linkage (Figs. 2 and 3). The division between different fold types was better resolved when only two fold types were analyzed at a time, especially when the membrane-spanning group GT-C was included. When all three fold types were separated in the same model, a partial overlap between GT-A and GT-B was seen. Both GT-A and GT-B consist of Rossmann-like folds, with alternating alpha helices and beta strands. This was recognized by the method, and the difference between these two fold types was smaller than the difference between globular proteins and transmembrane ones with a high TM helix content (i.e. GT-A plus GT-B versus GT-C) (Fig. 2A). Amphipathic helices did not interfere; hence property patterns are very different for this transmembrane group compared with the cytosolic and surface-bound proteins.

Evaluation of Genome Data—The multivariate method was also used to predict the fold for all GTs in E. coli and Synechocystis included in CAZy (Tables II and III). In E. coli the proportion of GTs with established functions is high, and Synechocystis is an organism with very high GT content. The data set consists of GTs belonging to the same GT families as the reference set but also from other families. This method works best with proteins from families included in the reference set; it becomes easier for the program to predict the membership probability if the predicted protein shares some homology with the proteins in the reference set, most evident for the GT-A and GT-B fold groups.

The multivariate data analysis method used here recognized the hydrophobic helices and grouped these proteins according to the TM helix content to the GT-C family, even when the dominating part of the protein had the GT-A or GT-B fold. In the E. coli set, seven GTs with high TM content were predicted to belong to the GT-C group (Table II). These proteins are annotated in CAZy families 2 (GT-A fold) and 51 (indifferent) and are not considered part of the GT-C group (4). Five other E. coli proteins containing transmembrane segments were not grouped to this latter fold group. Note that the TM content within the same GT family seems to vary greatly, and TM segments are not accounted for in CAZy. Despite this, fold and reaction mechanism could properly be predicted for proteins grouped to other GT families than the major ones.

The predictions were consistent, independent of the fold families included in the separation analysis and independent of whether two or three fold classes were used. The results obtained from the MetaServer were used as a benchmark comparison. When predicting the fold with all three fold groups in the reference set, 82% of the E. coli GTs were predicted to have the right fold; excluding the GT-C group increased the prediction correctness to 86% (Table III). In the first fold prediction, recognition of TM segments was regarded as the correct result. The scores were improved when the GT-C group was removed from the prediction, because the TM content is then no longer accounted for (Table III). Predicted glycosyltransferases from Synechocystis were also analyzed by the same method, but even though the fraction of GTs belonging to the major families GT2 and GT4 is larger, the prediction ability was somewhat lower (Table III). However, most of these are not as well studied as the E. coli enzymes.
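
The prediction yields quoted here are agreement percentages against the benchmark. A minimal sketch of that arithmetic follows; the function name and the five example fold calls are hypothetical and are not data from the study.

```python
def prediction_yield(predicted, benchmark):
    """Percentage of sequences whose predicted group matches the benchmark."""
    if len(predicted) != len(benchmark):
        raise ValueError("label lists must have the same length")
    if not predicted:
        raise ValueError("no predictions to evaluate")
    matches = sum(p == b for p, b in zip(predicted, benchmark))
    return 100.0 * matches / len(predicted)


# Hypothetical fold calls for five sequences (not data from the paper).
predicted = ["GT-A", "GT-B", "GT-C", "GT-A", "GT-B"]
benchmark = ["GT-A", "GT-B", "GT-A", "GT-A", "GT-B"]
print(prediction_yield(predicted, benchmark))  # 80.0
```

Applied to the Synechocystis counts given in the Results (10 of 54 grouped differently from the MetaServer), the same arithmetic gives about 19% disagreement, i.e. about 81% agreement.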

Retaining and Inverting Mechanisms—The ability to predict retaining and inverting mechanisms was even higher than the structural predictions, 100% for the E. coli and 86% for the Synechocystis set within the GT-A fold group, and 60% versus 80% for the GT-B group. The same comparison for the GT-C group cannot be made, because it seems to consist only of inverting enzymes. Little is understood about the differences at the sequence level between inversion and retention. The importance of acidic residues in the active site has been suggested, but exact conserved positions have not been established (41). Extensive sequence alignments revealed here that there are charge differences in the UDP-binding motif within the GT-B fold family (GT4, GT20 and GT1, GT28, respectively; data not shown). The conserved EX7E motif has an Asp or Glu in the first position in GT4 and GT20 (retaining), but the inverting enzymes in GT1 and GT28 have a His, Lys, or Arg. The Lys-261 in MurG is known to bind to a phosphate in the UDP-GlcNAc donor (36), and His-293 in GtfA is known to have the same function (45); hence there is a charge difference between the two reaction mechanisms within this motif.
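
The charge difference in the first motif position can be looked for with a simple pattern search. The sketch below reads EX7E as a first residue, seven arbitrary residues, and a terminal Glu, which is an assumption about the motif notation; the function name and the sequences are made up. A bare pattern like this will also match unrelated stretches, so a hit is only a candidate to be checked against an alignment.

```python
import re

# E-X7-E is read here as: first residue, seven arbitrary residues, then Glu.
# The lookahead lets overlapping candidates be reported as well.
MOTIF = re.compile(r"(?=([DEHKR])[A-Z]{7}E)")


def motif_first_positions(seq):
    """List (1-based start, first residue, class) for each candidate motif."""
    hits = []
    for m in MOTIF.finditer(seq):
        first = m.group(1)
        cls = "D/E" if first in "DE" else "H/K/R"
        hits.append((m.start() + 1, first, cls))
    return hits


# Made-up sequences for illustration only.
print(motif_first_positions("GG" + "E" + "A" * 7 + "E" + "GG"))  # [(3, 'E', 'D/E')]
print(motif_first_positions("K" + "A" * 7 + "E"))                # [(1, 'K', 'H/K/R')]
```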

In the established alignment, the UDP-binding region was further analyzed to search for differences between the inverting and retaining clans in the GT-B family. The differences within the inverting enzyme clan of the GT-A group were too large for an alignment to be established, and choosing only a few families could give incorrect results. Within the GT-B group there are three solved crystal structures for the inverting group. There is one solved structure (OtsA) within the retaining enzymes, and three fold models within the retaining GT4 family. The OtsA structure could not be aligned over the active site with the members of GT4 because of longer loops and helices in OtsA than in the other enzymes (46). However, a comparison between the two different reaction mechanisms could be achieved at the sequence level and even at the structure level. The conserved positive variable patterns within the inverting clan (Fig. 4) were found to be superimposed when comparing the structures within the GT-B fold group (Fig. 5). The best comparison could be made for the inverting enzyme group, where three different crystal structures have been solved: GtfA (45), GtfB (35), and MurG (36, 47). GtfA was not used in the property alignment because of its different donor sugar nucleotide (45). A preliminary comparison could also be made between inverting and retaining mechanisms within the GT-B group. When superimposing a fold model of alDGS (retaining) onto GtfB and MurG, the corresponding negative variable pairs in the retaining group (Fig. 4) were found to be located in the same area (Fig. 5). The positions in MurG (inverting, GT-B fold) that are known to bind UDP-GlcNAc, Ala-264, Leu-265, Thr-266, Glu-269, Gln-288, and Gln-289 (36), are located within the alignment (Fig. 4); hence the property pattern positions might be involved in guiding the donor substrate to the right orientation. They can also be important for the right structure of the {alpha}/{beta}/{alpha} motif of the active site (47).
The positions in the GT-A retaining group were also found around the nucleotide binding area (Fig. 4) and may have similar functions (34). The property pattern positions from this alignment could also be superimposed at the structure level, indicating a functional importance (data not shown). These GT-B and GT-A sites are good targets for functional (mutational) analyses.

Conclusions—The ACC/PLS (multivariate) method that we describe here structurally classifies and identifies glycosyltransferases with high accuracy, on the basis of amino acid sequence information. The method can even be used for predicting the stereochemistry of the reaction mechanism. Potential separating sequence parameters between the inverting and retaining mechanisms have also been suggested. The positions found to be conserved within fold groups performing the same stereochemical reaction are good candidates for mutagenesis, to better understand the differences between the two reaction types. This study has also identified four proteins containing more than one fold type, i.e. Synechocystis slr1816, sll1528, slr0626, and slr1063. Multivariate analysis of all GTs annotated in the CAZy data base may find new fold groups. Detailed analyses of large GT sequence families, such as GT2 and GT4, could potentially also find subgroups related to specific substrates, products, and small structure differences such as high TM content.


    FOOTNOTES
 
* This work was supported by the Swedish Natural Science Research Council. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

The on-line version of this article (available at http://www.jbc.org) contains Supplemental Table SI.

To whom correspondence should be addressed. Tel.: 46-8-16-24-63; Fax: 46-8-15-36-79; E-mail: ake{at}dbb.su.se.

1 The abbreviations used are: GT, glycosyltransferase; CAZy, carbohydrate active enzymes; SGC, SpsA GnT I core; TM, transmembrane; ORF, open reading frame; ACC, autocross-covariance; PLSDA, partial least squares projections to latent structures discriminant analysis.

2 T. Rajalahti, F. Huang, M. Sjöström, Å. Wieslander, and B. Norling, manuscript in preparation.


    ACKNOWLEDGMENTS
 
We thank Dr. Anders Öhman, Umeå University, for his help with the fold analysis, and the Swedish Natural Science Research Council for financial support.



    REFERENCES
 

  1. Boix, E., Swaminathan, G. J., Zhang, Y., Natesh, R., Brew, K., and Acharya, K. R. (2001) J. Biol. Chem. 276, 48608-48614
  2. Bourne, Y., and Henrissat, B. (2001) Curr. Opin. Struct. Biol. 11, 593-600
  3. Breton, C., and Imberty, A. (1999) Curr. Opin. Struct. Biol. 9, 563-571
  4. Liu, J., and Mushegian, A. (2003) Protein Sci. 12, 1418-1431
  5. Kikuchi, N., Kwon, Y. D., Gotoh, M., and Narimatsu, H. (2003) Biochem. Biophys. Res. Commun. 310, 574-579
  6. Unligil, U. M., and Rini, J. M. (2000) Curr. Opin. Struct. Biol. 10, 510-517
  7. Breton, C., Mucha, J., and Jeanneau, C. (2001) Biochimie (Paris) 83, 713-718
  8. Unligil, U. M., Zhou, S., Yuwaraj, S., Sarkar, M., Schachter, H., and Rini, J. M. (2000) EMBO J. 19, 5269-5280
  9. Campbell, J. A., Davies, G. J., Bulone, V., and Henrissat, B. (1997) Biochem. J. 326, 929-939
  10. Henrissat, B., and Davies, G. J. (2000) Plant Physiol. 124, 1515-1519
  11. Davies, G., and Henrissat, B. (1995) Structure 3, 853-859
  12. Henrissat, B., and Davies, G. (1997) Curr. Opin. Struct. Biol. 7, 637-644
  13. Gebler, J., Gilkes, N. R., Claeyssens, M., Wilson, D. B., Beguin, P., Wakarchuk, W. W., Kilburn, D. G., Miller, R. C., Jr., Warren, R. A., and Withers, S. G. (1992) J. Biol. Chem. 267, 12559-12561
  14. Rackovsky, S. (1998) Proc. Natl. Acad. Sci. U. S. A. 95, 8580-8584
  15. Wold, S., and Sjöström, M. (1998) Acta Chem. Scand. 52, 517-523
  16. Wold, S., Jonsson, M., Sjöström, M., Sandberg, M., and Rännar, S. (1993) Anal. Chim. Acta 277, 239-253
  17. Sjöström, M., Rännar, S., and Wieslander, Å. (1995) Chemometrics Intell. Lab. Syst. 29, 295-305
  18. Edman, M., Jarhede, T., Sjöström, M., and Wieslander, A. (1999) Proteins 35, 195-205
  19. Hellberg, S., Sjöström, M., Skagerberg, B., and Wold, S. (1987) J. Med. Chem. 30, 1126-1135
  20. Wold, S., Eriksson, L., and Sjöström, M. (1998) in Encyclopedia of Computational Chemistry (Schleyer, v. R., ed) pp. 2006-2022, John Wiley & Sons, New York
  21. Eriksson, L., Johansson, E., Kettaneh-Wold, N., and Wold, S. (2001) Multi- and Megavariate Data Analysis Principles and Applications, Umetrics, Umeå
  22. Thompson, J. D., Higgins, D. G., and Gibson, T. J. (1994) Nucleic Acids Res. 22, 4673-4680
  23. Guex, N., and Peitsch, M. C. (1997) Electrophoresis 18, 2714-2723
  24. Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E. L. (2001) J. Mol. Biol. 305, 567-580
  25. Gastinel, L. N., Bignon, C., Misra, A. K., Hindsgaul, O., Shaper, J. H., and Joziasse, D. H. (2001) EMBO J. 20, 638-649
  26. Patenaude, S. I., Seto, N. O., Borisova, S. N., Szpacenko, A., Marcus, S. L., Palcic, M. M., and Evans, S. V. (2002) Nat. Struct. Biol. 9, 685-690
  27. Persson, K., Ly, H. D., Dieckelmann, M., Wakarchuk, W. W., Withers, S. G., and Strynadka, N. C. (2001) Nat. Struct. Biol. 8, 166-175
  28. Gastinel, L. N., Cambillau, C., and Bourne, Y. (1999) EMBO J. 18, 3546-3557
  29. Pedersen, L. C., Tsuchida, K., Kitagawa, H., Sugahara, K., Darden, T. A., and Negishi, M. (2000) J. Biol. Chem. 275, 34580-34585
  30. Tarbouriech, N., Charnock, S. J., and Davies, G. J. (2001) J. Mol. Biol. 314, 655-661
  31. Horne, D. S. (1988) Biopolymers 27, 451-477
  32. Brändén, C.-I., and Tooze, J. (1999) Introduction to Protein Structure, 2nd Ed., Garland, New York
  33. Mirny, L. A., and Shakhnovich, E. I. (1999) J. Mol. Biol. 291, 177-196
  34. Heissigerova, H., Breton, C., Moravcova, J., and Imberty, A. (2003) Glycobiology 13, 377-386
  35. Mulichak, A. M., Losey, H. C., Walsh, C. T., and Garavito, R. M. (2001) Structure 9, 547-557
  36. Hu, Y., Chen, L., Ha, S., Gross, B., Falcone, B., Walker, D., Mokhtarzadeh, M., and Walker, S. (2003) Proc. Natl. Acad. Sci. U. S. A. 100, 845-849
  37. Edman, M., Berg, S., Storm, P., Wikström, M., Vikström, S., Öhman, A., and Wieslander, Å. (2003) J. Biol. Chem. 278, 8420-8428
  38. Wakarchuk, W. W., Cunningham, A., Watson, D. C., and Young, N. M. (1998) Protein Eng. 11, 295-302
  39. Ginalski, K., Elofsson, A., Fischer, D., and Rychlewski, L. (2003) Bioinformatics 19, 1015-1018
  40. Blatch, G. L., and Lassle, M. (1999) BioEssays 21, 932-939
  41. Franco, O. L., and Rigden, D. J. (2003) Glycobiology 13, 707R-712R
  42. Wimmerova, M., Engelsen, S. B., Bettler, E., Breton, C., and Imberty, A. (2003) Biochimie (Paris) 85, 691-700
  43. Chou, K. C., and Cai, Y. D. (2002) J. Biol. Chem. 277, 45765-45769
  44. Cai, Y. D., Lin, S. L., and Chou, K. C. (2003) Peptides 24, 159-161
  45. Mulichak, A. M., Losey, H. C., Lu, W., Wawrzak, Z., Walsh, C. T., and Garavito, R. M. (2003) Proc. Natl. Acad. Sci. U. S. A. 100, 9238-9243
  46. Gibson, R. P., Tarling, C. A., Roberts, S., Withers, S. G., and Davies, G. J. (2003) J. Biol. Chem. 279, 1950-1955
  47. Ha, S., Walker, D., Shi, Y., and Walker, S. (2000) Protein Sci. 9, 1045-1052
