Single-subunit oligosaccharyltransferases of Trypanosoma brucei display different and predictable peptide acceptor specificities

Trypanosoma brucei causes African trypanosomiasis and contains three full-length oligosaccharyltransferase (OST) genes; two of which, TbSTT3A and TbSTT3B, are expressed in the bloodstream form of the parasite. These OSTs have different peptide acceptor and lipid-linked oligosaccharide donor specificities, and trypanosomes do not follow many of the canonical rules developed for other eukaryotic N-glycosylation pathways, raising questions as to the basic architecture and detailed function of trypanosome OSTs. Here, we show by blue-native gel electrophoresis and stable isotope labeling in cell culture proteomics that the TbSTT3A and TbSTT3B proteins associate with each other in large complexes that contain no other detectable protein subunits. We probed the peptide acceptor specificities of the OSTs in vivo using a transgenic glycoprotein reporter system and performed glycoproteomics on endogenous parasite glycoproteins using sequential endoglycosidase H and peptide:N-glycosidase-F digestions. This allowed us to assess the relative occupancies of numerous N-glycosylation sites by endoglycosidase H-resistant N-glycans originating from Man5GlcNAc2-PP-dolichol transferred by TbSTT3A, and endoglycosidase H-sensitive N-glycans originating from Man9GlcNAc2-PP-dolichol transferred by TbSTT3B. Using machine learning, we assessed the features that best define TbSTT3A and TbSTT3B substrates in vivo and built an algorithm to predict the types of N-glycan most likely to predominate at all the putative N-glycosylation sites in the parasite proteome. Finally, molecular modeling was used to suggest why TbSTT3A has a distinct preference for sequons containing and/or flanked by acidic amino acid residues. Together, these studies provide insights into how a highly divergent eukaryote has re-wired protein N-glycosylation to provide protein sequence-specific N-glycan modifications. Data are available via ProteomeXchange with identifiers PXD007236, PXD007267, and PXD007268.

The tsetse-fly-transmitted protozoan parasite Trypanosoma brucei and its close relatives are responsible for human and animal African trypanosomiasis. The animal-infecting bloodstream forms of these organisms depend on surface coats made of glycosylphosphatidylinositol (GPI)-anchored and N-glycosylated variant surface glycoprotein (VSG) to evade the innate host immune system (1) and the acquired immune system through antigenic variation (2). Furthermore, they express many less abundant glycoproteins such as their novel transferrin receptors (3-5), a novel lysosomal/endosomal protein called p67 (6), the so-called invariant surface glycoproteins (ISGs) (7), and invariant endoplasmic reticulum glycoproteins (8), the Golgi/lysosomal glycoprotein tGLP-1 (9), the membrane-bound histidine acid phosphatase TbMBAP1 (10), the flagellar adhesion zone glycoproteins Fla1 and Fla2 (11), and others. Whereas some of these are bloodstream-form specific glycoproteins (VSGs, ISGs, TbMBAP1, and transferrin receptors), others are common to the tsetse midgut-dwelling procyclic form of the parasite. Furthermore, procyclic form parasites also express unique glycoproteins, notably the abundant GPI-anchored procyclins, some of which are N-glycosylated (12,13), and the partially characterized high-molecular-weight glycoconjugate (14,15). Many of the N-glycan structures expressed by T. brucei have been solved, and these include conventional oligomannose and biantennary complex structures as well as paucimannose and extremely unusual "giant" poly-N-acetyllactosamine (poly-LacNAc) containing complex structures in the bloodstream form of the parasite (16-19). In contrast, only oligomannose N-glycans have been structurally described in wild-type procyclic trypanosomes (12,20). The unusual repertoire of the T. brucei bloodstream-form N-glycans and the original observation by Bangs et al. (21) that Endo H-resistant N-glycans appear immediately following protein synthesis, and not following transport to the Golgi apparatus, have stimulated our group to study the fundamentals of protein N-glycosylation in this divergent eukaryotic pathogen.
Protein N-glycosylation is believed to be a ubiquitous posttranslational modification among the eukaryotes, with the canonical model based primarily on extensive studies in mammalian cells and the yeast Saccharomyces cerevisiae (22,23). In this canonical model, there are a number of tenets that include: (i) The mature lipid-linked oligosaccharide (LLO) donor, and the preferred substrate for oligosaccharyltransferases (OSTs), is Glc3Man9GlcNAc2-PP-dolichol. (ii) The OSTs are hetero-oligomers of 8 or 9 distinct subunits. (iii) The OSTs may fall into two classes (A and B) according to their subunit composition with different peptide acceptor specificities. (iv) The ER enzyme UDP-glucose-glycoprotein glucosyltransferase (UGGT) operates on tri-antennary Man9-7GlcNAc2 structures with a complete a-branch, but not on bi-antennary structures. (v) The action of Golgi mannosidase II is a pre-requisite for the conversion of oligomannose to complex N-glycans. (vi) The enzymes GnTI and GnTII have strict acceptor substrate specificities, operating on Man5GlcNAc2 and GlcNAcMan3GlcNAc2, respectively. However, none of these tenets apply to N-glycosylation in the protozoan parasite T. brucei where the following points have been noted. (i) The largest LLO is Man9GlcNAc2 (24,25) and different OSTs preferentially transfer either this structure or bi-antennary Man5GlcNAc2 (25-28). (ii) There is no evidence for OST subunits other than the catalytic STT3 subunits in T. brucei or the related parasites Trypanosoma cruzi and the Leishmania (26, 28-32). (iii) OST subclasses and their peptide acceptor specificities (which are more disparate in T. brucei than for other eukaryotes) are defined only by their STT3 components (28). (iv) The parasite UGGT works efficiently on all structures (bi- and tri-antennary) with an intact a-branch (28). (v) Golgi-mannosidase II is absent, preventing the conversion from the oligomannose series to the complex series of N-glycans (25). (vi) The parasite GnTI and GnTII βGlcNAc-transferases have different specificities to canonical GnTIs and GnTIIs and belong to a different glycosyltransferase family (33,34).
Here, we present the following. (i) We directly address the oligomeric states of T. brucei STT3 subunits. (ii) We look for evidence for any non-canonical OST subunits in addition to the TbSTT3s. (iii) We probe the peptide acceptor specificities of TbSTT3A (Tb927.5.890) and TbSTT3B (Tb927.5.900) using a reporter glycoprotein expression system and by glycoproteomics. (iv) We use machine learning to predict which putative N-glycosylation sites in bloodstream-form T. brucei will be modified by TbSTT3A or TbSTT3B.

Blue native gel electrophoresis of in situ tagged TbSTT3A suggests it is present in high molecular weight complexes
To enable immunoprecipitation of TbSTT3A, the 3′-end of the endogenous gene in a heterozygote cell line (TbSTT3A/B/C+/−) was tagged in situ with a sequence encoding a C-terminal HA3 epitope. Transfected cells were cloned and analyzed by Southern blotting to confirm correct insertion of the tag (supplemental Fig. S1A). To check that the tag did not impair the function of TbSTT3A, the glycosylation of VSG221 was analyzed in these cells. VSG221 receives different types of glycan at the two N-glycosylation sequons in the protein: Endo H-resistant Man5GlcNAc2 at Asn-263 and Endo H-sensitive Man9GlcNAc2 at Asn-428 (17,28). Following PNGaseF and Endo H treatment, the typical digestion pattern for wild-type VSG221 was seen also for the transgenic cells (supplemental Fig. S1B), showing that C-terminally tagged TbSTT3A-HA3 is functional. Subsequently, cells were lysed under mild conditions (0.5% digitonin on ice for 30 min), and the clarified cell lysates were incubated with anti-HA mouse antibody, followed by magnetic beads coupled to protein G. The pull-out eluates were analyzed by SDS-PAGE and Western blotting using an anti-HA antibody. As expected, epitope-tagged TbSTT3A-HA3 was detected running just below the position of the 75-kDa molecular mass marker in the TbSTT3A-HA3 pull-out, whereas no band was seen in the wild-type cell pull-out (Fig. 1A). However, when the same eluates were analyzed by blue native gel electrophoresis and anti-HA Western blotting, a smear (specific for the TbSTT3A-HA3 cell line) was detected between 700 and 1200 kDa, suggesting that TbSTT3A is present in large complexes (Fig. 1B).

SILAC proteomics shows TbSTT3A and TbSTT3B form hetero-oligomeric complexes without other subunits
Because the results from the blue native gel electrophoresis suggested that TbSTT3A is present in high molecular weight complexes, we carried out pull-out experiments using SILAC. For this experiment, wild-type and transgenic parasites (expressing TbSTT3A-HA3) were grown under identical conditions for eight cell divisions, except that the transgenic TbSTT3A-HA3 cell line was grown in "heavy medium" containing stable isotope-labeled Lys and Arg (R6K4), whereas the wild-type cells were grown in "light medium" containing unlabeled Lys and Arg (R0K0). The transgenic TbSTT3A-HA3 and the wild-type cells were harvested, washed, counted, mixed together in a 1:1 ratio, and lysed in 0.5% digitonin buffer. Anti-HA antibodies and protein G magnetic beads were used to pull out the TbSTT3A-HA3-tagged protein and any binding partners, and the bead eluate was processed to tryptic peptides for LC-MS/MS analysis. In this kind of SILAC experiment, TbSTT3A-HA3 and any proteins specifically associated with it can be distinguished from non-specific contaminant proteins by the isotope ratios of their tryptic peptides. Thus, TbSTT3A-HA3 and true associated protein peptides will have high heavy/light isotope ratios, whereas contaminant proteins will have approximately equal heavy/light isotope ratios (Fig. 2A). The data set from the experiment was used to search a T. brucei predicted protein database using MaxQuant software. Each protein was displayed on a plot of the log10 value of the intensities of the unique peptides of that protein (y axis) and the log2 value of the heavy to light isotope ratios of the same peptides (x axis) (Fig. 2B). The TbSTT3A-HA3 (bait) protein had the highest heavy/light ratio (14:1), closely followed by TbSTT3B (10:1). Only three other proteins were significantly enriched (orange crosses, Fig. 2B). However, these were only marginally (1.5-fold) enriched hits that are not known to localize to the ER.

N-Glycosylation in Trypanosoma brucei

Figure 2. Overview of the SILAC pull-out experiment and plot of proteomics data. A, overview of the SILAC experiment. Wild-type bloodstream-form cells were grown in light (R0K0) medium, and cells expressing in situ-tagged TbSTT3A-HA3 were labeled with heavy (R6K4) medium. The cells were mixed 1:1 and lysed with digitonin, and TbSTT3A-HA3 was enriched by affinity selection on anti-HA magnetic beads. Peptides from TbSTT3A-HA3 and genuine binding proteins have high heavy/light isotope ratios, whereas those from contaminants will have ratios close to 1:1 (log2 = 0). B, results from the TbSTT3A SILAC pull-out experiment. The plot shows the log2 of the heavy-to-light isotope ratio (x axis) versus the log10 value of the intensities of the peptides belonging to each protein that was detected (y axis). The black curves (marked sigma 3) represent three standard deviations from the mean. Proteins plotted in orange have a heavy-to-light ratio above the sigma 3 cut off and are significantly enriched. TbSTT3A-HA3 (bait) and TbSTT3B (both annotated) were shown to be highly enriched and are highlighted in red. C, results from the TbSTT3B-Myc3 SILAC pull-out experiment. The plot is the same as in B except that in situ-tagged TbSTT3B-Myc3 was used as bait. Again, TbSTT3A and TbSTT3B (both annotated and highlighted in red) were significantly enriched.

The TbSTT3A-HA3 cell line was further modified by the in situ tagging of the remaining TbSTT3B allele, to yield a cell line expressing C-terminally Myc3-tagged TbSTT3B-Myc3. A complementary SILAC experiment, using in situ TbSTT3B-Myc3-tagged bait and an anti-Myc pull-out, produced similar results to the TbSTT3A-HA3 pull-out, with TbSTT3A being the only obvious binding partner for TbSTT3B-Myc3 (Fig. 2C). In this case, there was one other significant protein hit (orange cross, Fig. 2C), corresponding to a glucose transporter, but this was different from those seen in Fig. 2B, and also unlikely to be an ER component.
Taken together, these data suggest that TbSTT3A and TbSTT3B form hetero-oligomeric complexes, with no other candidate subunits, although we cannot rule out the presence of low-affinity subunits that might be lost during immunoprecipitation. The data for the SILAC proteomics experiments can be found at ProteomeXchange under entry PXD007236.
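The logic of the SILAC readout can be illustrated with a minimal sketch. It assumes a single global cutoff (mean plus three standard deviations of all log2 heavy/light ratios), which is simpler than the intensity-dependent sigma 3 curves drawn in Fig. 2. The function name `flag_enriched` and the 40 background ratios are hypothetical; only the approximate bait and partner ratios (14:1 and 10:1) come from the text.

```python
import math
import statistics


def flag_enriched(heavy_light_ratios, n_sigma=3.0):
    """Return {protein: log2(H/L)} for proteins lying more than n_sigma
    standard deviations above the mean log2(H/L) of all proteins."""
    log2_ratios = {name: math.log2(ratio)
                   for name, ratio in heavy_light_ratios.items()}
    values = list(log2_ratios.values())
    cutoff = statistics.mean(values) + n_sigma * statistics.stdev(values)
    return {name: value for name, value in log2_ratios.items()
            if value > cutoff}


# Illustrative input only: contaminants scatter around a 1:1 ratio
# (log2 = 0); these 40 background values are invented for the example.
background = [0.8, 0.9, 1.0, 1.1, 1.25] * 8
ratios = {f"contaminant_{i}": r for i, r in enumerate(background)}
ratios["TbSTT3A-HA3"] = 14.0   # bait, ratio as reported in the text
ratios["TbSTT3B"] = 10.0       # partner, ratio as reported in the text

enriched = flag_enriched(ratios)
print(sorted(enriched))
```

Under this simplified global cutoff only the bait and its partner are flagged; marginal (about 1.5-fold) hits of the kind mentioned above would need the intensity-dependent treatment to reach significance.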

Co-immunoprecipitation of TbSTT3A-HA3 and TbSTT3B-Myc3
The results from the SILAC pull-out experiments suggested that TbSTT3A is in a complex with TbSTT3B, and to further test this hypothesis, immunoprecipitation (IP) experiments were performed. Cells from the double-tagged cell line were harvested, washed, and lysed in 0.5% digitonin, and TbSTT3A-HA3 or TbSTT3B-Myc3 was captured from the lysate using anti-HA or anti-Myc magnetic beads. Subsequently, the tagged proteins were detected by Western blotting using anti-HA and anti-Myc antibodies. Wild-type cell lysates (containing no HA- or Myc-tagged genes) were used as a control. The results from the co-IP experiments are shown in Fig. 3. As expected, no bands in the region of TbSTT3A-HA3 or TbSTT3B-Myc3 were seen in the IPs from the control wild-type lysates (Fig. 3, lanes 1, 3, 5, and 7). Also, as expected, TbSTT3A-HA3 and TbSTT3B-Myc3 were detected in the homologous anti-HA IP/anti-HA blot (Fig. 3, lane 2) and anti-Myc IP/anti-Myc blot (Fig. 3, lane 8). Significantly, TbSTT3B-Myc3 can be seen to co-IP with TbSTT3A-HA3 (Fig. 3, lane 6), and TbSTT3A-HA3 can be seen to co-IP with TbSTT3B-Myc3 (Fig. 3, lane 4), confirming their physical association predicted by the SILAC experiment.
In these experiments, the anti-HA IP/anti-HA Western blot signal for TbSTT3A-HA3 is much stronger than the anti-Myc IP/anti-Myc Western blot signal for TbSTT3B-Myc3. Whereas some of this difference may be due to relative antibody affinities, it is also consistent with the higher expression of TbSTT3A in wild-type bloodstream-form trypanosomes at both the mRNA and protein levels (28,35). The co-IP data suggest that a significant proportion of the total TbSTT3B-Myc3 appears in the TbSTT3A-HA3 IP (Fig. 3, compare lanes 6 and 8), whereas only a minority of TbSTT3A-HA3 appears in the TbSTT3B-Myc3 IP (Fig. 3, compare lanes 2 and 4). These data suggest that there may be some high molecular weight complexes made exclusively, or almost exclusively, of TbSTT3A, whereas all, or most, of the TbSTT3B is present in complexes containing TbSTT3A.

Probing peptide acceptor substrate specificities of TbSTT3A and TbSTT3B using a reporter glycoprotein expression system
TbSTT3A is responsible for co-translational transfer of biantennary Man5GlcNAc2 predominantly to N-glycosylation sequons containing and/or flanked by acidic amino acids, whereas TbSTT3B catalyzes post-translational transfer of triantennary Man9GlcNAc2 to the remaining sterically accessible sequons (28). To improve our understanding of the acceptor peptide specificity in T. brucei, an in vivo assay was established using an artificial reporter glycoprotein, based on the TbBiPN system described previously (36). TbBiPN is a non-glycosylated truncated version of TbBiP, retaining its N-terminal signal peptide but lacking its C-terminal ER retention peptide, which enters the ER and is eventually secreted out of the cell via the Golgi apparatus. A pLEW82 expression plasmid (37) was modified to contain the TbBiPN open reading frame fused to a 3′-sequence into which we could insert additional sequences, via AvrII and MfeI restriction sites, immediately upstream of a C-terminal HA3 epitope tag. This construct was used to introduce sequences encoding a single reporter N-glycosylation sequon, flanked by five amino acid residues on each side. These TbBiPN-(XXXXXNXTXXXXX)-HA3 constructs were transformed into bloodstream-form trypanosomes to express the reporter glycoprotein.
We first validated the in vivo reporter assay by introducing TEGLLNATDEIAL and TILKSNYTAEPVR into the TbBiPN construct and expressing them in T. brucei. The former sequence (with a pI of 3.42) is found in VSG MITat1.8 and is known, in that context, to receive exclusively biantennary Man5GlcNAc2 from TbSTT3A (38), whereas the latter (with a pI of 8.3) is found in the ESAG6 subunit of the transferrin receptor and is known not to be recognized and modified by TbSTT3A, and therefore, it receives exclusively triantennary Man9GlcNAc2 from TbSTT3B (5). Aliquots of trypanosome lysates expressing these constructs were treated with and without Endo H or PNGaseF, followed by SDS-PAGE and Western blotting with anti-HA antibodies. The endoglycosidase Endo H can only digest triantennary Man9GlcNAc2 glycans transferred by TbSTT3B, whereas PNGaseF can digest both Man9GlcNAc2 and biantennary Man5GlcNAc2 transferred by TbSTT3A. Thus, distinct digestion patterns, depending on what type(s) of glycan(s) are bound to the sequon asparagine, can be visualized by Western blotting. The in vivo reporter assay faithfully recapitulated the experimental data for the VSG MITat1.8 and ESAG6 glycosylation sites, with TEGLLNATDEIAL- and TILKSNYTAEPVR-containing TbBiPN glycoproteins occupied predominantly by Endo H-resistant and Endo H-sensitive glycans, respectively (supplemental Fig. S2).
Next, we investigated how each position flanking and within the sequon affects TbSTT3A recognition and transfer. First, the neutral sequence AAAAANATAAAAA (pI 6.01) was introduced into the reporter glycoprotein. For this construct, the majority (about 93% as measured by quantitative Licor imaging of the upper and lower bands of the Endo H digests) of the anti-HA binding signal was sensitive to Endo H (Fig. 4A, 1st lane; Table 1). One aspartic acid was then introduced in all 11 possible positions, yielding peptides with the same pI value (3.10), and the proportion of the reporter glycoprotein processed by TbSTT3A (and therefore resistant to Endo H) was measured (Fig. 4A, 2nd and 3rd lanes; Table 1). The quantitative data, derived from two technical replicates of three biological replicates (Table 1), are summarized in Fig. 4B. From these data, it can be seen that the aspartic acid scan across the different positions leads to variation in recognition and glycan transfer by TbSTT3A, with the two positions immediately flanking the Asn residue apparently having the greatest influence and with residues N-terminal to the glycosylation sequon having greater influence than those C-terminal to the sequon.
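The design of the aspartic acid scan, and the way a band pair is turned into a proportion, can be sketched as follows. This is an illustration only: the function names are hypothetical, positions are numbered relative to the sequon Asn (position 0), and the band intensities in the example are invented to match the roughly 93% Endo H-sensitive signal reported for the neutral peptide.

```python
def asp_scan(peptide="AAAAANATAAAAA", asn_index=5):
    """Return (position relative to Asn, variant) for every single
    Ala-to-Asp substitution; the sequon Asn and Thr are left unchanged."""
    variants = []
    for i, residue in enumerate(peptide):
        if residue != "A":
            continue  # skip the Asn and Thr of the N-X-T sequon
        variant = peptide[:i] + "D" + peptide[i + 1:]
        variants.append((i - asn_index, variant))
    return variants


def endo_h_resistant_fraction(upper_band, lower_band):
    """Fraction of the anti-HA signal in the upper (Endo H-resistant)
    band, taken here as the share glycosylated by TbSTT3A."""
    total = upper_band + lower_band
    if total <= 0:
        raise ValueError("band intensities must sum to a positive value")
    return upper_band / total


variants = asp_scan()
for position, sequence in variants:
    print(f"{position:+d}\t{sequence}")

# Invented intensities: 7 units resistant, 93 units sensitive.
print(endo_h_resistant_fraction(7.0, 93.0))
```

The scan yields 11 variants because the 13-mer contains 11 Ala residues: five on each side of the sequon plus the central X position.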

Probing endogenous peptide acceptor substrate specificities of TbSTT3A and TbSTT3B by glycoproteomics
The glycoproteomics data from Ref. 28 were reprocessed and significantly augmented by combining them with data derived from the experiments outlined under "Experimental procedures." These experiments assess whether the endogenous trypanosome N-glycosylation sites are occupied by Endo H-sensitive oligomannose glycans (originating from the action of TbSTT3B) or by Endo H-resistant paucimannose and/or complex glycans (originating from the action of TbSTT3A).
Parasites were osmotically lysed to release >90% of the VSG coat as soluble form VSG, through the action of the endogenous GPI-specific phospholipase C (39,40). The recovered cell ghosts, containing the majority of the non-VSG cellular glycoproteins, were solubilized, denatured, and S-alkylated. This preparation was then processed in two ways. In one approach, the intact glycoproteins were first affinity-purified using immobilized ricin (RCA120) and concanavalin-A (ConA) lectins. The enriched glycoproteins were then sequentially digested with Endo H and PNGaseF (the latter in the presence of H2[18O]), and digested with Lys-C and trypsin. In the other approach the denatured and S-alkylated proteins were first digested with Lys-C and trypsin, and the glycopeptides were trapped with ricin (RCA120) and ConA and subsequently digested with Endo H and PNGaseF (the latter in the presence of H2[18O]). In both cases, the resulting peptides were analyzed by LC-MS/MS, and the data were used to search the T. brucei predicted protein database allowing for the possible presence of Asn-N-GlcNAc residues, the product of Endo H cleavage, and/or for the conversion of Asn residues into [18O]Asp residues, the product of PNGaseF cleavage. These data, and reprocessed data from Izquierdo et al. (28), are shown in supplemental Table S1. Peptides containing Asn-N-GlcNAc and/or [18O]Asp within an Asn-Xaa-Ser/Thr sequon that were detected ≥3 times were assigned as being predominantly TbSTT3A or TbSTT3B substrates when the proportion of the [18O]Asp feature was ≥0.8 and ≤0.4, respectively (supplemental Table S2). We then analyzed the amino acid frequencies immediately adjacent to N-glycosylation sequons of the assigned TbSTT3A and TbSTT3B substrates using WebLogo (41). The enrichment of negatively charged residues is the most striking feature of the TbSTT3A substrates (Fig. 5A) along with a bias toward Thr over Ser in the sequon +2 position. Conversely, the TbSTT3B substrates are relatively enriched for positively charged residues upstream of the sequon but show no preference for Thr over Ser in the +2 position (Fig. 5B). The data were also analyzed by the two-sample logo visualization method (42), which compares two input peptide sequence lists against each other, highlighting features that predominate in each. This suggests enrichment for negatively charged amino acids, especially at positions −6, −3, and +7 for the TbSTT3A substrates and a preference for hydrophobic and positively charged amino acids in the −1 to −5 positions of the TbSTT3B substrates (Fig. 5C). The reprocessed data of Ref. 28 can be found at ProteomeXchange under entry PXD007267 and the new glycoproteomics data under PXD007268.

Figure 4. A, the constructs described in Table 1 were expressed in bloodstream-form trypanosomes, and the resulting TbBiPN-(XXXXXNXTXXXXX)-HA3 reporter glycoproteins were visualized in cell lysates by SDS-PAGE and anti-HA Western blotting. Representative examples are shown for the sequences indicated. The proportions of the anti-HA signals that were sensitive (lower bands) and resistant (upper bands) to Endo H were quantified and are reported in Table 1. B, summary of the quantitative data from Table 1 showing the influence of replacing single neutral Ala residues with an acidic Asp residue at each possible position.
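The site-assignment rule stated above (at least three detections; [18O]Asp proportion ≥0.8 for TbSTT3A, ≤0.4 for TbSTT3B) can be written as a short sketch. It assumes that the proportion is calculated from simple detection counts of the two modified forms, which the text does not specify; the function name and the "mixed" and "unassigned" labels are hypothetical.

```python
def assign_site(n_asn_glcnac, n_18o_asp, min_detections=3,
                stt3a_min=0.8, stt3b_max=0.4):
    """Assign a sequon from detection counts of its two modified forms.

    n_asn_glcnac: detections of Asn-N-GlcNAc (Endo H product, i.e.
                  Endo H-sensitive glycan, attributed to TbSTT3B)
    n_18o_asp:    detections of [18O]Asp (PNGaseF product after Endo H,
                  i.e. Endo H-resistant glycan, attributed to TbSTT3A)
    """
    total = n_asn_glcnac + n_18o_asp
    if total < min_detections:
        return "unassigned"          # too few observations to call
    fraction_18o_asp = n_18o_asp / total
    if fraction_18o_asp >= stt3a_min:
        return "TbSTT3A"
    if fraction_18o_asp <= stt3b_max:
        return "TbSTT3B"
    return "mixed"                   # between the two thresholds


print(assign_site(1, 9))   # mostly [18O]Asp
print(assign_site(8, 2))   # mostly Asn-N-GlcNAc
print(assign_site(2, 3))   # intermediate occupancy
print(assign_site(1, 1))   # fewer than three detections
```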

Building a glycosylation site predictor for T. brucei using machine learning
Machine learning has been successfully used in biological research to infer the peptide recognition specificities of, for example, protein kinases, phosphatases, and Src homology 2 domains (43). The first step of machine learning consists of transforming peptide sequences into biochemical features such as charge, hydrophobicity, and relative positions. These features, organized in a machine-readable template, are then evaluated by artificial intelligence algorithms to highlight which amino acid properties of a peptide sequence are the most important in determining substrate recognition. We therefore decided to apply a machine learning approach to further leverage our glycoproteomics and glycoprotein reporter data and to build an ensemble of prediction algorithms to assign putative N-glycosylation sites in the predicted T. brucei proteome. This ensemble algorithm (Voting Classifier) averages the outputs of a Random Forest classifier (RF), an Extra Tree Classifier (ETC), and a Support Vector Machine (SVM) classifier to predict which putative N-glycosylation sites will more likely be modified by TbSTT3A or TbSTT3B. The RF, ETC, and SVM classifiers all weigh the features derived from the upstream (N-terminal) side of the glycosylation sequon more than the features extracted from the downstream (C-terminal) side (supplemental Fig. S3, A-C). Moreover, the RF, ETC, and SVM classifiers all preferentially use the cumulative negative charge upstream of the glycosylated asparagine, and the hydrophobicity of the peptide upstream and downstream, to discriminate between TbSTT3A and TbSTT3B substrates (supplemental Fig. S4). We could detect a core of five important features shared by the three classifiers, namely the charge at pH 7.3 and the isoelectric point for the sequon ±10 amino acid residues, the cumulative charge upstream from the modified asparagine with a window of 13 and 16 amino acids, and the bonus score derived from the aspartic acid scanning experiment described in Fig. 4 (see under "Experimental procedures"). A list of putative N-glycosylation sites and the Voting Classifier predictions are shown in supplemental Table S3. We performed a two-sample logo visualization on the output (Fig. 5D). As expected, the trends in this plot are similar to those generated from the glycoproteomics data alone (Fig. 5C).
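The feature-extraction step can be illustrated with a minimal sketch covering three of the five core features: the charge of the sequon ±10 residues and the cumulative upstream charge in windows of 13 and 16 residues. Several simplifications are assumptions rather than the method used here: charge is approximated by integer side-chain charges (Asp/Glu −1, Lys/Arg +1, His and termini ignored) instead of a calculated charge at pH 7.3; Pro is excluded from the X position of the sequon by common convention; and the isoelectric point and the Asp-scan bonus score are omitted. All names are hypothetical.

```python
import re

# Integer side-chain charges near neutral pH (simplifying assumption).
CHARGE = {"D": -1, "E": -1, "K": 1, "R": 1}

# N-X-S/T with X != Pro; the lookahead lets overlapping sequons match.
SEQUON = re.compile(r"(?=N[^P][ST])")


def net_charge(segment):
    """Sum of the approximate side-chain charges in a sequence segment."""
    return sum(CHARGE.get(residue, 0) for residue in segment)


def sequon_features(protein, flank=10, upstream_windows=(13, 16)):
    """Return one feature dictionary per putative sequon in a sequence."""
    rows = []
    for match in SEQUON.finditer(protein):
        asn = match.start()  # 0-based index of the sequon Asn
        # Sequon (3 residues) plus up to `flank` residues on each side.
        region = protein[max(0, asn - flank): asn + 3 + flank]
        row = {"asn_position": asn + 1,  # 1-based, as in Asn-263
               "charge_flank": net_charge(region)}
        for window in upstream_windows:
            upstream = protein[max(0, asn - window): asn]
            row[f"charge_upstream_{window}"] = net_charge(upstream)
        rows.append(row)
    return rows


# The two validation peptides from the reporter assay.
acidic = sequon_features("TEGLLNATDEIAL")   # VSG MITat1.8 site
basic = sequon_features("TILKSNYTAEPVR")    # ESAG6 site
print(acidic)
print(basic)
```

Feature rows of this kind could then be passed to an ensemble such as a voting classifier; that step is not shown because the training data and hyperparameters are not given in this section.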

Molecular modeling
Although TbSTT3C (Tb927.5.910) has not been studied in this paper because it is not expressed at detectable levels in bloodstream-form T. brucei, data from its heterologous expression in yeast suggest that its peptide acceptor specificity is much more similar to TbSTT3A than TbSTT3B (28,44). To try to rationalize the preferences of TbSTT3A and TbSTT3C for acceptor sequons containing and/or flanked by negatively charged amino acid residues, we built molecular models of TbSTT3A, TbSTT3B, and TbSTT3C using Phyre2 (45) based on the Campylobacter lari PglB structure (46). The predicted models were aligned with the PglB structure using PDBeFold (47), and the binding pockets of TbSTT3A and TbSTT3B were visualized with Chimera (48). Next, we looked for basic amino acid residues that were conserved in TbSTT3A and TbSTT3C but different in TbSTT3B (supplemental Fig. S6). Of these, the active-site proximal residue 397 is particularly interesting as it is a His residue in TbSTT3B but an Arg residue in TbSTT3A and TbSTT3C. Arginine has a flexible and strongly basic guanidinium cation side chain that could conceivably interact with acidic amino acid residues at or close to the NX(S/T) glycosylation sequon in the active site (Fig. 6). Position 406 contains an Arg residue in TbSTT3A and TbSTT3C (in place of a neutral Gly residue in TbSTT3B) that could also conceivably interact with sequon-adjacent anionic residues in the acceptor peptide (Fig. 6). Such ionic interactions between the acceptor peptide and the enzyme surface might increase the efficiency of substrate recognition and glycosylation of sequons containing and/or flanked by acidic amino acids.

Discussion
Although the T. brucei genome does not encode any identifiable OST subunits, other than three intact and one truncated STT3, we decided to investigate whether there might be novel non-canonical T. brucei OST subunits. Precedents for kinetoplastid-specific subunits of otherwise conserved cellular machineries include clathrin-associating proteins and endocytic components (49,50), exocyst components (50), nuclear pore complex and nuclear lamina components (51), and subunits of the GPI transamidase (52). Blue native gel electrophoresis of gently solubilized epitope-tagged endogenous TbSTT3A showed that it is present in high-molecular-weight complexes, but quantitative SILAC proteomics of tagged TbSTT3A and TbSTT3B pull-outs showed that although these are mutual binding partners (confirmed by co-immunoprecipitation), no other subunits could be found by these methods. The lack of non-canonical or canonical T. brucei OST subunits (other than the STT3 catalytic subunits) is consistent with the ease with which T. brucei and other kinetoplastid STT3s can be functionally expressed in other eukaryotes, like S. cerevisiae and Pichia pastoris (29,31,32,53). Nevertheless, the blue native gel and co-immunoprecipitation experiments show that, at least in the native environment of a bloodstream-form trypanosome, TbSTT3A associates with itself and with the less-abundant TbSTT3B to form complexes with apparent molecular masses between 700 kDa and 1.2 MDa. The nature of these complexes remains to be determined but has implications for how and whether TbSTT3s associate with the parasite translocon complex and how they access nascent glycoprotein sequons during and/or following protein translocation. The dimer/oligomer nature of yeast OST subunits, including Stt3, has been previously described (54).
To probe TbSTT3 peptide acceptor substrate specificities, we developed an artificial glycoprotein reporter system, based on a truncated version of TbBiP (36) fused to a single glycosylation sequon flanked by five variable residues on either side. Constructs were expressed in bloodstream-form T. brucei, and their products were assayed for the relative proportions of N-glycosylation by TbSTT3A and TbSTT3B. With this, we were able to recapitulate the preferential N-glycosylations of native peptide acceptor sequences. We then applied the system to analyze the N-glycosylation of an artificial 13-mer sequence (AAAAANATAAAAA) into which we sequentially introduced a single Asp residue in all 11 possible Ala sites. The data clearly confirmed that the presence of an acidic amino acid proximal to the sequon significantly increased its N-glycosylation by TbSTT3A, with the −1 and +1 positions relative to the N-glycosylated Asn residue having the greatest effect and the positions N-terminal to the sequon having a greater effect than those C-terminal to the sequon.
We then created a richer glycoproteomics dataset than we previously reported (28) by combining two alternative approaches: (i) glycoprotein enrichment by lectin affinity chromatography, followed by trypsin digestion and sequential Endo H and PNGaseF digestion; and (ii) tryptic glycopeptide enrichment by lectin affinity chromatography, followed by sequential Endo H and PNGaseF digestion. In both cases, the PNGaseF digestion step was performed in H2[18O] to distinguish between PNGaseF-mediated Asn deamidation and non-enzymatic deamidation during sample preparation and handling. These data were combined with reprocessed raw data from Ref. 28 to provide quantitative data on TbSTT3A and TbSTT3B N-glycosylation of 141 unique N-glycosylation sites. Logo plots confirmed the enrichment of acidic amino acid residues (Asp and Glu) surrounding TbSTT3A N-glycosylated sequons, the general depletion of hydrophobic residues, and the selective depletion of basic residues (Arg and Lys) N-terminal to the sequon.
We also used hypothesis-free machine-learning techniques to identify features that predispose sequons to be preferentially modified by TbSTT3A or TbSTT3B, and finally, we combined these features and parameters derived from the experimental reporter glycoprotein data to develop a Voting Classifier prediction algorithm. This predictor was then applied to all the putative N-glycosylation sequons in the T. brucei proteome to predict those sites preferentially modified by TbSTT3A (leading to paucimannose and/or complex N-glycan occupancy) or TbSTT3B (leading to oligomannose N-glycan occupancy) in bloodstream-form trypanosomes. Two-sample logo plot analysis of the output (a total of 1291 predicted occupied N-glycosylation sites) largely echoes the experimental glycoproteomic and reporter glycoprotein data and implies that the TbSTT3A N-glycosylation sites are enriched for acidic residues and depleted of basic and hydrophobic residues, with the effects of these features more profound to the N-terminal side and within the sequon than to the C-terminal side of the sequon.
It is important to note that available pulse-chase data (21, 55) suggest that TbSTT3A modifies VSG glycoproteins co-translationally, whereas TbSTT3B can act post-translationally (25), and that TbSTT3A and TbSTT3B knockdown data suggest that TbSTT3B is able to modify TbSTT3A sites, but not vice versa (28). Thus, all or most of the aforementioned TbSTT3A versus TbSTT3B sequon selectivity features are dictated by the peptide/sequon acceptor specificity of TbSTT3A and not TbSTT3B, which appears to be able to utilize sequons in almost any amino acid sequence context. This property of TbSTT3B could be particularly useful from a biotechnological point of view to boost the efficient N-glycosylation of recombinant glycoproteins in eukaryotic expression systems. It also nicely explains why trypanosomes can transition from expressing a rich mixture of oligomannose and paucimannose/complex N-glycans in the bloodstream-form parasite (16,18,19) to predominantly oligomannose N-glycans in the procyclic form of the parasite (12, 20) by simply down-regulating the expression of TbSTT3A, as observed at both the mRNA (28,56) and protein levels (35,57,58).
The results reported here and in Ref. 44 are consistent with the mechanisms of resistance in T. brucei to certain toxic lectins and carbohydrate-binding small molecules reported in an interesting series of studies (59-61). These workers demonstrated that the parasites could escape the effects of these trypanocidal agents, all of which bind principally to oligomannose N-glycans, by either switching to the expression of a VSG type that naturally does not carry oligomannose N-glycans or by suppressing the expression of TbSTT3B. In this way, the parasites effectively exchange oligomannose for paucimannose and complex N-glycans that are poor ligands for the trypanocides.
Interestingly, TbSTT3A and TbSTT3B are found in tandem array in the trypanosome genome, together with a preceding truncated TbSTT3 pseudogene and followed by a full-length TbSTT3C gene that is more similar to TbSTT3B than to TbSTT3A. However, TbSTT3C is not significantly expressed in the bloodstream form or the procyclic form of the parasite (28, 35, 56-58). Nevertheless, transgenic expression of TbSTT3C in S. cerevisiae clearly shows that it is a functional OST with a similar preference for sequons flanked by acidic amino acids to TbSTT3A but an LLO donor specificity like TbSTT3B (28,44). Amino acid sequence alignment of TbSTT3A, -B, and -C and molecular model building, based on the C. lari PglB structure (46), were performed. The models suggest that the presence of a large, flexible and highly positively charged Arg residue side chain (Arg-397) very close to the active site of the enzyme in TbSTT3A and TbSTT3C, compared with a His residue in TbSTT3B, may play a role in the selectivity of TbSTT3A and TbSTT3C for sequons containing and flanked by acidic Asp and Glu residues. TbSTT3A and TbSTT3C also contain Arg residues in place of neutral Gln-567 and Gly-406 residues in TbSTT3B, locations close enough to the active site to potentially interact with sequon-flanking anionic residues. The accompanying paper (44) elegantly addresses the issues of peptide acceptor and LLO donor specificities of all three TbSTT3s by heterologous expression of each and chimeras thereof in various yeast mutants. The accompanying paper concludes that the region containing Arg-397 and Arg-406 in TbSTT3A and TbSTT3C controls peptide acceptor specificity, and this is consistent with our suggestions from molecular modeling.
There are similarities and differences between the multisubunit mammalian STT3A- and STT3B-based OSTs and the single subunit TbSTT3A and TbSTT3B OSTs of T. brucei. (i) In both, the STT3A OSTs operate co-translationally and get the first option to glycosylate a given sequon, whereas the STT3B OSTs can operate post-translationally on what is left (25,62). (ii) In both, the OSTs show differences in peptide acceptor substrate specificity. However, in T. brucei this is controlled by the physicochemical properties of the amino acids surrounding the acceptor sequon, whereas in mammalian cells this is controlled by the position of the sequon relative to the C terminus of the protein (63) or proximity to the signal peptide-cleaved N terminus of the protein and to cysteine residues (23,63,64). The latter appears to relate to the presence of the mutually redundant MagT1 or TUSC3 thioredoxin-like oxidoreductase subunits (equivalent to the yeast Ost3 and Ost6 subunits) in the STT3B OST that may form mixed disulfides with the sequon-proximal cysteine residues and thus increase residence time with the STT3B OST (64,65). A role for oxidoreductase activities of Ost3 and Ost6 in yeast OST acceptor site specificity was also previously shown (66). In this regard, it is worth noting that TbSTT3B and TbSTT3C contain a CXC motif (absent in TbSTT3A) that is predicted from the C. lari OST structure to be proximal to the acceptor peptide (46). Such CXC sequences can have a disulfide isomerase activity (67) that might conceivably increase the acceptor substrate range of TbSTT3B and TbSTT3C. (iii) Whereas both STT3A and STT3B OSTs prefer the mature Glc3Man9GlcNAc2-PP-dolichol LLO donor, the T. brucei OSTs have distinct LLO donor specificities such that the presence of the ALG12-dependent c-branch of the conventional (but glucose-free) triantennary Man9GlcNAc2-PP-dolichol LLO is required by TbSTT3B but not tolerated by TbSTT3A (27,44).
An important consequence of this differential LLO specificity is that, because Golgi mannosidase II activity is also absent in T. brucei, N-glycans derived from TbSTT3B glycosylation cannot be processed to paucimannose or complex structures, which must instead be derived exclusively from TbSTT3A glycosylation.

N-Glycosylation in Trypanosoma brucei
In summary, the two simultaneously operating acceptor substrate-specific and donor substrate-specific N-glycosylation systems of bloodstream-form T. brucei have been further characterized in this paper. Whereas no canonical OST subunits, other than catalytic STT3 subunits, could be found in the parasite genome, we can now confirm that there are no noncanonical subunits either. Instead, TbSTT3A and TbSTT3B appear to form multimeric high-molecular-weight complexes containing either TbSTT3A alone or TbSTT3A and TbSTT3B. Further insights into the peptide acceptor specificity of TbSTT3A have been provided, and an algorithm has been generated to predict, proteome-wide, which OST will likely operate on which putative N-glycosylation site. Taken together with the unusual specificities of T. brucei UGGT, GnTI, and GnTII enzymes described in the Introduction (19,33,34), and the apparent absence of a regulated ER unfolded protein response (19,68), we may conclude that protein N-glycosylation and downstream processing in this divergent eukaryote is worthy of note and that its unusual features may provide therapeutic possibilities.

Generation of genetically modified trypanosomes with in situ tagged TbSTT3A-HA3 and TbSTT3B-Myc3
The TbSTT3A,B,C −/+ heterozygote described in Ref. 28 was used for C-terminal HA3 in situ tagging of the remaining TbSTT3A allele using a pMOTagH4 plasmid and C-terminal Myc3 in situ tagging of the remaining TbSTT3B allele using a pMOTag4M4 plasmid (70). For the pMOTagH4 plasmid, the 1328 bp from the C terminus of TbSTT3A and the 1057 bp 3′-UTR downstream of the gene ORF were PCR-amplified from genomic DNA using Kod Hot Start polymerase with primers 5′-ataagtatctcgagcaagtttgcttgccccgttcg-3′ and 5′-ataagtaactcgagctcgctctgaaaatacaggttttcgacttcgtaatggaaccgcttcgct-3′, and 5′-ataagtatggatccccacatcgtttcaatcgccgc-3′ and 5′-ataagtaaggatccactcacaatcgtgcttacagcc-3′, as forward and reverse primers, respectively. The PCR products were cloned into the plasmid using the XhoI and BamHI restriction sites (underlined). A tobacco etch virus protease cleavage site was included downstream of the ORF of TbSTT3A (italics). The construct was linearized before being transfected into the TbSTT3A,B,C −/+ heterozygote cell line, and transfected cells were selected by addition of hygromycin. The pMOTag4M4 plasmid was ordered from GenScript. It included 1032 bp from the C terminus of TbSTT3B, located upstream of the Myc3 epitope in the plasmid. The plasmid also included 835 bp of the 3′-UTR of TbSTT3B, which were located downstream of the blasticidin resistance gene in the plasmid. The construct was linearized before being transfected into the TbSTT3A,B,C −/+ TbSTT3A-HA3 heterozygote cell line, and transfected cells were selected by addition of blasticidin.

SDS-PAGE and Western blotting
Reducing SDS-PAGE was run using pre-cast Novex BisTris gels with MOPS running buffer (Invitrogen). Proteins were transferred to nitrocellulose using an iBlot system (Invitrogen) and stained with Ponceau S (Sigma) before being blocked in 50 mM Tris-HCl, 0.15 M NaCl, 0.05% Tween 20, 0.25% BSA, 0.05% sodium, and 2% fish skin gelatin, pH 7.4, for 20 min. The membrane was then incubated for 30 min with primary antibody in a 50-ml Falcon tube followed by washing using a SnapID system (Millipore). Subsequently, the labeled secondary antibody was incubated for 30 min followed by a washing step. The blots were imaged using an ODYSSEY SA near infrared imager (LI-COR Biosciences). Secondary LI-COR antibodies (IRDye-800CW goat anti-mouse 1:15,000 or IRDye-680RD donkey anti-mouse 1:20,000) were used to bind the primary mouse anti-HA and anti-Myc antibodies.

Blue native gels and Western blotting
Blue native gel electrophoresis was run using components from the native PAGE kit (Invitrogen). The protocol followed the manufacturer's instructions except that no G-250 was added to the sample buffer and the 1× native PAGE Light Cathode buffer was diluted 1:4 in 1× running buffer to reduce Coomassie interference with the post-blotting LI-COR imaging.

Tris-HCl, pH 6.8, 20 mM EDTA plus the protease inhibitors 0.8 mM PMSF, 0.1 mM TLCK, and 1× EDTA-free protease inhibitor mixture (Roche Applied Science). Subsequently, the lysate was centrifuged (4°C, 11,000 × g, 15 min), and the supernatant was moved to a new tube. Protein G magnetic beads, prewashed in lysis buffer, were added to the lysate (30 min, 4°C) and captured to absorb non-specific binding components. The lysate was moved to a new tube followed by incubation for 1 h with anti-HA or anti-Myc antibody (1 μg/ml) at 4°C followed by fresh pre-washed magnetic beads. The beads were captured and washed twice with 1 ml of lysis buffer and once with 10 mM Tris-HCl, pH 6.8, 4 mM EDTA, 0.1% digitonin containing the same protease inhibitors. Proteins were eluted from the beads in 30 μl of reducing SDS-sample buffer with heating (100°C for 10 min). The eluted proteins were subsequently separated by SDS-PAGE.

SILAC proteomics
Heavy and light labeled cells were harvested separately (15 min, 800 × g, 4°C) and washed and resuspended in trypanosome dilution buffer (20 mM Na2HPO4, 2 mM NaH2PO4, 80 mM NaCl, 5 mM KCl, 20 mM glucose) for cell counting. The cells were mixed 1:1 before undergoing immunoprecipitation, as described above. The eluted proteins in 25 μl of reducing SDS sample buffer were S-alkylated with 5 μl of 300 mM iodoacetamide (30 min, dark), loaded on a Novex NuPAGE 4-12% BisTris gel, and run at 200 V using MOPS buffer until the proteins had migrated about 2 cm into the gel (visualized by Simply Blue Safe Stain, Thermo Fisher Scientific). The protein-containing region of the gel was excised and subjected to in-gel trypsin digestion, and aliquots of the extracted peptides were analyzed on an LTQ-Orbitrap Velos Pro mass spectrometer coupled with a Dionex Ultimate 3000 RS HPLC system (Thermo Fisher Scientific). The sample peptides were loaded at 5 μl/min onto a trap column (100 μm × 2 cm, PepMap nano-Viper C18 column, 5 μm, 100 Å, Thermo Fisher Scientific) equilibrated in 98% buffer A (2% acetonitrile and 0.1% formic acid (v/v)) and 2% buffer B (80% acetonitrile and 0.08% formic acid (v/v)). The trap column was washed for 3 min at the same flow rate and then switched in-line with a Thermo Fisher Scientific resolving C18 column (75 μm × 50 cm, PepMap RSLC C18 column, 2 μm, 100 Å). The peptides were eluted from the column at a constant flow rate of 300 nl/min with a linear gradient from 98% buffer A to 40% buffer B in 128 min and then to 98% buffer B by 130 min. The LTQ-Orbitrap Velos Pro was used in data-dependent mode. A scan cycle comprised an MS1 scan (m/z range from 335 to 1800) in the Orbitrap (resolution 60,000) followed by 15 sequential data-dependent collision-induced dissociation MS2 scans (the threshold value was set at 5000 and the minimum injection time was set at 200 ms).

Glycoproteomics
The protein-based approach was based on the methodology described previously (28). Cells were harvested by centrifugation and osmotically lysed at 3.5 × 10⁸ cells/ml for 5 min at 37°C in the presence of 0.1 M TLCK, 1 mM benzamidine, 1 mM phenylmethylsulfonyl fluoride (PMSF), 1 μg/ml leupeptin, 1 μg/ml aprotinin, and Phosphatase Inhibitor Mixture II (Calbiochem). Cell ghosts from a total of 3.5 × 10⁹ trypanosomes, enriched for non-VSG cellular glycoproteins, were collected by centrifugation (16,000 × g, 15 min, 4°C) and solubilized in 250 μl of detergent buffer (4% SDS, 0.1 M DTT, 0.1 M Tris-HCl, pH 7.5) using probe sonication for 30 s before and after heating to 85°C for 20 min. S-Alkylation was performed by mixing with 250 μl of 8 M urea, 0.5 M iodoacetamide, 0.1 M Tris-HCl, pH 8.5, for 1 h at room temperature in the dark. Unreacted iodoacetamide was quenched by the addition of 10 μl of the detergent buffer. After centrifugation (16,000 × g, 15 min), the supernatant was transferred to a filtration device with a 30-kDa molecular cutoff (Sartorius), and the majority of the detergent was removed by diafiltration with 8 M urea, 0.1 M Tris-HCl, pH 8.5, followed by lectin-binding buffer (1 mM CaCl2, 1 mM MnCl2, 150 mM NaCl in 40 mM Tris-HCl, pH 7.4). An aliquot (400 μg of protein) was mixed with a 150-μl packed volume of ricin (RCA120)-agarose beads (Vector Laboratories) and rotated gently for 2 h. The beads were washed with lectin-binding buffer and eluted with 300 μl of 30 mg/ml lactose and 30 mg/ml galactose (Sigma) in 40 mM Tris-HCl, pH 7.4. After overnight incubation, the supernatant containing the eluted glycoproteins was collected by centrifugation (10,000 × g for 10 min). ConA-coupled agarose beads (0.15 ml packed volume, Vector Laboratories) were added to the recovered supernatant from the RCA120 pulldown and incubated at 4°C overnight.
The beads were washed with lectin-binding buffer and eluted with 300 μl of 0.5 M methyl-α-D-mannopyranoside (Sigma) in 40 mM Tris-HCl, pH 7.4. After gentle rotation for 2 h, the supernatant containing glycoproteins was recovered by centrifugation. The RCA120 and ConA-eluted glycoproteins were mixed, transferred to a 30-kDa filter, the buffer exchanged with 25 mM ammonium acetate, pH 5.5, and subjected to digestion with 180 milliunits of Endo H (Roche Applied Sciences) overnight at 37°C. The Endo H-released glycans were removed by centrifugation, and the remaining material was exchanged into 40  After overnight incubation at 37°C, the PNGaseF-released glycans were removed by centrifugation, and the deglycosylated proteins were diafiltered into 40 mM ammonium bicarbonate and subsequently digested with a mixture of 1:100 (enzyme/substrate) Lys-C and 1:20 trypsin (Roche Applied Sciences) at 37°C for 48 h. The peptides were collected by centrifugation through the filter, dried in a Speedvac, and desalted using a ZipTip C18 micro-column (10 μl, Merck Millipore) prior to liquid chromatography tandem mass spectrometry (LC-MS/MS).
An alternative (glyco)peptide-based approach was also performed, based on the method of Ref. 71. An aliquot of the denatured and S-alkylated sample (400 μg) was first digested on a 10-kDa filter with a mix of Lys-C and trypsin, as described above. The (glyco)peptides were collected by centrifugation through the filter in lectin-binding buffer and were mixed with lectin solution containing a mixture of ConA and RCA120, resulting in mixtures of (glyco)peptides and lectins with a mass proportion of 1:2. After gentle rotation for 2 h, the mixtures were transferred to a 30-kDa filter, and the lectin-captured glycopeptides were diafiltered into 25 mM ammonium acetate, pH 5.5, and subjected to Endo H digestion. After overnight incubation at 37°C, the Endo H-released peptides (Asn N-GlcNAc residue) were collected by gentle centrifugation. The peptides containing Endo H-resistant glycans still bound to lectins were diafiltered into 40 mM ammonium bicarbonate buffer in H2[18O] and digested with PNGaseF overnight at 37°C. The de-glycosylated peptides were collected by centrifugation, mixed with the Endo H-released fraction, dried in a Speedvac, and desalted using ZipTip C18 prior to LC-MS/MS, which was performed as described above.

LC-MS/MS analysis and data processing
For glycoproteomics, the LC was performed on a fully automated Ultimate U3000 Nano LC System (Dionex) fitted with a C18 trap (PepMap nanoViper, Thermo Fisher Scientific) and resolving columns (PepMap RSLC) with inner diameters of 100 and 75 μm and lengths of 2 and 50 cm, respectively. Mobile phases consisted of 0.1% formic acid (Sigma) in 2% acetonitrile (Merck, Darmstadt, Germany) for solvent A and 0.08% formic acid in 80% acetonitrile for solvent B. Samples were loaded in solvent A. A linear gradient was set as follows: 0% B for 5 min and then a gradient up to 40% B in 122 min and to 98% B in 10 min. A 20-min wash at 98% B was used to prevent carryover, and a 20-min equilibration with 2% B completed the gradient. The LC system was coupled to an LTQ-Orbitrap Velos mass spectrometer (Thermo Fisher Scientific) equipped with an Easy spray ion source and operated in positive ion mode. The spray voltage was set to 2 kV and the ion transfer tube at 250°C. The full scans were acquired in a Fourier transform MS mass analyzer that covered an m/z range of 335-1800 at a resolution of 60,000. The MS/MS analysis was performed under data-dependent mode to fragment the top 15 precursors using collision-induced dissociation. A normalized collision energy of −35 eV, an isolation width of m/z 2.0, an activation Q value of 0.250, and a time of 100 ms were used. The raw files were converted to mgf format by MSConvert software from ProteoWizard (proteowizard.sourceforge.net). The searches were carried out against the T. brucei 927 annotated proteins database (version 8.0, downloaded from TriTrypDB (72), www.tritrypdb.org/) using Mascot software (version 2.4.0, Matrix Science Inc., Boston). The search parameters for Mascot software were set as follows: peptide tolerance, 5 ppm; MS/MS tolerance, 0.5 Da; enzyme, trypsin; one missed cleavage allowed; and fixed carbamidomethyl modifications of cysteines.
Oxidation of methionine, N-acetylglucosamine modification of Asn, and deamidation of Asn to Asp containing a single 18O atom (2.9890 Da mass increase) were used as variable modifications.

Dataset extraction
We extracted from the Mascot result files all the peptides with an ion score of >20. From this we selected 350 glycosylated sites with a deamidation (Asn to [18O]Asp conversion) or Asn-N-HexNAc modification embedded in the N[^P](S/T) consensus. The list was used to count the number of times (≥3) that these changes were detected for each peptide. To increase the number of modified peptides, we re-processed a dataset previously published by our laboratory (28). This dataset was re-searched with the same Mascot parameters and database used for this publication. This made it possible to include 14 new peptides preferentially HexNAc-modified and to compile a list of 186 peptides used for the next phase of machine learning implemented in Python with the scikit-learn package (73).
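As an illustration of the sequon consensus used above, the following minimal Python sketch locates N-X-S/T motifs (X not Pro) in a protein sequence and returns the residues flanking each Asn. It is not the authors' extraction code; the function name `find_sequons` and the `flank` parameter are assumptions introduced here for illustration.

```python
import re

# Zero-width lookahead so that overlapping sequons (e.g. in "NNST") are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(protein_seq, flank=10):
    """Return (1-based Asn position, flanking window) for each N-X-S/T sequon, X != Pro.

    The window is truncated at the protein termini rather than padded.
    """
    hits = []
    for match in SEQUON.finditer(protein_seq):
        i = match.start()  # 0-based index of the sequon Asn
        left = protein_seq[max(0, i - flank):i]
        right = protein_seq[i + 1:i + 1 + flank]
        hits.append((i + 1, left + "N" + right))
    return hits


if __name__ == "__main__":
    # Toy sequence: NDS and the overlapping NNS/NST match; NPT does not.
    for position, window in find_sequons("MKNDSAANPTGGNNSTK", flank=2):
        print(position, window)
```

The lookahead pattern is a design choice: a plain `N[^P][ST]` search would consume the matched residues and could miss a second sequon that overlaps the first.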
The data set reported here and that from Ref. 28 identified 170 common and 180 and 155 unique glycosylation sites, respectively (supplemental Fig. S1A). To check the consistency of the two datasets, we selected 92 peptides that were observed ≥4 times in both datasets and plotted the frequencies of the Asn-N-HexNAc and Asn to [18O]Asp modifications. The two datasets had a good correlation (r² = 0.83) (supplemental Fig. S1B). However, the experimental procedures used to generate the 2009 dataset (i.e. without the use of H2[18O]) cannot discriminate between the spontaneous non-enzymatic deamidation of asparagine to aspartic acid versus the PNGaseF-mediated deamidation produced during the cleavage of the N-glycan. From the linear regression, we could deduce that non-enzymatic deamidation contributed a significant amount (about 18%) of the total deamidation seen in the 2009 dataset (supplemental Fig. S1B). For this reason, we only used Asn-N-HexNAc containing (TbSTT3B substrate) peptides (with a frequency of ≥0.6) from this dataset to augment our machine-learning training set.

Machine learning
The deamidation proportion was computed for each peptide as DC/(DC + HC), where DC is the Deamidation Count (i.e. the number of times a peptide with [18O]Asp is detected), and HC is the HexNAc Count (i.e. the number of times a peptide with Asn-N-HexNAc is detected). This score was used to classify each peptide as preferentially deamidated (score >0.8, n = 70) or preferentially HexNAc-modified (score <0.3, n = 56). This dataset was used to extract sequence-based features from 10 amino acids before and after the glycosylated asparagine with the ASAP package in Python (74). We also added some in-house features derived from the knowledge of the TbSTT3A recognition and transfer experiment reported in Fig. 4B and Table 1. To this end we created "Bonus Features" for each residue position probed in that experiment based on the increased transfer efficiency observed when that site is occupied by an Asp (or, by inference, a Glu residue). We also included a "Bonus All" feature that summed all of the bonus scores for the peptide when it contained more than one Asp and/or Glu residue and a "Bonus Max" feature that selected only the highest bonus score in such cases. Finally, we also created "Bonus Presence D" and "Bonus Presence E" features that simply recorded the presence or absence of Asp or Glu, respectively, at each residue location. We then developed three machine learning algorithms: a random forest classifier (RFC), an extra trees classifier (ETC), and a support vector machine classifier (SVM). The predictors were used to extract the importance of all the features and to rank the features with recursive feature elimination and cross-validated selection of the best number of features (RFECV methodology). The selected features were used to train the three machine learning algorithms (RFC, ETC, and SVM) that were further optimized using a Bayesian global optimization with Gaussian processes (https://github.com/fmfn/BayesianOptimization).
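The scoring and "Bonus" features described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the thresholds are read from the text as proportions, the function names are introduced here, and the per-position values in `EXAMPLE_BONUS` are hypothetical placeholders (the real values derive from the reporter data in Fig. 4B and Table 1).

```python
# Thresholds as read from the text (assumed to be proportions, not percentages).
DEAMIDATED_MIN = 0.8
HEXNAC_MAX = 0.3

# Hypothetical bonus per sequon-relative offset (Asn = 0). Placeholder values only;
# the real ones come from the reporter glycoprotein experiment.
EXAMPLE_BONUS = {-2: 0.5, -1: 1.0, 1: 2.0, 3: 0.5}


def deamidation_proportion(dc, hc):
    """Return DC / (DC + HC) for one site."""
    total = dc + hc
    if total == 0:
        raise ValueError("site has no deamidation or HexNAc observations")
    return dc / total


def classify_site(dc, hc):
    """Label a site, or return None if it falls between the two thresholds."""
    score = deamidation_proportion(dc, hc)
    if score > DEAMIDATED_MIN:
        return "deamidated"  # Endo H-resistant glycan, TbSTT3A-type site
    if score < HEXNAC_MAX:
        return "hexnac"      # Endo H-sensitive glycan, TbSTT3B-type site
    return None


def bonus_features(window, asn_index, bonus=EXAMPLE_BONUS):
    """Compute 'Bonus All' and 'Bonus Max' from Asp/Glu at the probed offsets."""
    scores = [
        value
        for offset, value in bonus.items()
        if 0 <= asn_index + offset < len(window)
        and window[asn_index + offset] in "DE"
    ]
    return {"bonus_all": sum(scores), "bonus_max": max(scores, default=0.0)}
```

Sites with intermediate scores return `None` and would be left out of a training set built this way, which matches the two-class labeling described in the text.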
The optimized machine learning algorithms were combined in a voting classifier to produce our final predictor. The ability of the developed classifiers (RFC, ETC, SVM, and Voting Classifier) to discriminate between the deamidated or HexNAc-modified peptides was assessed with the area under the curve of the receiver-operating characteristic curve. The area under the curve score was computed 100 times with a 5-fold cross-validation, using each time a different random split of the original dataset (supplemental Fig. S5).
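The evaluation logic can be illustrated with a dependency-free Python sketch. It assumes soft voting (averaging class probabilities), which the text does not specify, and uses made-up probabilities for three classifiers; the helper names are introduced here for illustration. In practice, scikit-learn provides `VotingClassifier`, `roc_auc_score`, and `StratifiedKFold` for these steps.

```python
import random
from statistics import mean


def soft_vote(prob_lists):
    """Average the per-sample positive-class probabilities across classifiers."""
    return [mean(sample_probs) for sample_probs in zip(*prob_lists)]


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute an AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def kfold_indices(n_samples, k=5, seed=0):
    """Split sample indices into k folds after a seeded random shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


if __name__ == "__main__":
    # Toy data: 1 = deamidated, 0 = HexNAc-modified; probabilities are invented.
    labels = [1, 1, 0, 0]
    rfc = [0.9, 0.4, 0.5, 0.1]
    etc = [0.8, 0.6, 0.3, 0.2]
    svm = [0.7, 0.5, 0.6, 0.3]
    print(roc_auc(labels, rfc))
    print(roc_auc(labels, soft_vote([rfc, etc, svm])))
    # One random 5-fold split; repeating with seeds 0..99 gives 100 different splits.
    print([len(fold) for fold in kfold_indices(126, k=5, seed=0)])
```

Repeating the split with a different seed each time is one way to obtain the distribution of AUC scores described above.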