The N-glycan structures of the antigenic variants of chlorovirus PBCV-1 major capsid protein help to identify the virus-encoded glycosyltransferases

The chlorovirus Paramecium bursaria chlorella virus 1 (PBCV-1) is a large dsDNA virus that infects the microalga Chlorella variabilis NC64A. Unlike most other viruses, PBCV-1 encodes most, if not all, of the machinery required to glycosylate its major capsid protein (MCP). The structures of the four N-linked glycans from the PBCV-1 MCP consist of nonasaccharides, and similar glycans are not found elsewhere in the three domains of life. Here, we identified the roles of three virus-encoded glycosyltransferases (GTs) that have four distinct GT activities in glycan synthesis. Two of the three GTs were previously annotated as GTs, but the third GT was identified in this study. We determined the GT functions by comparing the WT glycan structures from PBCV-1 with those from a set of PBCV-1 spontaneous GT gene mutants resulting in antigenic variants having truncated glycan structures. According to our working model, the virus gene a064r encodes a GT with three domains: domain 1 has a β-l-rhamnosyltransferase activity, domain 2 has an α-l-rhamnosyltransferase activity, and domain 3 is a methyltransferase that decorates two positions in the terminal α-l-rhamnose (Rha) unit. The a075l gene encodes a β-xylosyltransferase that attaches the distal d-xylose (Xyl) unit to the l-fucose (Fuc) that is part of the conserved N-glycan core region. Last, gene a071r encodes a GT that is involved in the attachment of a semiconserved element, α-d-Rha, to the same l-Fuc in the core region. Our results uncover GT activities that assemble four of the nine residues of the PBCV-1 MCP N-glycans.

Syngen 2-3 are referred to as Osy viruses, those that infect Chlorella heliozoae SAG 3.83 are referred to as SAG viruses, and those that infect Micractinium conductrix Pbi are referred to as Pbi viruses. The prototype chlorovirus Paramecium bursaria chlorella virus 1 (PBCV-1) is an NC64A virus (8), and its 330-kb genome is predicted to encode 416 proteins (9). The major capsid protein (MCP or Vp54) is encoded by the gene a430l, and its gene product has a predicted molecular mass of 48,165 Da, which increases to 53,790 Da due to N-glycosylation at four positions (10,11).
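As a quick consistency check on these figures, the mass added by glycosylation can be computed directly from the two values quoted above. The sketch below is illustrative only; the per-site average assumes, for simplicity, that the added mass is shared equally among the four sites, which ignores the glycoform heterogeneity described in the text.

```python
# Masses quoted in the text, in daltons.
UNGLYCOSYLATED_MCP_DA = 48_165
GLYCOSYLATED_MCP_DA = 53_790
N_GLYCOSYLATION_SITES = 4


def glycan_mass_summary(bare_da, glycosylated_da, sites):
    """Return (total mass added by glycans, naive average mass per site)."""
    total = glycosylated_da - bare_da
    # Assumption: equal share per site; real glycoforms differ in size.
    return total, total / sites


total_da, per_site_da = glycan_mass_summary(
    UNGLYCOSYLATED_MCP_DA, GLYCOSYLATED_MCP_DA, N_GLYCOSYLATION_SITES
)
print(f"Mass added by glycosylation: {total_da} Da")
print(f"Naive average per glycan: {per_site_da:.2f} Da")
```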
The predominant glycoform of the four PBCV-1 N-glycans is a nonasaccharide (Fig. 1, WT) (11) with several unusual structural features, including the following: (i) it is not linked to the typical Asn-X-(Thr/Ser) sequon; (ii) it is highly branched; (iii) β-glucose (Glc) forms the N-linkage with Asn; (iv) fucose (Fuc) is substituted at all available positions; and (v) there are two rhamnose (Rha) residues with opposite configurations plus an L-Rha capped with two O-methyl groups. Two monosaccharides, arabinose (Ara) and mannose (Man), occur as nonstoichiometric substituents, resulting in the existence of four glycoforms (Fig. 1). Currently, structures of this type are unknown elsewhere in the three domains of life.
The genes encoding these GTs are scattered throughout the genome of the virus, and PSORT predicted that most of these proteins are soluble and located in the cytoplasm; a473l and a219/222/226r are predicted to encode GTs with six and nine transmembrane domains, respectively, whereas the ngLOC program predicts that A075L is located in the chloroplast, but only at a 15% confidence level.
Thus, PBCV-1 encodes six GTs, a small number when compared with the complexity of its N-glycans. One solution to this paradox is that one or more of the six GTs might have multiple functional domains, making this restricted repertoire of enzymes sufficient to synthesize these structures. A second possibility is that some other virus genes encode enzymes that do not match GTs in the databases and so are overlooked during the searches. Finally, we cannot exclude the possibility that a host-encoded GT(s) could be involved in the glycosylation process.
The protein encoded by gene a064r is of special interest. This protein is encoded by PBCV-1 but not by any other chlorovirus for which the glycan structure is known (12), suggesting that the a064r gene product is involved in the synthesis of a structural motif of the N-glycan not shared with the other chloroviruses (14-16). In addition, several spontaneous mutants (or antigenic variants) in a064r have been isolated, and their MCPs migrate faster on SDS-PAGE than the MCP of PBCV-1, suggesting that the glycan portion of the capsid protein is altered.
These PBCV-1 spontaneous mutants are divided into six antigenic classes, each denoted with a letter, based on their differential reaction to five different polyclonal antibodies (12,17). Each class reacts specifically with one polyclonal antibody preparation, except for Class E, which cross-reacts with Class A and B polyclonal antibodies, suggesting that the phenotype of Class E variants shares some of the structural features of the other two variants. Notably, all six antigenic classes, except for C, have a mutation in the a064r gene (Table 1 and Fig. 1).
For the reasons cited above and elsewhere (12,18), we hypothesize that the machinery for PBCV-1 glycosylation is encoded primarily, if not totally, by the virus. Furthermore, glycosylation of the PBCV-1 MCP occurs in the cytoplasm. The current study addresses two issues related to PBCV-1 glycosylation. First, the structures of the glycans attached to five PBCV-1 antigenic variant MCPs are described. Second, using genetic and glycan structure evidence, we identify four of the putative GT activities involved in the synthesis of the glycan(s).

Glycan structures from representatives of the antigenic variants
The structures of the glycans from five different classes (Class E variants were not included in this study) were determined by combining chemical (via GC-MS of the appropriate derivatives) and NMR analyses, using the consolidated approach developed for PBCV-1 and the other chloroviruses (11, 14-16). The 1H NMR spectrum of each variant (Fig. 2) differed from that of WT virus PBCV-1, supporting our hypothesis that the glycans were different. Accordingly, the features of each antigenic class will be described in the order of increasing structural complexity of the glycans. By convention, the monosaccharide residues identified in each oligosaccharide are labeled with the same (or a similar) letter used for the analogue unit of the glycan of the WT virus PBCV-1 (Fig. 1 and Figs. S1-S8 in the supporting material). Unless mentioned otherwise, we have focused on the major glycan structure of each of the variants. In some cases, minor structures made up less than 5-10% of the total sample.

Class D: Structure of the glycan of the antigenic variant P1L6
This virus is the most difficult to handle under laboratory conditions because it is unstable and tends to degrade during normal purification procedures. This intrinsic fragility of the viral particles prevented us from obtaining large amounts of biomass. Thus, to avoid further losses, the chromatographic purification step was omitted, and the glycopeptide mixture was studied directly.
Inspection of the 1H NMR spectrum of the P1L6 glycan (Fig. 2 and Fig. S1) showed two anomeric signals at 5.61 and 5.28 ppm along with a set of small signals at ~5.0 ppm, whereas the other regions were crowded with signals from peptides resulting from capsid protein digestion.
The structure of the glycan was determined by combining the chemical data with the analysis of the heteronuclear single quantum correlation (HSQC) spectrum. GC-MS analysis of the partially methylated and acetylated alditol derivatives (not shown) indicated that the sample contained a 2-linked Fuc, a terminal Gal, a terminal Xyl, and a 3,4-linked Glc. The HSQC spectrum (Fig. S1 and Table S2) had four anomeric resonances, and the overlap with the spectrum of the PBCV-1 glycan, used as a reference, permitted the identification of three of the four residues. As for H (3,4-substituted N-linked β-Glc), E (terminal α-Gal), and N (terminal β-Xyl), the 13C NMR chemical shifts were almost identical to those of the reference glycan, with only minor variations in the 1H NMR chemical shifts. The remaining carbon resonances were assigned to Fuc A, whose values differed from those of the reference glycoside (19) due to glycosylation at C-2. Thus, the N-linked glycan of P1L6 is a tetrasaccharide, significantly truncated compared with that of WT PBCV-1. It contained all of the units of the chlorovirus N-glycans present in the conserved core region except for the Xyl attached to Fuc, referred to as the distal Xyl (14).

Class C: Structure of the antigenic variants E11 and E1L3
The identity of the E1L3 and E11 glycans was deduced from the similarities between the anomeric region of their proton spectra (Fig. S2A) and by the coincidence of their HSQC spectra (Figs. S2B and S3). In addition, when compared with P1L6 HSQC (Fig. S1), both E11 and E1L3 had one additional anomeric signal at 5.03 ppm.
As for E1L3, comparison of its HSQC spectrum with that of PBCV-1 glycopeptides permitted the ready identification of several monosaccharide units (Table S3): the N-linked Glc H, the proximal Xyl N, and the Gal E. The anomeric proton at 5.03 ppm was identified by analyzing the other 2D NMR spectra, including correlation spectroscopy (COSY), total correlation spectroscopy (TOCSY), and transverse rotating frame Overhauser effect spectroscopy (T-ROESY). These data revealed that this resonance arose from a D-Rha unit linked at O-3 of the Fuc unit A. Unlike the WT glycan, this D-Rha was not further substituted with the terminal Man G; therefore, it was labeled F′ (and not F). Thus, E1L3 and E11 glycans are extended derivatives of P1L6 with the D-Rha unit. Accordingly, the Fuc residue was substituted at both O-2 and O-3.

Table 1
Genetic features of the 20 PBCV-1 antigenic variants
These variants are divided into six classes, depending on their reaction with polyclonal antibodies. Class E is not included in this study and is omitted in this table, because the members of this group cross-react with the antibodies selective for both Classes A and B, suggesting an intermediate phenotype. All of the antigenic classes except for C have a mutation in a064r; this gene encodes a protein of 638 amino acids that includes three domains of ~200 amino acids each (Fig. 1). The N-terminal domain 1 is a GT, domain 2 only matches bacterial proteins of unknown function, and domain 3 has a weak match with methyltransferases. The structure of the N-glycans has been determined in this work (Fig. 1) for the antigenic variants in boldface type.

Antigenic variant    Description
Antigenic Class A: mutation in the second domain of a064r (NP_048412.1)
EPA-11    Nonsense mutation (G → A) at nt position 932 in the gene; truncated protein (307 amino acids)
P9L1    Nonsense

Class B: Structure of the glycan of antigenic variants EPA1 and EPA2
The 1H NMR (Fig. S4A) and HSQC (EPA2 reported as an example in Fig. S4B) spectra of EPA1 and EPA2 glycopeptides were almost identical, and both displayed two anomeric signals at 4.5-4.4 ppm, the region that in the PBCV-1 WT glycan contains the two β-Xyl units M and N.
The units with anomeric signals at 5.63, 5.23, 5.04, 5.01, and 4.42 ppm were readily attributed to the Fuc A, Gal E, D-Rha F, Glc H, and proximal Xyl N, respectively, by comparison with PBCV-1 glycan data. Notably, the proton and carbon chemical shifts of Fuc A (Table S4) indicated that this Fuc was substituted at all of the positions available, whereas D-Rha F was not glycosylated.
Hence, NMR analysis focused on the additional anomeric signal at 4.46 ppm. Studying the connectivities from all of the 2D NMR spectra revealed that this signal belonged to a Xyl unit, connected to O-4 of A as in the PBCV-1 WT glycan, but not further substituted. Accordingly, this Xyl was labeled M. Overall, the NMR data showed that the structure of the glycans of the antigenic Class B, EPA1 and EPA2, consisted of the conserved core region characteristic of all of the chlorovirus N-glycans described to date (14-16) plus the semiconserved element, the D-Rha unit F (Fig. 1).

Class A: Structure of the glycans of antigenic variants P91, P9L1, and P9L10
The glycopeptides from the three variants of Class A shared a similar pattern of signals in the anomeric region of the 1H NMR spectra, although some marginal variations of their relative intensities were noted (Fig. S5A). Compared with Class B, the anomeric region of the 1H NMR spectra of Class A contained several additional signals, and the structural elucidation of these glycopeptides was performed by the combined analysis of the 2D NMR spectra (Figs. S5-S7 and Tables S5 and S6) as discussed for the other variants and detailed in the supporting material.

Figure 1. In the WT, each monosaccharide is denoted with the letter used during the NMR experiments, and residues inside the dotted box define the conserved core region. The glycosidic linkage of nonstoichiometric substituents is denoted with a broken line. Glyco1 and glyco2 are the most abundant glycoforms; both have Man. Glyco1 lacks Ara, and the units of the terminal L-Rha disaccharide are labeled as C and I. Ara is present in glyco2, and the two L-Rha units are labeled B and L. Antigenic classes are labeled with a capital letter (A-D and F), and the structure of their glycans depends on the kind of mutation found in the genome, which in all cases, except for Classes C and D, occurs in the gene a064r. EPA1 (antigenic Class B) and E11 (antigenic Class C) have identical mutations in domain 1 of a064r, but E11 has an additional mutation in gene a075l. P91 has an additional glycoform with β-L-Rha methylated at O-2; the letters used to label this oligosaccharide are not reported due to crowding, and the structure is displayed in Fig. S7.

Indeed, P9L10 produces a family of four glycopeptides, named P9L10 A-D (Fig. S6), with the distal Xyl M (or M″; see "Discussion" in the supporting material) further elongated with the β-L-Rha unit. These four glycans differed in the nonstoichiometric presence of two residues, Man and Ara, as shown in Fig. S6.
As for the P9L1 glycopeptides, the HSQC spectrum was almost superimposable on that of P9L10. The intensities of some signals varied, as deduced by comparing their proton spectra (Fig. S5A). In P9L1, the intensities of E and E′ were comparable, suggesting that D-Rha was much less substituted with Man compared with P9L10 glycopeptides. Densities reporting the Ara (D unit) were not detected, indicating that this unit was absent or below the detection limit. Accordingly, P9L1 produced two main glycopeptides, analogous to P9L10 A and P9L10 B (Fig. S6).
As for the glycopeptides of P91 (Fig. S7 and Table S6), most of the densities of its HSQC spectrum (Fig. S7, A and B) overlapped with those of the P9L1 and P9L10 variants. Like P9L1, the P91 variant lacks the Ara substitution to β-L-Rha, and it produced two oligosaccharides equivalent to P9L10 A and P9L10 B, as confirmed by analyzing the homonuclear spectra (T-ROESY in Fig. S7C and TOCSY in Fig. S7D). Importantly, these spectra enabled the assignment of densities that escaped the HSQC detection due to the low amount of the sample, including C-4 or C-5 of M″ (the putative position of these densities is enclosed in a dotted circle in Fig. S7B).
The P91 HSQC spectrum contained some additional signals, including a methyl group at 1H/13C 3.56/62.4 ppm (Fig. S7A) and an anomeric proton at 1H/13C 4.75/103.4 ppm (Fig. S7B). NMR analysis of this new residue determined that it was a β-L-Rha, labeled as I″, that differed from I and I′ because it was methylated at O-2 (Table S6). It was not possible to confirm whether the D-Rha unit F was further elongated with Man G, although this situation is the most plausible given the abundance of this residue, similar to that of Gal or Glc, as visible in the GC-MS profile of the monosaccharide analysis (Fig. S5C). GC-MS analysis also confirmed the presence of the rhamnose unit methylated at O-2 and detected a tiny amount of arabinose that escaped the NMR analysis due to its low abundance.

Class F: Structure of the glycan of antigenic variant CME6
Virus CME6 is the only variant isolated for antigenic Class F. Its proton and HSQC spectra (Fig. 2 and Fig. S8) contained an additional anomeric signal at 1H/13C 5.04/102.7 ppm (Fig. S8) when compared with those of the P9L10 variant. Comparison of the CME6 HSQC spectrum with that of PBCV-1 readily identified most of the residues, namely A, D-H, M, and N, for which the positions of all of the proton and carbon resonances were almost identical. Of note, units E′ or F′ described in Class A variants were absent or below the NMR detection limit, indicating that D-Rha was fully substituted with Man. Minor differences were found for the chemical shifts of the residues I and L, the β-L-Rha units 2- and 2,3-linked, respectively.
Analysis focused on the signal C at 1H/13C 5.04/102.7 ppm, which had a weak H-1/H-2 correlation in the COSY spectrum (not shown), whereas the TOCSY spectrum read from the H-2 (4.09 ppm) position identified the position of all the other protons of the unit, with H-6 at 1.26 ppm. Thus, C was a Rha, and carbon chemical shift values (Table S7) determined that it was α-configured at the anomeric center and not further substituted. HMBC and T-ROESY spectra showed that it was attached to O-2 of I, as in the WT virus, with the only difference that none of its positions were further decorated with a methyl group (Fig. 1).
As for other minor anomeric signals, it was possible to assign only that at 1H/13C 5.15/101.8 ppm (Fig. S8), labeled B. It was an α-Rha, which differed from the most intense C because it was attached to O-2 of L, the β-Rha unit with D at O-3.
Collectively, these data indicated that CME6 produces a mixture of two N-glycans (Fig. 1 and Fig. S8), identical to those described for WT PBCV-1, with the only difference being that the α-L-Rha unit was not methylated.

NMR spectra were recorded at 310 K except for P91 (316 K) and E1L3 (323 K).

Predicted PBCV-1-encoded genes involved in the synthesis of its glycan
Based on the discovery that the structures of the glycans from the five antigenic classes were truncated versions of the glycan from the WT PBCV-1 MCP and knowing that PBCV-1 already had several genes annotated as GTs (12) (Table S1), we could predict the function of some of them. These predictions were aided by the fact that the complete genomes of a few representatives of each of the four groups of chloroviruses mentioned in the Introduction (NC64A, Osy, SAG, and Pbi viruses) have been sequenced (7,15), and the structures of the glycan(s) attached to their MCPs have been determined (14-16). The basic glycan core structure is conserved in all of these viruses (outlined in Fig. 1). Additional sugar decorations occur on their respective core N-glycan structures and present a molecular signature for each chlorovirus. Therefore, knowledge of the glycan structures and the presence/absence of genes from the various groups of chloroviruses contributed to our prediction of the role for two of the PBCV-1 GTs. The features of each putative GT are hereafter discussed in light of the structural information of the glycans determined in this study.

Class D mutants and the role of A111/114R in the core glycan assembly/attachment
The four antigenic mutants in Class D (P1L2, P1L3, P1L6, and P1L10) contain large genomic deletions in similar locations that range in size from 27.2 to 38.6 kb (genome nucleotide locations: 12,407-39,576; 5,238-42,695; 2,190-40,797; and 7,698-39,722, respectively) or collectively from genes a014r through a078r (20). All of the chloroviruses (14-16), as well as four of the five PBCV-1 antigenic classes, have the conserved glycan core structure composed of five monosaccharides (one unit each of D-Glc, D-Gal, and L-Fuc and two units of D-Xyl) attached to the MCP (Fig. 1) plus D-Rha, the semiconserved unit attached to Fuc.
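The deletion sizes quoted above follow directly from the listed genome coordinates. The short sketch below recomputes them and also reports the interval shared by all four deletions. It assumes that the coordinates are inclusive and that they pair with the variant names in the order given by "respectively"; the function and variable names are ours.

```python
# Deletion coordinates as listed in the text (start, end), in nucleotides.
DELETIONS = {
    "P1L2": (12_407, 39_576),
    "P1L3": (5_238, 42_695),
    "P1L6": (2_190, 40_797),
    "P1L10": (7_698, 39_722),
}


def deletion_length(start, end):
    """Length in nt, assuming both end coordinates are inclusive."""
    return end - start + 1


def shared_interval(spans):
    """Interval removed in every deletion, or None if there is no overlap."""
    start = max(s for s, _ in spans)
    end = min(e for _, e in spans)
    return (start, end) if start <= end else None


sizes_kb = {
    name: round(deletion_length(*span) / 1000, 1)
    for name, span in DELETIONS.items()
}
common = shared_interval(list(DELETIONS.values()))

for name, kb in sizes_kb.items():
    print(f"{name}: {kb} kb")
print(f"Region missing in all four variants: {common}")
```

Under these assumptions, the smallest and largest values reproduce the 27.2 to 38.6 kb range stated in the text.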
Class D mutants are the only class with a reduced core glycan structure (Fig. 1), and this finding supported two conclusions: (i) the enzyme (or enzymes) that assembles the core structure is encoded by a gene (or genes) located outside of the collective region of the four large deletion mutants, and (ii) the genes encoding the enzymes that attach the distal D-Xyl, the fifth unit of the conserved core region, and the D-Rha, the semiconserved unit of the core glycan, probably lie in the region between a014r and a078r.
Accordingly, these two observations suggest that the protein A111/114R (NP_048459.2) is involved in the initiation and/or attachment of four of the monosaccharides in the conserved core glycan. Further evidence to support this inference is that this gene and the corresponding tetrasaccharide are present in all of the chloroviruses analyzed to date. Because the chlorovirus core glycan is unique in nature, we assume that it must be at least partially, if not fully, assembled/attached to the MCP by either a single large viral protein or several smaller viral proteins that are common to all of the chloroviruses. In PBCV-1, gene a111/114r is the only annotated orthologous GT gene found outside of the regions of the large deletion mutants described above that is present in all of the other chloroviruses. In addition, no MCP glycans smaller than the four-sugar units have been found in any of the chloroviruses (14-16) or in the PBCV-1 antigenic variants (this work).
The a111/114r gene, not including the promoter region, consists of 2,583 nt, which is longer than any of the other annotated PBCV-1 GT genes. This gene encodes the protein A111/114R, which consists of 860 amino acids organized into at least three domains; the second domain has been annotated as a GT (9).
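The gene and protein lengths quoted above are mutually consistent if the 2,583-nt figure includes the stop codon. That is our assumption rather than something stated in the text, and the helper below is only an illustrative check.

```python
def encoded_residues(orf_length_nt, includes_stop_codon=True):
    """Number of amino acids encoded by an open reading frame of given length.

    Assumes an uninterrupted ORF whose length is a multiple of three.
    """
    if orf_length_nt % 3 != 0:
        raise ValueError("ORF length must be a multiple of 3")
    codons = orf_length_nt // 3
    # The stop codon occupies one codon but adds no residue.
    return codons - 1 if includes_stop_codon else codons


A111_114R_GENE_NT = 2_583
print(encoded_residues(A111_114R_GENE_NT))
```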
Because of the size of the a111/114r gene, it is highly likely that mutations have occurred in the gene over time, but such mutations are probably lethal because no variant has been isolated. The working model is that a mutation in the a111/114r gene, and in the orthologous genes from all of the chloroviruses, would result in the absence of a glycan core attached to the MCP, leading to the loss of a viable virus. This leads to the conclusion that the unique four-sugar structure is the minimal glycan required for chlorovirus survival, although the way it contributes to the viability of the virus is still unclear. Certainly, the hydrophilic nature of the glycans would help the virus to remain soluble, and thus infective, in an aquatic environment. Furthermore, as noted above, Class D variants are unstable in the laboratory, suggesting that the glycans may be important for MCP folding and/or for the correct assemblage of the capsid.
Thus, all available evidence supports the conclusion that A111/114R plays a key role in the assemblage of part or all of the conserved core oligosaccharide.

Class C mutants and the role of A071R
The glycans of the Class C mutants (Fig. 1) differ from those of Class D because they have a D-Rha attached to L-Fuc; however, like the Class D mutants, the Class C mutants lack the distal D-Xyl and its attached sugars. The gene encoding the enzyme involved in the attachment of D-Rha to L-Fuc is located somewhere in the region between genes a005r and a077l because the Class D antigenic mutant P1L6, which lacks the D-Rha, has no mutations in its genome other than the large deletion that extends from gene a005r to a077l. Therefore, the gene that encodes the protein that performs this task must (i) lie between genes a005r and a077l; (ii) be found in other NC64A viruses and OsyNE5, all of which have a D-Rha attached to the L-Fuc; and (iii) be absent in SAG and Pbi viruses because they lack a D-Rha in that position. The only PBCV-1 gene that fits all of these requirements is a071r. The protein encoded by this gene, A071R (NP_048419.1), is of sufficient size (354 aa) to be a GT, although it has not been annotated as such. However, A071R does have a sequence similar to A064R domain 2, which also has not been annotated as a GT but for which there is strong evidence that it is a rhamnosyl GT (discussed below).
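The three requirements listed above act as a simple presence/absence filter over candidate genes. The sketch below shows that logic on toy data: only the a071r entry reflects what the text states, whereas the other three records are hypothetical placeholders invented to show how each criterion excludes a candidate. The record layout and names are ours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneRecord:
    name: str
    in_p1l6_deletion: bool          # criterion (i): between a005r and a077l
    in_d_rha_viruses: bool          # criterion (ii): NC64A viruses and OsyNE5
    in_sag_or_pbi_viruses: bool     # criterion (iii): must be False


def fits_d_rha_transferase_profile(gene):
    """True when a gene satisfies all three requirements from the text."""
    return (
        gene.in_p1l6_deletion
        and gene.in_d_rha_viruses
        and not gene.in_sag_or_pbi_viruses
    )


GENES = [
    GeneRecord("a071r", True, True, False),                # as described in the text
    GeneRecord("hypothetical_gene_1", True, True, True),   # placeholder: fails (iii)
    GeneRecord("hypothetical_gene_2", True, False, False), # placeholder: fails (ii)
    GeneRecord("hypothetical_gene_3", False, True, False), # placeholder: fails (i)
]

candidates = [g.name for g in GENES if fits_d_rha_transferase_profile(g)]
print(candidates)
```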

Class C mutants and the role of A075L
Class C antigenic variants also have mutations that prevent the addition of the distal D-Xyl, which is necessary before the addition of the two L-Rha units. As noted above in the discussion of the Class D large deletion mutants, the protein that attaches the distal D-Xyl to the L-Fuc must be encoded by a gene that lies between the genes a014r and a078r. The genetic evidence supports the hypothesis that the protein A075L (NP_048423.1) attaches D-Xyl to L-Fuc. This protein was previously annotated as an exostosin (Pfam database code PF03016.8), classified within the GT47 family. Six Class C antigenic mutants were genomically sequenced (E1L1, E1L2, E1L3, E1L4, P41, and E11), and all six have mutations in the a075l gene, but none of them shared mutations in any of the other genes that putatively encode GTs. The mutations in the gene a075l were of three types: (i) frameshift mutations that cause premature termination of the synthesis of the A075L protein; (ii) nonsense mutations that lead to premature termination of A075L protein synthesis; and (iii) in one case, a missense mutation in the initiator codon that results in a failure to initiate A075L protein synthesis. One of the Class C mutants, E11, is of particular interest. E11 is derived from the Class B antigenic mutant EPA-1, but the mutation common with EPA-1 became masked when E11 acquired a second mutation, located in gene a075l, which prevents the addition of the distal D-Xyl, the feature that distinguishes the Class C mutants from those in Class B.
Additional evidence that supports the role of A075L comes from seven other chlorovirus species whose MCP glycan structures have been determined (14-16). Even though these seven chloroviruses infect four different algal hosts, all of them have the same glycan core structure and the same distal D-Xyl attached to the L-Fuc, as does the WT PBCV-1. These seven chloroviruses all encode orthologs of the PBCV-1 A075L protein.

Class B mutants and the role of A064R domain 1 (A064R-D1)
Class B antigenic variants lack the terminal two L-Rha moieties (Fig. 1) but have the terminal D-Xyl because they all have a functional a075l gene; however, they have mutations that prevent further elongation of the oligosaccharide chain. Three Class B antigenic variants have been genomically sequenced (EPA-1, EPA-2, and P31), and all have mutations in the a064r gene, which encodes the protein A064R (NP_048412.1). This protein consists of 638 amino acids that are organized into at least three functional domains of ~200 amino acids each (Fig. 1). More specifically, all three Class B antigenic variants have mutations in the region of the a064r gene that encodes domain 1 of A064R.
Previous studies annotated this domain as a "fringe class" GT and hypothesized that the DXD motif coordinated the phosphate of the nucleotide sugar, presumably UDP-Glc, and a divalent cation, Mn2+ (21,22). The specificity of this domain could not be inferred at that time because the structure(s) of the N-glycans was unknown, preventing any hypothesis about the nature of the possible acceptor and donor substrates.
The function of domain 1 was inferred by comparing the N-glycans of Class B with those of Class A variants. These two sets of N-glycans (Fig. 1) differ in the presence of the β-L-Rha unit I (or L), supporting the hypothesis that the first domain of A064R encodes a β-L-rhamnosyltransferase that adds this monosaccharide to O-4 of a Xyl unit M.
As a possible nucleotide donor, we hypothesize that A064R-D1 uses UDP-L-rhamnose as substrate, whereas most prokaryotic enzymes recognize the dTDP-bound sugar. This difference can be explained by the fact that, in contrast to GDP-L-fucose or GDP-D-Rha that can be produced by virus-derived enzymes (23), a UDP-L-rhamnose biosynthetic pathway is not encoded by the PBCV-1 genome. Thus, the nucleotide-activated L-rhamnose is presumably synthesized by the chlorella host, using a typical plant UDP-L-rhamnose biosynthetic pathway (24). This hypothesis is corroborated by both docking studies (see below) and biochemical evidence that A064R domain 1 has this predicted activity (footnote 4).

Class A mutants and the role of A064R domain 2
Class A antigenic variants only lack the terminal α-L-Rha. Three variants of this antigenic class have been genomically sequenced (P91, EPA-11, and P9L1), and all of them lack mutations in domain 1 of a064r, but all three have mutations in domain 2 of this gene (Fig. 1), which encodes a bacterially derived protein of unknown function. Furthermore, none of these three variants has mutations in the gene that encodes A075L. Thus, these three members of Class A are able to add the terminal D-Xyl and the β-L-Rha to the core glycan (Fig. 1). Two of the three mutants, P9L1 and EPA-11, have nonsense mutations (point mutations that become stop codons) that lead to premature termination of translation. The third mutant, P91, has a 27-nucleotide insertion in the region of a064r that encodes domain 2. In addition, the a064r gene of another Class A mutant, P9L10, was amplified by PCR and sequenced. P9L10 acquired a nonsense mutation in a064r that results in premature termination of A064R near the N terminus of domain 2; consequently, the functions of both domains 2 and 3 are lost.
In summary, all of the Class A variants encode a functional domain 1 of A064R but are unable to encode a functional domain 2. These results are consistent with the glycan structure results, which support the hypothesis that domain 2 is involved in the attachment of the second Rha unit (C or B), the one with an α-glycosidic linkage, placed at O-2 of the first L-Rha unit (I or L). For the reasons mentioned for A064R-D1, our working hypothesis is that the second domain of A064R accepts UDP-L-Rha as donor.

Class F mutant and the role of A064R domain 3
The antigenic variant CME6, whose genome has been sequenced, is the only member of antigenic Class F (Table 1). The CME6 a064r gene has a TC dinucleotide insertion at genome position 36,276, which leads to premature chain termination during translation of domain 3, whereas the first two domains are unaffected. Accordingly, the N-glycans of this variant contain all of the sugars that comprise the WT glycan but lack the methyl groups at the O-2 and O-3 positions of the terminal Rha unit C or B (Fig. 1). This finding confirms the role of the first two domains of A064R; namely, the first domain attaches β-L-Rha (I or L) to β-Xyl M, and domain 2 attaches the last Rha unit (C or B) to the β-L-Rha.
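Whether an insertion disturbs the downstream reading frame depends only on whether its length is a multiple of three. The sketch below applies that rule to the two insertions described in the text, the 27-nucleotide insertion in P91 and the TC dinucleotide insertion in CME6. It is a minimal illustration with names of our choosing, and it does not check whether an in-frame insertion itself introduces a stop codon.

```python
def preserves_reading_frame(inserted_nt):
    """An insertion keeps the downstream frame only if its length is 3n."""
    if inserted_nt < 0:
        raise ValueError("insertion length cannot be negative")
    return inserted_nt % 3 == 0


# Insertion lengths (nt) in a064r as described in the text.
A064R_INSERTIONS = {"P91": 27, "CME6": 2}

for variant, length in A064R_INSERTIONS.items():
    outcome = "in frame" if preserves_reading_frame(length) else "frameshift"
    print(f"{variant}: {length}-nt insertion -> {outcome}")
```

The 27-nt insertion corresponds to nine extra codons, whereas the 2-nt insertion shifts the frame, in line with the premature termination reported for CME6.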
Whereas PBCV-1 encodes numerous DNA methyltransferase genes, the only mutated methyltransferase gene in CME6 is in the region of gene a064r that encodes domain 3 (Fig. 1). The best hit in a BLAST search of the WT amino acid sequence of domain 3 to the nucleotide database at NCBI was to 2′-O-rhamnosylmethyltransferases. Indeed, when the rhamnosylmethyltransferase from Mycobacterium smegmatis (25) was searched by BLAST against the PBCV-1 protein database, the only hit was to domain 3 of A064R (33% identity and 49% positive). Consistent with this assignment, when a064r is mutated in the third domain, the corresponding N-glycan is not methylated at either the O-2 or O-3 position, suggesting that this methyltransferase can perform a double methylation of the monosaccharide. However, we cannot exclude the possibility that domain 3 attaches only one methyl group and that a second methyltransferase, yet to be identified, is required to complete the full decoration of this Rha.

Footnote 4: Unpublished observations.
The mutant P91, a class A variant, provides additional support that domain 3 is a rhamnosylmethyltransferase. In the case of P91, domain 3 of A064R is translated correctly because the 27-nucleotide insertion in the second domain of a064r maintains the correct reading frame during its translation. Analysis of the P91 glycopeptide mixture detected a minor form with β-L-Rha methylated at O-2 (I″) (Fig. S7). Thus, domain 3 appears to be functional in P91, but it lacks its natural substrate because the α-L-Rha unit C (or B) is not assembled at the end of the glycan (see PBCV-1 structure in Fig. 1 and P91 glycan structure in Fig. S7); instead, there is a β-L-Rha for which it may have a reduced affinity.

Modeling of UDP-Rha into the catalytic domain 1 of A064R
Based on the evidence collected in this study, A064R-D1 is predicted to use UDP-L-Rha as the activated donor to synthesize the β-L-Rha-(1,4)-β-D-Xyl linkage. In a previous study (22), this domain was cloned and crystallized, and the authors tested several nucleotide sugars, but not UDP-L-Rha. The best affinity was found for UDP-Glc. Therefore, a molecular modeling approach was used to determine whether UDP-L-Rha is an acceptable substrate for this GT.
Docking methods are widely used to define or predict the binding mode of a small ligand to a receptor. Although these methods are efficient in discriminating between binders and nonbinders, they are not particularly accurate and typically cannot discriminate between substrates that differ by less than 1 order of magnitude in affinity (26). This is the case for the possible Rha/Glc relative selectivity for A064R-D1. Subtle differences originate from contributions of hydrogen bonds and CH/π stacking interactions. Thus, a molecular dynamics (MD) approach was used (27, 28). The structure deposited in the Protein Data Bank (PDB accession number 2P72) contains UDP-Glc as substrate. Therefore, its behavior was compared with that computed for the UDP-L-Rha analogue.
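As a rough guide to the energy scale behind the "1 order of magnitude" limit mentioned above, the standard relation between a ratio of dissociation constants and a difference in binding free energy can be evaluated. This is a worked estimate under an assumed temperature of 300 K (the temperature used for the MD simulations), not a result of this study; the labels "weak" and "tight" are illustrative only.

```latex
\Delta\Delta G = RT \,\ln\!\left(\frac{K_{d}^{\mathrm{weak}}}{K_{d}^{\mathrm{tight}}}\right),
\qquad
RT \ln 10 \approx \left(1.987\times10^{-3}\ \mathrm{kcal\,mol^{-1}\,K^{-1}}\right)\left(300\ \mathrm{K}\right)\left(2.303\right)
\approx 1.4\ \mathrm{kcal\,mol^{-1}}
```

A 10-fold difference in affinity therefore corresponds to only about 1.4 kcal/mol, which is the order of difference that docking scores typically cannot resolve.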
In the X-ray structure, the donor UDP-Glc is stabilized in the binding site by several hydrogen bonds and Mn2+-oxygen interactions. Specifically, upon ligand binding, the protein undergoes a major conformational change, which shifts Phe-13 toward the uracil ring of UDP. Two hydrogen bonds between the Phe-13 backbone atoms and the uracil ring are established. These same hydrogen bond interactions are established (Fig. 3, compare panel A with D and panel B with E) in the model of A064R-D1 bound to UDP-L-Rha and kept throughout the entire simulation time (Fig. S9). Similarly, the hydrogen bond network described for the ribose moiety in the X-ray structure is also maintained during the whole MD simulation (Fig. 3 (B and E) and Fig. S9). Specifically, the carbonyl oxygen of Gly-11 interacts with the OH-2 of ribose, whereas the NH of Ser-79 is a hydrogen bond donor to O-3 of ribose. The hydrogen bond between the carboxyl side chain of Asp-78 and the OH-3 of ribose is established only in the last 25 ns of the MD simulation (Fig. S9).
It has previously been described that His-54 experiences a significant conformational change upon UDP-Glc binding (Fig. 3C), suggesting a key role for this residue as the required catalytic base (22). Along this line, the MD simulation of A064R-D1 with UDP-Rha clearly identifies a stacking interaction between the histidine π-electron cloud and the C-H moiety of the less polar face of Rha (Fig. 3F). This CH/π interaction is stable in all of the MD simulations (Fig. S9). These findings are in full agreement with the previous hypothesis (22) about the role of Nε in His-54, which stabilizes the partial positive charge that develops during the reaction at the C-1 atom. In fact, the C-1-Nε distance is only 4.5 Å on average (Fig. 3F and Fig. S9). Finally, UDP-L-Rha is kept in the catalytic site by a polar network involving Arg-57 and Pro-149, which make hydrogen bonds with OH-4 of the sugar ring (Fig. 3F and Fig. S9).
Next, the free energy of binding for both UDP-L-Rha and UDP-Glc to A064R-D1 was estimated by MM/GBSA methods (25,26). Interestingly, the computations predict that the binding free energy for the A064R-D1/UDP-L-Rha complex is ~6 kcal/mol lower than that for the A064R-D1/UDP-Glc analogue (Table 2), in agreement with the prediction of the specificity of this GT and our preliminary experimental results (footnote 4). Among the key components of the binding free energy, UDP-Glc displayed the largest electrostatic energy changes upon binding (ΔE_ELE) and also showed the largest polar solvation energy (ΔG_Sol). Thus, the computational analysis strongly suggests that the major factor favoring the binding of UDP-L-Rha over UDP-Glc to A064R-D1 is the lower polar character of Rha with respect to Glc. These results are also in agreement with the low polarity of the A064R-D1-binding site defined by Pro-149, Phe-148, and Tyr-150, which fit well with the methyl substituent of Rha through hydrophobic interactions. Furthermore, we investigated why the antigenic variant EPA-1 lacks the terminal L-Rha moieties by docking the A064R_S79L mutant with the UDP-Rha substrate. The results of the analysis show that the bulky side chain of Leu-79 entails the reorganization of the DXD motif, with a consequent loss of Mn2+ coordination and most of the essential interactions with the substrate. Specifically, although the hydrogen bonds between Phe-13 and the uracil ring are preserved (Fig. 3G), the hydrogen bond network for the ribose moiety is completely abrogated (Fig. 3H). Moreover, His-54 is now flipped by ~90°, precluding CH/π interactions (Fig. 3I). These structural data provide an explanation for the observed truncated oligosaccharide chain in the EPA-1 antigenic variant.

Discussion
Unlike the great majority of viruses, the chlorovirus PBCV-1 encodes most, if not all, of the machinery to glycosylate its MCP. In this study, the truncated glycan structures from a set of PBCV-1 antigenic variants were determined. These structures plus genetic evidence led to the identification of four GT activities encoded by three PBCV-1 genes, a064r, a071r, and a075l, and enabled us to predict their functions. The a064r gene encodes a 638-amino acid protein with three domains: domain 1 is a putative β-L-rhamnosyltransferase; domain 2 is a previously unknown GT, here defined as probably being an α-L-rhamnosyltransferase; and domain 3 is a methyltransferase that decorates one or both positions of the terminal α-L-Rha unit. Therefore, we predict that the A064R protein produces the disaccharide 2,3-diOMe-α-L-Rha-(1,2)-β-L-Rha attached to O-4 of the distal Xyl unit (Fig. 4). Gene a071r is predicted to encode a GT that attaches a semiconserved α-D-Rha to O-3 of the L-Fuc located in the N-glycan core region (Fig. 4). Gene a075l is predicted to encode the β-xylosyltransferase that attaches a D-Xyl unit to O-4 of the same L-Fuc unit used as a substrate by A071R (Fig. 4). Finally, the putative GT(s) encoded by the a111/114r gene is predicted to be involved in the synthesis of four of the five residues composing the conserved glycan core structure (Fig. 4), because it is one of the few annotated GT-encoding genes that is present in all the chloroviruses. However, it is not known whether A111/114R transfers one or more of the sugars of the tetrasaccharide described in variant P1L6 (Fig. 1). We cannot exclude the possibility that one or more other unknown PBCV-1-encoded proteins besides A111/114R are involved in the assembly/attachment of the core glycan structure to the MCP. Alternatively, a host-encoded enzyme(s) could also be involved in the synthesis/attachment of the glycan core structure.
Virally encoded glycosylation is an emerging concept in virology, and it is not confined to virus PBCV-1. Other chloroviruses possess the genetic tools to accomplish this task, and the same scenario probably occurs in other taxonomically unrelated viruses. Indeed, bovine herpesvirus and myxoma virus encode a β-(1,6)-GlcNAc-transferase and an α-(2,3)-sialyltransferase, respectively (29). In addition, we predict that many of the giant viruses, characterized by large particle size and very large genomes (30), will encode at least some, if not all, of the enzymes needed to glycosylate their glycoproteins.
The list of giant viruses includes all of the known members of the Mimiviridae and Pandoraviridae families, along with Pithovirus and Mollivirus sibericum (30) and many others that are rapidly being discovered.
The giant viruses mentioned above differ in morphology, genome size, and physical size, but they share a common trait in that their genomes have several glycogenes (i.e. genes encoding enzymes able to manipulate sugar-nucleotides and GTs). Past work has demonstrated that mimivirus encodes three of the enzymes needed to synthesize UDP-GlcNAc (31), along with a functional pathway for UDP-Rha (32) and for UDP-N-acetylviosamine production (33,34) and a functional glycogenin paralogue (35). For many of the giant viruses, except for the glycogenin of mimivirus (35), the activities of the putative glycogenes are yet to be determined. If our prediction about giant viruses is correct, investigations of these organisms will provide the field of glycobiology with new discoveries and may ultimately answer the central question about what benefits these viruses have by maintaining their own glycosylation machinery through evolution.

Table 2
Calculated binding free energies (kcal/mol) for UDP-Rha and UDP-Glc to A064R-D1
The binding free energy (ΔG_bind) is decomposed into different energy terms: the gas-phase interaction energy (ΔE_gas) between the receptor and the ligand is the sum of electrostatic (ΔE_ELE) and van der Waals (ΔE_VDW) interaction energies. The solvation free energy (ΔG_sol) is divided into the polar (ΔG_GB) and non-polar (ΔG_Surf) energy terms.
Finally, interest in the new GTs described in this report goes beyond the viral world, as they have similarities with prokaryotic and eukaryotic enzymes, as exemplified by A064R-D1, for which the best matches are with LgtA, a prokaryotic protein, and with a bovine α-(1,3)-galactosyltransferase (22).

Virus growth and glycopeptide isolation from major capsid proteins
Antigenic variants of PBCV-1 (P9L10, P9L1, P91, EPA-1, EPA-2, E11, E1L3, P1L6, and CME6) were grown by infection of 4 liters of C. variabilis NC64A cells and purified as described previously (36) with some modifications as follows. Lysates were exposed to 1% Triton X-100 before two successive rounds of gradient centrifugation in 10-40% sucrose. The virus was treated with proteinase K (0.02 mg/ml) between the sucrose gradients to remove external contaminating proteins. The isolation of their MCPs followed an established protocol (16). Isolated amounts of MCP ranged from 0.2 to 1.5 mg. To isolate the glycopeptides, the MCP from each variant was dissolved in water and digested with proteinase K (Sigma catalog no. P6556). The digestion process was conducted three times over a time interval of 36 h using proteinase K at ~20% by weight of the starting amount of the MCP. The glycopeptide mixture was purified via size-exclusion chromatography as reported (11). Each antigenic variant produced one main glycopeptide fraction, eluting at ~50% of the column volume, with the exception of virus E1L3, which produced two glycoforms. The less abundant form was about 5-10% of the total mixture; its proton spectrum was similar to that of the WT virus, and it was not investigated further in this study. For the P1L6 virus variant, due to the low amount of the virus available, the purification step was omitted to avoid further losses. This glycopeptide was analyzed directly via NMR spectroscopy; signals arising from the nonglycosylated peptides did not hinder the analysis of the glycan portion.

Glycopeptide chemical analysis
Partially methylated acetylated alditols and fully acetylated alditols were generated from the purified glycoprotein (~0.2 mg) or from the glycopeptides according to published protocols (37). Analyses of these derivatives were carried out via GC-MS on an Agilent Technologies Gas Chromatograph 6850A interfaced with a selective Mass Detector 5973N equipped with a SPB-5 capillary column (Supelco, 30 m × 0.25-mm inner diameter, flow rate 0.8 ml/min using helium as the carrier gas). Electron impact mass spectra were recorded with an ionization energy of 70 eV and an ionizing current of 0.2 mA. The following temperature program was used: 150°C for 5 min, 3°C/min up to 300°C, 300°C for 5 min.

NMR acquisition conditions
All NMR experiments were carried out on a Bruker DRX-600 spectrometer equipped with a cryoprobe, calibrated with acetone used as an internal standard (δH = 2.225 ppm; δC = 31.45 ppm), and recorded in D2O at 310 K (CME6, E11, EPA-1, EPA-2, P9L1, P9L10, and P1L6), at 316 K (P91), or at 323 K (E1L3) to shift the signal of the residual water peak and minimize its overlap with some of the glycopeptide signals. The complete set of two-dimensional spectra (DQ-COSY, TOCSY, T-ROESY, gHSQC, and gHMBC) was recorded for all glycopeptides, except for P1L6, P9L1, and E11, for which the gHSQC experiment was enough to establish the structure of the glycan. The spectral width was set to 10 ppm, and the frequency carrier was placed at the residual HOD peak, suppressed by presaturation. Two-dimensional spectra were measured using standard Bruker software; for all of the experiments, 512 FIDs of 2,048 complex data points were collected, 32 scans per FID were acquired for homonuclear spectra, and mixing times of 100 and 300 ms were used for TOCSY and T-ROESY spectra acquisition, respectively. Heteronuclear 1H-13C spectra were measured in the 1H detection mode, the gHSQC spectrum was acquired with 30-60 scans per FID (depending on the abundance of the sample), the GARP sequence was used for 13C decoupling during acquisition, and gHMBC scans doubled those of the gHSQC spectrum.

Bioinformatics
Geneious version 11.0.5 software (38) was used to deepen the genetic analysis of the variants to identify other genes involved in glycosylation or to define the nature of the mutation. The software was used to assemble and map antigenic mutant-sequencing reads (total genome) to the PBCV-1 WT reference genome. The software allowed for quick and easy identification of genetic variants in the antigenic mutant viruses.

MD simulation of UDP-Rha and UDP-Glc with domain 1 of A064R
The starting structure for MD simulation of the GT A064R D1 in complex with the substrate UDP-L-Rha was generated by replacing the Glc unit of the UDP-Glc with L-Rha in the complex with the enzyme (PDB code 2P72). The structures of the UDP-Rha and UDP-Glc were relaxed by energy minimization with a low gradient convergence threshold (0.05) in 1,000 steps with the MM3* force field integrated in the MacroModel program run under Maestro (release 2016-4) (Schrödinger, LLC, New York, NY). The nucleotide moieties of the minimized UDP-L-Rha and UDP-Glc were superimposed on the UDP molecule in the structure of A064R in complex with UDP (PDB code 2P73). Partial charges of the ligands were calculated using the Gaussian 09 package with the HF/6-31G(d) basis set. All molecular parameters were converted using the Antechamber program in the Amber16 package. The resulting structures were used as starting points for molecular dynamics simulation of 100 ns in explicit water solvent, using the Amber16 program with the protein.ff14SB force field parameters for protein and the generalized amber force field gaff2 for the sugar nucleotide, whereas the water.tip3p force field was used for water and ions. The atomic parameters for the Mn2+ ion were defined according to published data (39). The starting structures were solvated in a 10-Å octahedral box of explicit TIP3P waters, and counterions were added to maintain electroneutrality. To fill all of the protein cavities with water molecules, a preliminary minimization of only the solvent and ions was performed. To reach a low-energy starting structure, the entire system was then minimized with a higher number of cycles, using the steepest descent algorithm.
The system was subjected to two short molecular dynamics simulations (heating and equilibration) before starting the production simulation: (i) 20 ps of MD heating the whole system from 0 to 300 K, using an NVT (a fixed number of atoms (N), a fixed volume (V), and a fixed temperature (T)) ensemble and a cutoff of 10 Å, followed by (ii) equilibration of the entire system over 100 ps at 300 K using an NPT (a fixed number of atoms (N), a fixed pressure (P), and a fixed temperature (T)) ensemble and a cutoff of 10 Å. A relaxation time of 2 ps was used to equilibrate the entire system in each step. The equilibrated structure was the starting point for the final MD simulations at constant temperature (300 K) and pressure (1 atmosphere). Molecular dynamics simulation without constraints was recorded, using an NPT ensemble with periodic boundary conditions, a cutoff of 10 Å, and the particle mesh Ewald method. Coordinates and energy values were recorded every 10,000 steps (20 ps) for a total simulation time of 100 ns and an ensemble of 5,000 MD frames. A detailed analysis of the MD trajectory (for example, root mean square deviation evaluation, kinetic and potential energies, etc.) was accomplished using the cpptraj module included in the AmberTools16 package, and it is gathered in the supporting material (Fig. S9). The structure of the GT from antigenic variant EPA-1, GT A064R D1_S79L mutant, in complex with the substrate UDP-L-Rha was generated by the mutagenesis tool integrated into the PyMOL program (PyMOL Molecular Graphics System, version 2.0, Schrödinger, LLC). The generated structure was minimized using the Amber16 program as described above.
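The sampling figures quoted for the production run can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text (10,000 steps per saved frame, 20 ps between frames, 100 ns in total); the 2-fs integration time step is inferred from them rather than stated, and the variable names are illustrative only.

```python
# Bookkeeping for the production MD run, using only values quoted in the text.
# Variable names are illustrative; the time step is inferred, not stated.

STEPS_PER_SAVE = 10_000    # coordinates written every 10,000 steps
SAVE_INTERVAL_PS = 20.0    # ...which corresponds to 20 ps of simulated time
TOTAL_TIME_NS = 100.0      # length of the production simulation

# Implied integration time step (convert ps to fs, divide by steps per save).
timestep_fs = SAVE_INTERVAL_PS * 1000.0 / STEPS_PER_SAVE

# Number of saved frames and total number of integration steps.
n_frames = round(TOTAL_TIME_NS * 1000.0 / SAVE_INTERVAL_PS)
n_steps = n_frames * STEPS_PER_SAVE

print(f"implied time step: {timestep_fs:g} fs")
print(f"saved frames:      {n_frames}")
print(f"integration steps: {n_steps}")
```

The frame count reproduces the 5,000 MD frames reported for the trajectory.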

MM/GBSA calculations
Binding free energy analysis was performed by applying the MM/GBSA calculation. The molecular mechanics energies combined with the generalized Born and surface area continuum solvation method were applied to the structures of UDP-L-Rha and UDP-Glc, A064R-D1, and the UDP-L-Rha/A064R-D1 and UDP-Glc/A064R-D1 complexes generated from the MD simulations, as described above. The receptor/ligand complexes were energetically minimized by the MM/GBSA method as implemented in the Sander program of the Amber package. The atomic radii (set default PBRadii mbondi2) were chosen for all GBSA calculations. For the ensemble-average MM/GBSA rescoring, an MD trajectory was obtained for both UDP-L-Rha/A064R-D1 and UDP-Glc/A064R-D1 complexes. Snapshots of the complex were extracted from the MD trajectory at every 10 ps. Thus, the final binding free energy is an average of the binding free energies of 10,000 snapshots.
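The decomposition described in the Table 2 legend and the snapshot averaging described above can be written out as follows. This is a sketch of the usual MM/GBSA bookkeeping, assuming that each term is a difference between the complex and the separated receptor and ligand and that no explicit entropic term is added (none is mentioned in the text); the snapshot index i is introduced here for illustration.

```latex
\begin{align}
\Delta X &= X_{\mathrm{complex}} - X_{\mathrm{receptor}} - X_{\mathrm{ligand}}
  && \text{(for each energy term } X\text{)} \\
\Delta E_{\mathrm{gas}} &= \Delta E_{\mathrm{ELE}} + \Delta E_{\mathrm{VDW}} \\
\Delta G_{\mathrm{sol}} &= \Delta G_{\mathrm{GB}} + \Delta G_{\mathrm{Surf}} \\
\Delta G_{\mathrm{bind}} &= \Delta E_{\mathrm{gas}} + \Delta G_{\mathrm{sol}} \\
\langle \Delta G_{\mathrm{bind}} \rangle &= \frac{1}{N} \sum_{i=1}^{N} \Delta G_{\mathrm{bind}}^{(i)},
  \qquad N = 10{,}000 \ \text{snapshots}
\end{align}
```

With snapshots taken every 10 ps, N = 10,000 corresponds to 100 ns of trajectory.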