Determination of the Site-specific O-Glycosylation Pattern of the Porcine Submaxillary Mucin Tandem Repeat Glycopeptide
MODEL PROPOSED FOR THE POLYPEPTIDE:GalNAc TRANSFERASE PEPTIDE BINDING SITE

The heterogeneously glycosylated 81-residue tryptic tandem repeat glycopeptide from porcine submaxillary mucin (PSM) has been isolated and its glycosylation pattern determined by amino acid sequencing. Key to these studies is the ability to trim the structurally heterogeneous PSM oligosaccharide side chains to homogeneous GalNAc monosaccharide side chains by mild trifluoromethanesulfonic acid treatment. Trypsin treatment of trifluoromethanesulfonic acid-treated PSM releases the 81-residue tandem repeat as an ensemble of 81-residue glycopeptides with different glycosylation patterns. Automated amino acid sequencing of the repeat using Edman degradative chemistry was used to determine the extent of glycosylation of nearly every Ser and Thr residue. The Thr residues are all highly glycosylated, within the range of 73-90%, giving an average Thr glycosylation of 83%. In contrast, the Ser residues display a wide range of glycosylation, between 33 and 95%, giving an average Ser glycosylation of 74%. These data are consistent with the known elevated glycosylation of Thr peptides over Ser peptides for the porcine UDP-N-acetylgalactosamine:polypeptide N-acetylgalactosaminyltransferase. It is also observed that the extent of glycosylation of the repeat correlates poorly with published predictive methods. An examination of the sequences surrounding the glycosylation sites reveals that nearly all of the highly glycosylated sites have a penultimate Gly residue, whereas those that are less highly glycosylated have medium to large side chain penultimate residues. As observed by others, glycosylation also appears to be modulated by the presence of Pro residues. On the basis of these findings we suggest that the acceptor peptide binds the transferase in a β-like conformation and that penultimate residue side chain steric interactions may play a role in determining the extent to which a given Ser or Thr is glycosylated.
A model for the GalNAc transferase peptide binding site is proposed.
Mucin glycoproteins are heavily O-glycosylated glycoproteins secreted by higher organisms that serve vital roles, protecting and lubricating epithelial cell surfaces from biological, chemical, and mechanical insult. Mucins and mucin-like molecules attached to membrane and cell surfaces play additional important roles by modulating, for example, immune response, inflammation, and tumorigenesis (1-5). The O-glycosylated domains of mucins and mucin-like glycoproteins typically contain 50-80% carbohydrate and possess expanded conformations. These regions typically have high Ser and Thr contents and are commonly composed of polypeptide tandem repeats containing clusters of Ser and Thr residues. It has been demonstrated, in the case of mucins, that the O-linked oligosaccharide side chains, attached via α-N-acetylgalactosamine (GalNAc) to Ser and Thr, are solely responsible for their 3-fold expanded peptide chain dimensions (6). Chemical and NMR studies indicate that on average 75% or more of the Ser and Thr residues in mucins are glycosylated (7, 8). Little, however, is known of the actual distribution of the carbohydrate in these clusters along the mucin polypeptide core. In addition, it is unknown whether the distribution of oligosaccharides along the peptide core is random or whether specific Ser/Thr residues are preferentially glycosylated over others. Obtaining this information by peptide mapping approaches would be a significant analytical undertaking because of the vast array of glycopeptides with different glycosylation patterns and oligosaccharide structures that would need to be quantitatively isolated and characterized.
Several predictive methods, based on the analysis of reported O-glycosylation sites compiled from protein data bases, are available for estimating the relative propensity for a given Ser or Thr to be glycosylated (9-11). Application of these predictive methods to the highly glycosylated mucins may not be fully valid because mucins and mucin-like molecules were not a significant component of the "training" data sets, which were dominated by globular glycoproteins, and where presented, the data on the glycosylation of mucins and mucin-like glycoproteins are commonly incomplete or in error. Furthermore, the propensity for O-glycosylation of Ser and Thr residues at the surface of globular glycoproteins may vary considerably from that found in mucin-like glycoproteins, which presumably serve different structural functions.
In vitro O-glycosylation studies using synthetic peptide acceptors have also been utilized to determine the propensity for a given Ser and Thr to be O-glycosylated (12-19). Unfortunately, the isolated peptide α-N-acetylgalactosaminyltransferases give different extents of Ser/Thr glycosylation on peptide substrates compared with that observed in vivo (14, 20). This may be the result of altered enzyme specificity caused by inappropriate solution conditions, the absence of cofactors, the presence of more than one transferase (21-23), or artifacts arising from the use of relatively small peptides containing charged N and C termini. Only by characterizing and quantifying the specific in vivo glycosylation of native and/or expressed recombinant proteins and peptides (20) is it likely that valid and useful data will be obtained for determining the true in vivo transferase specificities.
To begin to address the structural effects of O-glycosylation and to determine the extent that mucin O-glycosylation is modulated in vivo in a site-specific manner, we have undertaken the isolation and characterization of the porcine submaxillary gland mucin (PSM)-glycosylated tandem repeat. We have determined the glycosylation pattern of the isolated 81-residue tandem repeat and have isolated and characterized several smaller PSM tandem repeat-derived glycopeptides (24). This work provides the first detailed analysis of the glycosylation pattern of a mucin and suggests that mucin glycosylation is modulated by peptide sequence but not entirely as expected by the existing O-glycosylation prediction algorithms. Based on the observed glycosylation pattern a model for the GalNAc transferase peptide binding site has been proposed.

Materials
α-GalNAc-Thr and α-GalNAc-Ser were a kind gift of R. Koganti, Biomera Inc., Edmonton, Alberta, Canada. Except where noted, all chemicals and enzyme reagents were obtained from Sigma or Fisher.

Methods
Isolation of PSM. Porcine submaxillary gland mucin was obtained from frozen porcine submaxillary glands, in gram quantities, as described by Shogren et al. (25).
Reduction, Carboxymethylation, and Trypsinolysis of PSM. PSM was reduced and carboxymethylated by the methods of Gupta and Jentoft (32), giving R-PSM. R-PSM (1 g/50 ml) was digested with L-1-tosylamido-2-phenylethyl chloromethyl ketone-trypsin (15 mg/1 g R-PSM) (Worthington) overnight at 37°C in 50 mM ammonium bicarbonate, pH 8.3. A second aliquot of trypsin was added to ensure complete digestion, and the incubation was continued for 5-8 h. Toluene was added to both incubation solutions to prevent microbial growth. After exhaustive dialysis and the removal of insoluble debris by centrifugation, trypsinized R-PSM (TR-PSM) was lyophilized.
Partial Deglycosylation by Trifluoromethanesulfonic Acid (TFMSA) and Isolation of PSM Tandem Repeats. Lyophilized and desiccator-dried TR-PSM (~0.75 g in 75-ml Teflon screw-cap tubes) was reacted for 6-16 h with a mixture of TFMSA (50 g) and anisole (15 ml) (Aldrich) at 0°C following the approach of Gerken et al. (26, see also Ref. 27). To reduce heating effects, both the reagents and lyophilized TR-PSM were chilled in dry ice/ethanol prior to their mixing. After incubation with occasional vigorous shaking, the reaction was again chilled, and 1 volume of cold anhydrous diethyl ether was added. This mixture was slowly added to 125 ml of a frozen slush of 60% pyridine, after which the solution was warmed to room temperature and extracted with ether. The aqueous phase containing the partially deglycosylated TR-PSM (TTR-PSM) was dialyzed exhaustively and lyophilized.
Low molecular weight non-glycosylated peptides were separated from the glycosylated tandem repeat subunits by gel filtration chromatography on Sephacryl S200 (Pharmacia Biotech, Uppsala, Sweden) (column dimensions 5 × 55 cm, 7-ml fraction volumes) eluted with 50 mM ammonium bicarbonate buffer. Glycoprotein content was monitored by periodic acid-Schiff reagent (28), absorbance at 555 nm, and protein was monitored by the absorbance at 220 nm.
The high molecular weight carbohydrate-containing fraction eluting near the void volume of the S200 column was lyophilized and treated a second time with L-1-tosylamido-2-phenylethyl chloromethyl ketone-trypsin (10 mg/g TTR-PSM) using the conditions described for the initial trypsin treatment. The digested TTR-PSM (TTTR-PSM) was fractionated on S200, and the PSM tandem repeats were isolated as the major included glycopeptide fraction (TTTR-PSM-T3) and lyophilized.
Glu-C Digestion of PSM Tandem Repeat. The TTTR-PSM-T3 tandem repeat (30 mg) in 5 ml of 25 mM ammonium bicarbonate, pH 7.8, was digested with 1 mg of protease Glu-C (Boehringer Mannheim) for 20 h at 25°C. After a second addition of Glu-C and further digestion, the mixture was fractionated by S200 chromatography. The major glycopeptide peak (GTTTR-PSM) was separated into the major N- and C-terminal tandem repeat peptides on reverse phase HPLC.
HPLC Purification of PSM Tandem Repeat and Glu-C-digested Tandem Repeat. The TTTR-PSM-T3 tandem repeat and Glu-C-digested repeat (GTTTR-PSM) were further purified by reverse phase HPLC on a 0.46 × 15 cm C18 ODSII column (Alltech Associates Inc., Deerfield, IL) using 0.05% trifluoroacetic acid, water/acetonitrile gradients as described in the figure legend. All isolations were performed on a Varian 5000 HPLC system (Varian Associates, Walnut Creek, CA) equipped with a Shimadzu UV/VIS detector.
Amino Acid Sequencing. Pulsed liquid phase Edman degradation amino acid sequencing of the isolated PSM tandem repeats and Glu-C-derived glycopeptide was performed on either an Applied Biosystems 477A or Applied Biosystems Procise 494 protein sequencer (Perkin-Elmer), typically using standard manufacturer-recommended pulsed-liquid cycles (24). Samples of 2000-5000 pmol were dried on trifluoroacetic acid-washed glass fiber filters (ABI number 401111) spotted with 1.5 mg of BioBrene Plus (ABI 400385). Amino acid phenylthiohydantoin (PTH) derivatives were chromatographed on standard ABI 5-μm C18 PTH columns using the Fast Normal 1 gradient program and were monitored by the absorbance at 269 nm. The PTH-Thr/Ser-O-GalNAc derivatives were found to elute as two diastereomeric peaks at unique positions in the chromatogram (24). Due to increased peak broadening and changes in elution position as the PTH columns age, some variability and overlap of the glycosylated PTH derivatives with Ser, Thr, Gly, and Asp PTH derivatives were commonly observed. Since the ratio of the areas of the diastereomeric PTH-Ser/Thr-O-GalNAc derivatives was found to be relatively constant (typically 45/55 for Thr), the extent of glycosylation usually could be determined by the use of simple algebra. Modifications in the HPLC gradient were found to produce only marginal improvement in the separation of the PTH-Ser/Thr-GalNAc derivatives from the elution positions of neighboring PTH-derivatives. Prior to data analysis, long term cycle preview and lag for each PTH-derivative were eliminated by a base-line subtraction approach. This was performed by subtracting the 5-cycle minimum value running average across the entire sequencing run for each peak. Due to the length of the peptides and high content of similar amino acids (i.e. Gly, Ser, Thr, and Ala), no attempts were made to include corrections for adjacent residue cycle lag or preview.
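The base-line subtraction described above can be sketched as follows. This is only one possible reading of "5-cycle minimum value running average": here the local minimum is taken over a centered 5-cycle window and that floor is then smoothed with a 5-cycle running mean. The function names, the centered window, the clipping at zero, and the example areas are all assumptions made for illustration and are not taken from the original analysis.

```python
def rolling(values, width, reducer):
    """Apply `reducer` to a centered window of `width` cycles (truncated at the ends)."""
    half = width // 2
    return [
        reducer(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


def subtract_baseline(areas, width=5):
    """Return per-cycle peak areas with a slowly varying base line removed."""
    # Local floor: the smallest area seen in each 5-cycle window.
    minima = rolling(areas, width, min)
    # Smooth the floor with a running mean so the base line varies gradually.
    baseline = rolling(minima, width, lambda window: sum(window) / len(window))
    # Clip at zero so noise never produces negative picomole values (assumption).
    return [max(0.0, area - base) for area, base in zip(areas, baseline)]


# Hypothetical Gly-PTH areas: a slowly rising background with genuine
# Gly residues at cycles 3 and 8 (1-based).
areas = [10, 10, 60, 11, 11, 12, 12, 70, 13, 13, 14]
corrected = subtract_baseline(areas)
```

In this toy run the two genuine peaks survive nearly intact while the background cycles fall close to zero, which is the behavior the correction is meant to achieve.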
Response factors for the PTH-Ser/Thr-O-GalNAc derivatives were obtained from shorter glycopeptide sequencing experiments (24) and were found to be similar to those of Ser and Thr. Response values for Ser, Thr, and their glycosylated PTH-derivatives were further adjusted for each sequencing run to obtain consistent picomole yields relative to the non-Ser/Thr residues.
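The "simple algebra" used when one glycosylated diastereomer peak overlaps a neighboring PTH derivative might look like the sketch below. It assumes that the resolved peak carries a fixed share of the glycosylated signal (0.45, from the 45/55 ratio quoted for Thr), that the overlapped peak is the sum of the second diastereomer and the nonglycosylated residue, and that the response factors are equal. Which diastereomer is the resolved one, and all names and numbers here, are illustrative assumptions.

```python
def percent_glycosylated(clean_area, overlapped_area, clean_fraction=0.45):
    """Estimate percent glycosylation when one diastereomer peak is overlapped.

    clean_area      -- area of the resolved glycosylated diastereomer peak
    overlapped_area -- area of the peak holding the second diastereomer
                       plus the nonglycosylated PTH-residue
    clean_fraction  -- share of the glycosylated signal in the resolved peak
    """
    total_glyco = clean_area / clean_fraction        # both diastereomers together
    hidden_glyco = total_glyco - clean_area          # portion buried in the overlap
    unglycosylated = max(0.0, overlapped_area - hidden_glyco)
    return 100.0 * total_glyco / (total_glyco + unglycosylated)


# Hypothetical areas: 45 units resolved, 75 units in the overlapped peak.
example = percent_glycosylated(clean_area=45.0, overlapped_area=75.0)
```

With these made-up areas the overlapped peak is split into 55 units of glycosylated and 20 units of nonglycosylated residue, giving roughly 83% glycosylation.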
Prediction of O-Glycosylation Sites and Secondary Structure. The PSM tandem repeat sequence was analyzed for potential O-glycosylation sites using software kindly provided by Dr. A. Elhammer (9) and by using the E-mail NetOglyc server of Hansen et al. (10). The sequence coupled vector projection predictions were kindly performed by Dr. K. Chou (11). Peptide secondary structure predictions were performed by the SOPM internet server (29). Predictions were performed for at least 10 residues beyond the tandem repeat N- and C-terminal boundaries to eliminate end effects.
Peptide Modeling. Peptides were modeled using the Biopolymer module of InsightII (MSI Inc., formerly Biosym Technologies, San Diego, CA). Tandem Fig. 1A. The mucin's structure is dominated by the presence of highly O-glycosylated, multiple repeating 81-residue tandem repeats. These repeats make up the vast majority of the mucin's amino acid sequence and represent the major glycosylation sites in PSM.

Isolation of PSM Tryptic Tandem Repeat Glycopeptide. There are two potential trypsin cleavage sites, Arg-Ile and Arg-Pro, in each PSM tandem repeat. The Arg-Pro site, however, is expected to be inactive; therefore, trypsin treatment is expected to yield single copies of the 81-residue tandem repeat with the sequence given in Fig. 1B. Indeed, Timpte and co-workers (30) have shown that this tryptic peptide is obtained after digestion of the fully deglycosylated (apo) mucin with trypsin. In contrast, Gupta and Jentoft (32) have shown that trypsin treatment of fully glycosylated, reduced and carboxymethylated mucin fails to yield single copies of the repeat and instead yields relatively high molecular weight glycopeptides composed of undigested tandem repeats. Apparently, the longer oligosaccharide side chains inhibit the digestion of the PSM polypeptide by trypsin, preventing the isolation of individual tandem repeats. The Sephacryl S200 gel filtration chromatogram of this species, TR-PSM, is given in Fig. 2A.
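The expectation that an inactive Arg-Pro site leaves one tryptic peptide per repeat can be illustrated with a small in silico digest. The rule coded here (cleave after Lys or Arg unless the next residue is Pro) is the commonly used simplification of trypsin specificity; the 11-residue repeat unit is a hypothetical toy sequence, not the PSM repeat of Fig. 1B.

```python
def tryptic_digest(sequence):
    """Cleave after Lys/Arg unless the following residue is Pro."""
    fragments, start = [], 0
    # The last residue is skipped: there is no bond after it to cleave.
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    fragments.append(sequence[start:])
    return fragments


# Hypothetical repeat unit containing one Arg-Pro bond and ending in Arg,
# so that consecutive repeats are joined by an Arg-Ile bond.
toy_repeat = "IGSTRPAVSGR"
fragments = tryptic_digest(toy_repeat * 3)
```

Because the internal Arg-Pro bond is left intact, digestion of three joined copies of the toy repeat returns three copies of the full repeat unit rather than six shorter peptides.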
Only after quantitatively trimming the oligosaccharide side chains to the peptide-linked α-GalNAc residue by mild TFMSA treatment (26, 27), Fig. 2B, does the PSM glycopeptide core become susceptible to cleavage by trypsin, Fig. 2C, presumably releasing monomeric glycosylated tryptic tandem repeats (peak T3). Mild TFMSA treatment does not appreciably degrade the mucin, since on gel filtration untreated and treated mucin (TR-PSM and TTR-PSM, respectively) contain similar high molecular weight glycosylated peaks, Fig. 2, A and B. Integrations of the α-carbon resonances of the 13C NMR spectra of native and TTR-PSM confirm that 96 to 97% of the glycosylated Ser and Thr residues of the native mucin retain intact unsubstituted α-GalNAc residues after mild TFMSA treatment (data not shown, see Ref. 26).
The suspected monomeric glycosylated tryptic tandem repeat, pooled as indicated in Fig. 2C, gave a single sharp peak after rechromatography on S200 as shown in Fig. 2D (□). On reverse phase HPLC, Fig. 3A, this peak gives a single somewhat broadened peak, representing the ensemble of heterogeneously glycosylated tandem repeats. This peak was pooled for amino acid sequencing and for subsequent digestion by Glu-C (discussed below). Amino acid sequencing of this peak confirmed the isolation of the PSM tryptic tandem repeat.
Since each step in the above procedure typically yields a single major glycopeptide species that is pooled for the subsequent step, the obtained tandem repeat glycosylation pattern (see below) is expected to represent the majority of the PSM tandem repeats of the native mucin. After correcting for the different carbohydrate contents of native and TFMSA-modified PSM (26), it is calculated that between 40 and 50% (uncorrected for the nonspecific losses of material at each step) of the initial TR-PSM peptide is isolated in the pooled TTTR-PSM-T3 fraction. The possibility exists, however, that a subpopulation of differently glycosylated species may have been excluded, since the lower molecular weight regions of the glycopeptide peaks in Fig. 2, B and C, were not pooled. These lower molecular weight species are thought to represent nonspecific (protease and TFMSA) degradation products of the tandem repeat and other non-tandem repeat glycopeptides arising from the N- and C-terminal domains of PSM (see Fig. 1A). Evidence for nonspecific protease degradation arises from the observation that the use of non-L-1-tosylamido-2-phenylethyl chloromethyl ketone-treated trypsin gives a T3 peak that is further broadened to lower molecular weight. By eliminating these lower molecular weight species we are able to reduce the background sequencing "noise," thereby permitting the sequencing of longer segments of the tandem repeat glycopeptide (discussed below). We believe, however, that these degradation products would not have significantly different glycosylation patterns compared with the intact tandem repeat.

FIG. 1. ... (33). The PSM polypeptide consists of a very long highly extended O-glycosylated domain containing multiple copies, n, of an 81-residue tandem repeat which is followed by a relatively small, poorly glycosylated C-terminal domain. The glycosylated domain consists of several thousand residues comprising 25 or more tandem repeats (31). The C-terminal (and perhaps the N-terminal) domains have been shown to dimerize, m, thus accounting for the very large size of native mucin (33). B, the sequence of the tryptic PSM tandem repeat.

FIG. 2. Sephacryl S200 gel filtration chromatography showing the isolation of the PSM tryptic tandem repeat and its digestion by Glu-C.
For A-C protein content is monitored by the absorbance at 229 nm (□) and carbohydrate content by periodic acid-Schiff, absorbance at 555 nm (◆). A, S200 chromatogram of trypsinized, reduced, and carboxymethylated PSM (TR-PSM). B, chromatogram of TR-PSM after partial deglycosylation by TFMSA at 0°C giving TTR-PSM (see "Experimental Procedures"). C, S200 chromatography of the indicated TTR-PSM fraction in B after digestion with trypsin, yielding TTTR-PSM. The third peak, T3, represents the tryptic glycosylated PSM tandem repeat. D, S200 chromatogram of the glycosylated tryptic tandem repeat, T3, from C (□, absorbance at 229 nm) and the chromatogram of Glu-C-digested T3 (isolated on reverse phase HPLC, Fig. 3A) giving GTTTR-PSM (◆, absorbance at 229 nm).
Glu-C Glycopeptides. The PSM tandem repeat contains three potential cleavage sites for endoproteinase Glu-C from Staphylococcus aureus V8 (cleaving at the C terminus of Glu). Digestion of the tryptic PSM tandem repeat with Glu-C will therefore produce three glycopeptides of 38, 40, and 3 residues each, proceeding from the N to C terminus, respectively. We chose to isolate the 40-residue glycopeptide (residues 39-78) since its analysis would further help confirm the glycosylation pattern of the C-terminal half of the tandem repeat.
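As a quick consistency check on the Glu-C fragment sizes, the lengths can be computed from the residue ranges. The ranges 1-38 and 39-78 are the ones quoted in the text for the two large glycopeptides; the range 79-81 for the small C-terminal peptide is an assumption that follows from the 81-residue length of the repeat. The function name is illustrative.

```python
def fragment_lengths(ranges):
    """Length of each fragment given inclusive (first, last) residue numbers."""
    return [last - first + 1 for first, last in ranges]


# Residue ranges for the Glu-C fragments, N to C terminus.
# 79-81 is inferred from the 81-residue repeat length (assumption).
glu_c_fragments = [(1, 38), (39, 78), (79, 81)]
lengths = fragment_lengths(glu_c_fragments)   # [38, 40, 3]
total = sum(lengths)                          # 81
```

The three lengths sum to the full 81 residues, with residues 39-78 forming the 40-residue glycopeptide.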
The HPLC-purified tryptic tandem repeat (Fig. 3A) was digested with Glu-C and fractionated on S200 chromatography, Fig. 2D. As shown in the figure, the Glu-C digest (◆) migrates at a lower molecular weight than the undigested tryptic repeat (□). The 40- and 38-residue glycopeptides are not expected to be resolved on S200; therefore, the production of a single sharp peak after Glu-C digestion indicates the repeat has been fully cleaved by the enzyme. The remaining 3-residue glycopeptide was not specifically isolated or identified but presumably appears near the included volume of the column, at approximately fractions 120-140.
On reverse phase HPLC the pooled product of the Glu-C digest (pooled as indicated in Fig. 2D, ◆) reveals a complex pattern composed of two major broad peaks, labeled (GTTTR-PSM)-GII and (GTTTR-PSM)-GIII, as shown in Fig. 3B. Peak GII elutes at a lower acetonitrile content than the intact tryptic tandem repeat, whereas peak GIII elutes at an acetonitrile content similar to that of the intact tryptic tandem repeat. On the basis of the expected differences in hydrophobicities, the least hydrophobic fraction, GII, was tentatively assigned to the least hydrophobic glycopeptide 39-78 (containing 11 hydrophobic residues: Ala, Pro, Val, and Ile), and the more hydrophobic fraction, GIII, was tentatively assigned to glycopeptide 1-38 (containing 15 hydrophobic residues). Proton NMR spectroscopy, at 600 MHz, of each pooled fraction confirmed their identities based on their different amino acid compositions (data not shown). Fraction GTTTR-PSM-GII was unambiguously identified as residues 39-78 on the basis of its amino acid sequence, presented below.
Amino Acid Sequencing of PSM Tandem Repeat Glycopeptides. Amino acid sequencing was performed on the reverse phase HPLC-purified tandem repeat glycopeptides, TTTR-PSM-T3 and GTTTR-PSM-GII. As described earlier (24, 34, 35), unique elution patterns are observed for the PTH-derivatives of α-GalNAc-Ser and α-GalNAc-Thr, as shown in Fig. 4A for authentic α-GalNAc-Ser and α-GalNAc-Thr. Each glycosylated PTH-derivative appears as a pair of peaks in the chromatogram (Fig. 4A) because the conversion reaction forming the amino acid-PTH-derivative produces diastereomers with different retention times. The α-GalNAc-Ser-PTH-derivatives elute as an unresolved doublet, labeled S*+S**, early in the gradient near the position of PTH-Asp, and the α-GalNAc-Thr-PTH diastereomers elute later in the gradient as resolved peaks T* and T**, near the positions of PTH-Ser and PTH-Thr, respectively. These peaks are readily identified in the sequencing chromatograms of glycopeptide TTTR-PSM-T3 for residues Ser2, Ser6, and Thr22 as shown in Fig. 4, B-D. On the basis of the relative sizes of the glycosylated and nonglycosylated PTH-Ser/Thr derivatives in the TTTR-PSM-T3, it appears that Ser6 and Thr22 are more highly glycosylated than Ser2, which seems to be poorly glycosylated.

FIG. 3. ... Fig. 2C. B, chromatogram of the Glu-C-digested tryptic tandem repeat (GTTTR-PSM) pooled as indicated in Fig. 2D giving peaks GTTTR-PSM-GII and GTTTR-PSM-GIII. Solvent gradients for A and B were as follows: 0 min, 20% solvent B and 20 min, 50% solvent B. Solvent A, 100% water, 0.05% trifluoroacetic acid; solvent B, 50% acetonitrile, 50% water, 0.05% trifluoroacetic acid. Vertical scale represents the absorbance at 220 nm.
Sequencing chromatograms were quantified to obtain the residue-specific extent of glycosylation as illustrated in Figs. 5 and 6 for sequencing run 2 of TTTR-PSM-T3 tandem repeat glycopeptide. Fig. 5A displays the uncorrected area data for the Gly-PTH peak plotted as a function of sequence cycle. The figure shows a pronounced base-line curvature that is also observed for the other amino acid residues and glycosylated Ser/Thr (data not shown). Since the extent of base-line curvature correlated with the residue's percent mole fraction, we conclude that the curvature is due to the cumulative effects of cycle preview and lag and perhaps due to heterogeneous cleavage of the tandem repeat peptide. Since non-zero base lines will interfere with the accurate determination of the extent of glycosylation and with the sequence determination at high cycle numbers, we eliminated the curvature by the base-line subtraction approach described under "Experimental Procedures." The effectiveness of the base-line correction approach is shown in the corrected data for Gly, Ile, Val, and Ala of Fig. 5, C-F, and for glycosylated and nonglycosylated Ser and Thr of Fig. 6, A-D. Note that after base-line correction the sequence can be read well beyond residue 60 as demonstrated by the expanded plots of Fig. 5, D and F, and Fig. 6, B and D. A plot of the single residue picomoles recovered versus cycle number, Fig. 5B, indicates that we have achieved reasonable sequential residue quantification and recovery. An average apparent repetitive yield of 99% is obtained from the data in Fig. 5B. Table I lists the calculated extent of Ser/Thr glycosylation obtained from the multiple sequencing of the PSM tryptic tandem repeat (TTTR-PSM-T3) and its C-terminal Glu-C glycopeptide (GTTTR-PSM-GII). Sequence determinations 1 and 2 represent data from the same sample sequenced on different instruments. Note the excellent agreement between the two sequencing experiments. 
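An average repetitive yield of the kind quoted above is commonly estimated from the slope of ln(picomoles recovered) versus cycle number. The sketch below uses that standard approach; the text does not state how the 99% value was actually derived, and the synthetic data (3000 pmol initial yield, 0.99 per-cycle yield) are hypothetical.

```python
import math


def repetitive_yield(cycles, picomoles):
    """Per-cycle yield from a least-squares fit of ln(pmol) against cycle number."""
    logs = [math.log(p) for p in picomoles]
    n = len(cycles)
    mean_x = sum(cycles) / n
    mean_y = sum(logs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cycles, logs))
    sxx = sum((x - mean_x) ** 2 for x in cycles)
    slope = sxy / sxx              # ln(yield) per cycle
    return math.exp(slope)


# Hypothetical recovery curve: 3000 pmol at cycle 1, losing 1% per cycle.
cycles = list(range(1, 61))
picomoles = [3000 * 0.99 ** (c - 1) for c in cycles]
estimate = repetitive_yield(cycles, picomoles)
```

For real data only residues quantified with comparable response factors should be included in the fit, since the slope is sensitive to residue-specific recovery.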
Sequence determination 3 represents the tandem repeat obtained from a different PSM preparation. Again, the results of determination 3 are nearly indistinguishable from the results of sequence determinations 1 and 2. The glycosylation patterns of the C-terminal Glu-C tryptic tandem repeat glycopeptide, determinations 4 and 5, also are in good agreement with the data from the full tryptic tandem repeat.

FIG. 5. Representative amino acid sequencing profiles for the 81-residue tryptic PSM tandem repeat glycopeptide TTTR-PSM-T3. A, plot of the uncorrected peak area data for Gly-PTH as a function of cycle number. B, plot of the base-line corrected specific residue picomole content versus cycle number. C and D, base-line corrected picomole plots for Gly-PTH at 1× and 10× vertical scales, respectively. E and F, base-line corrected picomole plots for Ala-PTH (□), Val-PTH (+), and Ile-PTH (◇) at 1× and 10× vertical scales, respectively. Data are taken from sequence determination 2 (Table I).
An examination of sequence determinations 1-5, Table I, reveals that Ser2, Ser14, Ser32, Ser47, Ser54, Ser62, and Ser63 are consistently the least glycosylated. As a group, 74% of the Ser residues are glycosylated compared with 83% for the Thr residues, values consistent with the 13C NMR data analysis (data not shown, see Ref. 26). Combined, 78% of the Ser and Thr residues are glycosylated (Table I).
The average sequence-specific glycosylation obtained from  Table I. For the predictions, plus symbols to the right of the value indicate that the residue is predicted to be glycosylated based on the original published cutoff criteria, and minus symbols indicate the residue would not be glycosylated. As evident from the table and visually when the observed versus predicted glycosylations are plotted together (Fig. 7), the predictions do not completely agree with each other nor do they correlate well with the observed glycosylation.

DISCUSSION
O-Linked Glycoprotein Sequencing. These studies demonstrate that the partial deglycosylation of mucin-type O-linked glycoproteins by mild TFMSA yields glycoprotein derivatives with monosaccharide α-GalNAc side chains that are both susceptible to protease digestion and suitable for standard amino acid sequencing. Using this approach the heterogeneously glycosylated 81-residue PSM tryptic tandem repeat has been isolated and quantitatively sequenced, revealing its glycosylation pattern.
There are numerous reports of the use of automated Edman sequencing for the semiqualitative sequencing of O-linked glycoproteins and for the determination of the in vitro glycosylation patterns of the UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferase. Sites of glycosylation have typically been estimated by the presence of "blank" cycles (see for example Refs. 23, 35, and 36). Other studies of the GalNAc transferase have relied on the incorporation of radiolabeled UDP-GalNAc into substrate peptide and scintillation counting of the released products after Edman sequencing (see Refs. 9, 18, 19, 34, and 37). Few workers have attempted to chromatographically characterize or quantitatively analyze the resultant glycosylated Ser and Thr PTH derivatives obtained from standard Edman sequencing. Abernethy and co-workers (34) have described and partially characterized the α-GalNAc-O-Thr-PTH-derivative derived from the sequencing of a series of glycopeptide acceptors of the GalNAc transferase. Although they did not demonstrate its use, these workers suggested that a TFMSA-Edman sequencing approach would be useful for characterizing α-GalNAc-Thr-containing glycopeptides. In the laboratory of Gooley and co-workers (38-41), an Edman sequencing protocol has been developed, using the more hydrophilic solvent trifluoroacetic acid, for the sequence analysis of O-linked glycoproteins containing intact oligosaccharides. Unfortunately, for most glycoproteins with intact oligosaccharide side chains, the presence of heterogeneous oligosaccharide structures complicates the sequence analysis. Thus, quantitation of the extent of glycosylation is still a difficult task.
Another drawback of this approach is that the presence of full-length oligosaccharide side chains will interfere with protease digestions, making the isolation of reasonably sized glycopeptides with homogeneous peptide sequences difficult. Therefore, for several reasons, the use of mild TFMSA to trim heterogeneous O-linked oligosaccharide side chains to homogeneous peptide-linked GalNAc residues followed by standard Edman sequencing may be the most effective approach for determining the primary glycosylation pattern of heavily O-glycosylated glycoprotein domains.

PSM Tandem Repeat Glycosylation Pattern. The glycosylation pattern obtained for the PSM tandem repeat reveals that all potential glycosylation sites can be glycosylated, although to different extents. Interestingly, the Ser residues display a wider range of observed glycosylations, ranging from about 30% to nearly 100%, whereas the Thr residues show a much narrower range of glycosylation, ranging between 70 and 90%. The extent of Ser O-glycosylation of the PSM tandem repeat is higher than expected based on the in vitro glycosylation studies of the isolated porcine UDP-GalNAc:polypeptide N-acetylgalactosaminyltransferase (14, 15). With this enzyme and other GalNAc transferases isolated to date (see Refs. 9, 17-20, 35, 42, and 43), Ser residues are typically very poor substrates in in vitro glycosylation studies. In contrast, in vivo glycosylation studies reveal higher extents of O-glycosylation at both Ser and Thr (20). As expected, the observed PSM glycosylation pattern is consistent with these observations. As shown in Fig. 7, none of the three most recent peptide O-glycosylation predictive approaches were capable of successfully predicting the PSM tandem repeat glycosylation pattern.
A comparison of the panels in Fig. 7 suggests that Elhammer's and Chou's (9, 11) predictions may be more useful for mucins, as they correctly predicted the largest number of glycosylated residues. Unfortunately, none of the approaches performed well in predicting the poorly glycosylated residues, although all three approaches predicted the least glycosylated residue, Ser 2, to be nonglycosylated. The inability of these methods to predict the PSM glycosylation pattern reasonably well may arise from several factors, most notably the mixing of the glycosylation patterns of globular proteins and mucin-like domains in constructing the algorithms, the lack of accurate and complete glycosylation data, and the possible existence of several tissue-/species-specific transferases with different substrate specificities (21-23).
With such a large data base available in the PSM tandem repeat (i.e. a total of 31 different glycosylated Ser and Thr residues), an attempt was made to determine whether any specific patterns could be associated with a given sequence's degree of glycosylation. In Tables II and III we have listed, in order of increasing extent of glycosylation, the heptad peptide sequences for each Ser and Thr residue in the tandem repeat. In addition, due to the large number of Ser/Thr dyad sequences (9 pairs) and the presence of a single Ser triad, we have also listed their peptide sequences in Table IV for comparison.⁵ An examination of the Ser peptide data (Table II) suggests several possible trends. Sequences with Gly at positions +2 or −2 from the potential site of glycosylation appear to be associated with a higher degree of glycosylation; Gly at positions +3 or −3 may be associated with reduced glycosylation; and Gly at positions +1 or −1 is apparently glycosylation-neutral, as shown by the average glycosylation values at the bottom of Tables II-IV. Sequences containing Pro are usually more heavily glycosylated, with the Pro typically located at the +3 or −3 positions, and sequences with high Ser/Thr contents appear to be more poorly glycosylated. No obvious correlation between secondary structure and the extent of glycosylation is found, beyond the uniform presence of coil and extended secondary structures. The presence of large hydrophobic residues, Ile and Val, does not appear to correlate directly with the extent of glycosylation (however, their positions may be important, as discussed below). These trends also appear for Thr and the dyad and triad peptide sequences (Tables III and IV), although since their ranges in glycosylation are much smaller the trends are less apparent. It is noteworthy that 7 of the 9 dyad sequences (Table IV) have at least 1 Gly residue at dyad positions +1 or −1, while 4 of the sequences have Gly residues at both +1 and −1 dyad positions.⁶

⁵ The extent of glycosylation of Thr 79 and Ser 80 was determined from a single sequencing run of the full tandem repeat and therefore may contain significant errors. We believe, however, that the reported values are reasonable since in our earlier glycopeptide isolations (24) the major collagenase glycopeptide that is isolated is the fully glycosylated 14-residue C-terminal glycopeptide Gly 68-Arg 81, which contains glycosylated Thr 70, Ser 73, Thr 79, and Ser 80.

FIG. 7. Plots of predicted versus observed extents of glycosylation for the individual Ser and Thr residues in the PSM tryptic tandem repeat. A, values of h (9) versus observed glycosylation (Table I). B, NetOglyc values versus observed glycosylation (Table I). Residues with NetOglyc values greater than 0.5 (vertical dashed line) are predicted to be O-glycosylated. C, Δ values (11) versus observed glycosylation (Table I). Residues with "Δ" values greater than 0 (vertical dashed line) are predicted to be glycosylated.

To examine the possibility that the observed extent of glycosylation could be rationalized in terms of specific peptide conformation and structure, we built models for the peptide sequences in Tables II-IV. An extended β-conformation⁷ was found to best account for the nearest neighbor and penultimate positional effects of Gly. In a β-conformation, nearest neighbor residues will have their side chains directed away from each other on opposite sides of the peptide backbone, whereas penultimate residues (at positions +2 or −2) will have their side chains adjacent to each other on the same side of the peptide backbone. Thus, the observation that penultimate Gly residues may favor glycosylation suggests that penultimate residues with larger side chains may interfere with glycosylation. Furthermore, since the side chains of residues at positions +1 and −1 would not be expected to sterically interfere with the Ser or Thr side chain, the presence or absence of residues containing bulky side chains at these positions would be expected to have little effect on glycosylation. Thus, Ser 2, Ser 54, and Ser 47, which have N- and C-terminal penultimate residues with side chains, are poorly glycosylated, whereas Ser 66, Ser 57, and Ser 17, which have a single penultimate neighbor with a side chain, are more highly glycosylated. Ser 73, having 2 penultimate Gly residues, might also be expected to be highly glycosylated; however, it is only moderately glycosylated.
The reduced glycosylation of Ser 73 is consistent with the results of in vitro glycosylation studies that suggest the added flexibility of multiple Gly residues may reduce glycosylation (see Refs. 15 and 45). All of the single residue Thr sequences in Table III are highly O-glycosylated, and the ranking of Thr 52, Thr 37, and Thr 70 would follow the order of decreasing penultimate residue side chain size. Ser 43 and Thr 39 appear to be exceptions to the "rule" in that they contain 2 penultimate residues with side chains and are nevertheless very highly glycosylated. We suggest that other factors, such as the presence of Pro residues in their sequences, may be responsible for enhancing their glycosylation, as discussed below. The unexpectedly high glycosylation of Ser 43 may be rationalized in terms of the conformational effects of −1 Pro. Model building reveals that the Pro residue preceding Ser 43 can alter the peptide backbone conformation so that the −2 side chain would no longer be adjacent to the Ser side chain. The proposed conformational effects of −1 Pro may explain the elevated incidence of Pro at this position at known O-glycosylation sites (9, 10, 16, 46). Experimentally, the introduction of a −1 Pro has been shown to enhance the in vitro glycosylation of a human von Willebrand factor dodecapeptide (17) and the in vivo glycosylation of recombinant erythropoietin (47).
The glycosylation of Thr 39 is apparently enhanced by the presence of a +3 Pro. Increased occurrences of Pro at positions −3 and +3 are consistently observed at known O-glycosylation sites (9, 10, 16, 46), with prolines at +3 having the greatest predictive power over any other residue or position (9, 10). This is supported by the work of O'Connell and co-workers (16, 17), where the removal of a +3 Pro was found to reduce in vitro glycosylation. Furthermore, peptides with high Pro content, such as AcTPPP, are relatively good substrates for the transferase (12, 13, 48). Isolated Pro residues appear to be less prevalent at the −2 and +2 positions in PSM and in the glycosylation surveys of others (9, 10, 16, 45), again suggesting the importance of having small side chains at the +2 and −2 positions. For example, the peptide sequence APDTRPA, in the human mucin Muc1 tandem repeat, which contains +2 and −2 Pro residues, cannot be glycosylated in vitro (18, 19, 34). Likewise, the introduction of Pro at the −2 position relative to Ser 126 in recombinant erythropoietin inhibits its in vivo glycosylation (47).
The Ser/Thr residues in the hydroxyamino acid dyad sequences are typically highly and uniformly glycosylated, giving a combined average glycosylation of ~85% (Table IV). From the sequencing data, however, we can only estimate a range of values in which both residues of a dyad would be glycosylated; this value ranges from ~60 to ~95% depending on whether the initial glycosylation is inhibitory or stimulatory toward the second (Table IV). Consistent with the presence or absence of glycosylation-enhancing neighbors, Thr 79, Ser 32, and Ser 14 are less highly glycosylated than their dyad neighbors. The relatively high extent of Ser dyad glycosylation further suggests that the initial glycosylation of a residue in a dyad may enhance the glycosylation propensity of the second. Similar conclusions were drawn from the in vitro glycosylation studies of Wang et al. (14, 15), although the glycopeptide studies of Brockhausen and co-workers (45) suggest the reverse.
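The way a range for doubly glycosylated dyads arises from per-residue sequencing data can be illustrated with elementary probability bounds. The sketch below is only an assumption about how such a bracket might be computed, not necessarily the procedure used here: given the individual extents p1 and p2, the fraction of molecules carrying GalNAc on both residues must lie between max(0, p1 + p2 − 1) and min(p1, p2), with p1·p2 expected if the two glycosylation events are independent. The function name and the example extents (0.80 and 0.90) are hypothetical and are not taken from Table IV.

```python
def dyad_double_occupancy_bounds(p1, p2):
    """Bracket the fraction of molecules with BOTH dyad residues glycosylated.

    p1, p2 are the per-residue extents of glycosylation (fractions, 0 to 1).
    Returns (lower, independent, upper):
      lower       - glycosylation of one residue maximally disfavors the other
      independent - the two events do not influence each other
      upper       - glycosylation of one residue maximally favors the other
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("extents must be fractions between 0 and 1")
    lower = max(0.0, p1 + p2 - 1.0)
    independent = p1 * p2
    upper = min(p1, p2)
    return lower, independent, upper


if __name__ == "__main__":
    # Hypothetical extents for the two residues of a dyad.
    lo, ind, hi = dyad_double_occupancy_bounds(0.80, 0.90)
    print(f"both glycosylated: {lo:.2f} to {hi:.2f} (independent: {ind:.2f})")
```

An observed double occupancy near the lower bound would point to an inhibitory effect of the first GalNAc on the second site, and a value near the upper bound to a stimulatory effect; the sequencing data alone cannot distinguish between these cases.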
The PSM tandem repeat contains a triad sequence of serine residues, Ser 62-Ser 64 (Table IV). Ser 62, which lacks penultimate Gly or glycosylation-enhancing Pro residues, is relatively poorly glycosylated, and Ser 63 is only moderately glycosylated, perhaps due to its 2 penultimate Gly residues. Interestingly, Ser 64 is only moderately glycosylated despite having an enhancing +3 Pro.
Several studies show that charged residues appear to be relatively absent from O-glycosylation sites (9, 10, 16, 46). O'Connell and co-workers (17) and Nehrke and co-workers (20) have shown that the in vitro and in vivo transferase activity is highly sensitive to the specific placement of pairs of charged residues. In contrast, Thr 79, Ser 80, and Thr 39 in the PSM tandem repeat are highly glycosylated despite having a pair of charged residues (Arg and Glu) in their sequences (Tables III and IV). The high glycosylation of these sequences may stem from the presence of −3 and +3 Pro residues, or the transferase may tolerate the specific charge distribution of these peptides.
Model of the GalNAc Transferase Peptide Binding Site. From the above steric and structural arguments, a rough model of the transferase peptide binding site is proposed to help explain several features of the transferase and to lead to testable hypotheses (Fig. 8). As proposed, substrate peptide would bind in an extended β-like conformation. The UDP-GalNAc binding site would be in the vicinity of the Ser/Thr residue at position P0. This model is similar to that proposed by Elhammer and co-workers (9), who suggested that interactions between the substrate and transferase would occur at both the reactive Ser or Thr residue and through the binding of an extended β-like substrate peptide. These workers, however, found no sequence-specific side chain preference for residues surrounding the acceptor Ser or Thr, beyond the presence of ±3 Pro. Peptides with extended β- or β-turn conformations have also been suggested by others to be favorable substrate conformations for the transferase (10, 16, 46, 49).
In the model, side chains at penultimate positions relative to the central Ser or Thr are proposed to sterically interfere in some manner with glycosylation. Since all of the PSM tandem repeat glycosylation sites are at least partially glycosylated in vivo, the interactions of the penultimate side chains must not completely inhibit the transfer of α-GalNAc, but rather reduce the glycosylation efficiency by affecting substrate binding processes and/or GalNAc transfer. Pro (and perhaps the presence of specific patterns of charged residues) appears to further modulate transferase activity, with the prevalence of +3 Pro suggesting the possibility of a specific transferase binding site for Pro at this position. These latter effects would most likely operate at the level of substrate binding rather than at the GalNAc transfer step.
Because of the alternating nature of β-structured peptides, substrate would be expected to bind the active site in two possible orientations, each differing by the direction in which its side chains point. The ability to bind opposite sides of the peptide would readily allow for the glycosylation of dyads. To permit the glycosylation of the second residue and to maintain flexibility regarding the order of glycosylation, both the P−1 and P+1 sites on the transferase (see Fig. 8) would be expected to be large enough to accommodate a peptide-linked α-GalNAc residue. To further favor dyad glycosylation we suggest that these sites may weakly bind peptide-linked α-GalNAc residues.
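The geometric argument underlying the model can be made concrete with a short sketch. It assumes an idealized, fully extended β-strand in which side chains strictly alternate between the two faces of the backbone; the function names and the "same"/"opposite" labels are illustrative only and are not part of the original work. Under this assumption, residues at even offsets from the acceptor (P±2) share its face, whereas residues at odd offsets (P±1, P±3) point to the other face. The example uses the Muc1 sequence APDTRPA cited above, with Thr as the acceptor.

```python
def face_relative_to_acceptor(offset):
    """Return 'same' if the side chain at this offset from the acceptor lies
    on the acceptor's face of an idealized beta-strand, else 'opposite'."""
    return "same" if offset % 2 == 0 else "opposite"


def same_face_neighbors(sequence, acceptor_index, span=3):
    """List (offset, residue) pairs within +/- span of the acceptor whose
    side chains lie on the same face as the acceptor side chain."""
    neighbors = []
    for offset in range(-span, span + 1):
        i = acceptor_index + offset
        if offset != 0 and 0 <= i < len(sequence) and offset % 2 == 0:
            neighbors.append((offset, sequence[i]))
    return neighbors


if __name__ == "__main__":
    heptad = "APDTRPA"  # Muc1 tandem repeat segment; Thr at index 3 is P0
    for offset in range(-3, 4):
        residue = heptad[3 + offset]
        print(f"P{offset:+d} {residue}: {face_relative_to_acceptor(offset)}")
    print(same_face_neighbors(heptad, 3))
```

In this idealized picture the two Pro residues of APDTRPA fall at P−2 and P+2, on the same face as the acceptor Thr, which is the arrangement the model proposes to be unfavorable for glycosylation.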
Our model of the GalNAc transferase binding site is consistent with the kinetic analysis of Wragg and co-workers (48) that supports a 2-site model for the random, non-competitive binding of peptide substrate and UDP-GalNAc. These workers suggest that peptide substrate selection may be a 2-step process, involving the initial low specificity binding of peptide followed by a second process involving the specific interaction of the hydroxyamino acid with the transferase active site. Enhancements in the binding or lifetimes at either of these steps could result in apparent increases in transferase activity. We propose that this second binding/transfer step may explain the differences in Thr and Ser acceptor activities and the higher sensitivity of Ser to peptide sequence. Thr residues, having a methyl group, may possess enhanced binding and/or optimal substrate geometry for GalNAc transfer compared with Ser residues, which lack the methyl group. Therefore, for peptides with similar primary binding affinities, Thr acceptor peptides would be expected to be more efficiently glycosylated than Ser peptides.⁸

⁸ It is worth noting that for N-glycosylation the Asn-Xaa-Thr sequon is more efficiently glycosylated than the Asn-Xaa-Ser sequon (50), perhaps for the same reasons presented here.

FIG. 8. ... to P+3. The acceptor Thr residue will bind at site P0. Sites P−1 and P+1 are proposed to be rather large to accommodate glycosylated residues (site P−1) or residues with large side chains (P+1). Sites P−2 and P+2 are proposed to best accommodate residues with small or no side chains. Sites P−3 and P+3 may specifically bind Pro residues to account for the prevalence of Pro at these positions.