Inadequate Reference Datasets Biased toward Short Non-epitopes Confound B-cell Epitope Prediction

X-ray crystallography has shown that an antibody paratope typically binds 15–22 amino acids (aa) of an epitope, of which 2–5 randomly distributed amino acids contribute most of the binding energy. In contrast, researchers typically use short peptide antigens in antibody binding assays for B-cell epitope mapping. Furthermore, short 6–11-aa epitopes, and in particular non-epitopes, are over-represented in the published B-cell epitope datasets that are commonly used to develop approaches for B-cell epitope prediction from protein antigen sequences. We hypothesized that peptides of such suboptimal length bind antibodies weakly and cause false-negative results. We tested the influence of peptide antigen length on antibody binding by analyzing data on more than 900 peptides used for B-cell epitope mapping of immunodominant proteins of Chlamydia spp. We demonstrate that short 7–12-aa peptides of B-cell epitopes bind antibodies poorly; thus, epitope mapping with short peptide antigens falsely classifies many B-cell epitopes as non-epitopes. We also show a direct correlation between length of peptide antigens and antibody binding in published datasets of confirmed epitopes and non-epitopes. Elimination of short, ≤11-aa epitope/non-epitope sequences improved datasets for evaluation of in silico B-cell epitope prediction. Achieving up to 86% accuracy, protein disorder tendency is the best indicator of B-cell epitope regions for chlamydial and published datasets. For B-cell epitope prediction, the most effective approach is plotting disorder of protein sequences with the IUPred-L scale, followed by antibody reactivity testing of 16–30-aa peptides from peak regions. This strategy overcomes the well-known inaccuracy of in silico B-cell epitope prediction from primary protein sequences.

Knowledge of B-cell epitopes of proteins is essential in many fields of applied biomedical research, such as antibody diagnostics, therapeutics, and vaccines, as well as in basic research. Laboratory methods for identification of such epitopes are time-consuming and labor-intensive. Hence, any reduction in the need for discovery and confirmatory wet-lab research by epitope prediction algorithms is highly desirable. Among in silico methods that predict from primary sequence information, epitope prediction algorithms stand out for their lack of reliability (1). This underperformance prompted us to examine current approaches to B-cell epitope prediction by use of extensive data on epitopes and confirmed non-epitope regions of the Chlamydia spp. proteome, accumulated in research on chlamydial molecular serology (2).
Recent three-dimensional antibody-antigen complex studies (3-7) show that about 15-22 amino acid (aa) residues of an antigen are structurally involved in binding of epitopes to ~17 aa residues in antibody complementarity-determining regions (CDRs; paratopes). Among these 15-22 structural epitope residues, about 2-5 aa, termed functional residues, contribute most of the total binding energy to antibodies (6). Only in a very small fraction of B-cell epitopes are these functional residues closely spaced and embedded among the structural residues, representing the classical concept of continuous B-cell epitopes. In the vast majority (≥90%) of B-cell epitopes, functional as well as structural residues are randomly distributed within 15-150-aa linear antigen sequences, essentially representing discontinuous epitopes. Thus, a peptide antigen can effectively bind an antibody only if it contains the majority of the functional residues, and only a small fraction of short peptides of 4-11 aa will contain sufficient functional residues for high affinity binding (6). Therefore, short peptide targets in B-cell epitope mapping and prediction may represent an inherent, unsolvable conundrum, because most of these short peptides, even from proven dominant epitope regions, will fail to bind antibodies strongly and will therefore give many false-negative (non-epitope) results.
Mammalian immune systems can be forced to generate antibodies against virtually any molecule, regardless of antigen origin, by using excessive amounts of adjuvants and antigens. However, the antibody response evolved in response to infections that generate much lower antigen exposure; thus, antibodies may be preferentially directed toward proteins and peptide regions with certain biological, structural, and physicochemical properties that determine optimal epitopes. Antibody formation during an immune response to any given epitope is inherently stochastic due to the random availability of a cognate B-cell receptor within the large pool of circulating B-cells, all with different B-cell receptors generated by recombination of the immunoglobulin gene (8). Another level of stochasticity in the antibody response to any given protein is the exposure of a protein to the immune system. Wang et al. (9) report that only 4.2% of about 900 Chlamydia trachomatis (Ctr) proteins induce natural antibody responses in ≥40% of human hosts. Therefore, any peptide of the remaining 95.8% non-immunodominant proteins is unlikely to elicit antibodies, regardless of its B-cell epitope properties. Hence, for accurate evaluation of epitope prediction methods, epitope/non-epitope data should be derived from testing of known immunodominant proteins, with multiple rather than single sera to account for the stochasticity of the antibody response.
B-cell epitope prediction was first based on various properties of individual amino acids such as hydrophilicity, hydrophobicity, solvent accessibility, flexibility, or β-turn propensity, and combinations thereof (10-16). However, even the best combinations of aa propensity scales performed only marginally better than random sequence selection (1). With the availability of B-cell epitope databases, antigenicity scales (17, 18) and machine learning approaches (19-24) have been attempted, and improved prediction accuracy has been reported. Nevertheless, due to epitope redundancy (20), the predictive power may have been overestimated because these algorithms performed poorly on independent data (24). Therefore, B-cell epitope prediction algorithms must be evaluated on independent datasets that were not used to train or develop the algorithms/scales.
The ever increasing number of solved three-dimensional protein structures has allowed the development and testing of numerous complex algorithms for prediction of physicochemical and structural properties of proteins directly from the aa sequences. Among these properties (scales), disorder tendency describes protein regions without defined three-dimensional structure that are inherently flexible, hydrophilic, solvent accessible, and thermally mobile (high B-factor) (25,26). Incidentally, all of these properties are shared with B-cell epitopes (16); thus, protein disorder tendency is a prime candidate scale for B-cell epitope prediction due to its multifaceted properties (27).
This investigation is an extension of a comprehensive study that identified immunodominant B-cell epitopes of Chlamydia spp. (2). After encountering numerous failures of in silico B-cell epitope prediction, we used the first principles established above to analyze the shortcomings of B-cell prediction methodology. Using pools of hyperimmune mouse sera, we determined epitope/non-epitope regions of immunodominant Chlamydia spp. proteins by use of long 16-40-aa peptide antigens. These data created epitope/non-epitope datasets for accurate testing of numerous B-cell epitope prediction or aa/protein property algorithms/scales (henceforth termed scales). Subsequent testing revealed that public datasets were biased toward short epitope/non-epitope antigens, and removal of these short antigens dramatically increased prediction accuracy of most scales. We show that in general machine learning methods cannot predict epitopes with high accuracy; rather, many scales designed for prediction of protein properties, particularly disorder tendency, identify B-cell epitopes with better accuracy.

B-cell Epitope Peptide Reactivity with Anti-chlamydial Hyperimmune Sera-Hyperimmune sera were raised in mice as described previously (2). Briefly, 9-50 mice were challenged three times with high but non-lethal intranasal chlamydial inocula, to mimic antibody production following natural infections. Bovine sera were obtained from animals with PCR-confirmed natural Chlamydia spp. infection (2). Peptide antigens were chemically synthesized with N-terminal biotin, captured onto streptavidin-coated microtiter plates, and incubated with hyperimmune sera. Primary antibodies were detected with horseradish peroxidase-conjugated secondary antibodies in chemiluminescent ELISA, and data were expressed as relative light units/s (rlu/s) and, for ease of display, were divided by 1,000 (rlu/s × 10⁻³) (2). All peptides were analyzed on white microtiter plates by use of specific positive pooled hyperimmune sera as well as negative control sera in wells coated with specific peptides and in a non-coated well. For the final background-corrected results, 150% of the background signal (mean + 2 S.D.) in the non-coated well of each serum was subtracted from its specific peptide signals. To avoid false-positive results in quantitative evaluation of the reactivity of any peptide with individual mouse sera, and with bovine sera from naturally infected cattle, we used a more stringent cutoff of 10,000 rlu/s. Overall methods are described in detail by Rahman et al. (2).
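The background correction described above can be sketched in a few lines. This is only an illustration under stated assumptions: "150% of the background signal (mean + 2 S.D.)" is read as 1.5 × (mean + 2 S.D.) of replicate readings from the non-coated well, negative corrected values are clamped to zero, and the cutoff is applied to the corrected signal. The function names are hypothetical and are not taken from Rahman et al. (2).

```python
from statistics import mean, stdev

def background_threshold(blank_signals, factor=1.5):
    """Return factor x (mean + 2 S.D.) of non-coated-well signals in rlu/s.

    Assumes replicate blank readings and the sample standard deviation.
    """
    return factor * (mean(blank_signals) + 2 * stdev(blank_signals))

def corrected_signal(peptide_signal, blank_signals):
    """Subtract the background threshold; clamp negatives to 0 (assumption)."""
    return max(0.0, peptide_signal - background_threshold(blank_signals))

def is_reactive(peptide_signal, blank_signals, cutoff=10_000):
    """Apply the stringent 10,000 rlu/s cutoff to the corrected signal."""
    return corrected_signal(peptide_signal, blank_signals) >= cutoff
```

For blank readings of 1,000, 1,200, and 800 rlu/s, the threshold under these assumptions is 1.5 × (1,000 + 2 × 200) = 2,100 rlu/s.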
B-cell Epitope/Non-epitope Datasets-These datasets are described briefly below, and a detailed description is provided in supplemental Table S1, and sequences are provided in the supplemental Appendix.
Concatenated Epitope/Non-epitope Virtual Proteins-Epitopes and non-epitopes of the Lbtope_Fixed_non_redundant and Lbtope_Confirm datasets (24) were grouped by sequence length. Concatenated polyproteins of the sequences of each group were constructed by randomly combining all sequences. Similarly, concatenated polyproteins of Swiss-Prot sequences were constructed by randomly combining all sequences of the BCP12, BCP14, or BCP18 datasets (20).
Concatenated Virtual Proteins of 50-Amino Acid-extended Epitopes/Non-epitopes Embedded in Random Sequences-All 16-20-aa epitopes of the Lbtope_Confirm and fbcpred.pos.nr80 datasets were extended symmetrically to 50 aa with source protein sequences. These fragments were interspersed with random 150-aa Swiss-Prot sequences into a concatenated virtual polyprotein. Similar polyproteins were assembled from all 16-33-aa non-epitopes of the Lbtope_Confirm dataset and all epitopes/non-epitopes of the Chl_18Prot and Chl_43Prot datasets. For assembly of additional non-epitope datasets, random 50-aa peptides of the Bcpreds epitope source proteins (Bcpreds_Prot), Swiss-Prot proteins, and the C. trachomatis proteome (28) were similarly linked with 150-aa interspacing sequences into concatenated polyproteins.
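The assembly of such a virtual polyprotein can be illustrated as follows. The sketch makes several assumptions that the text does not specify: epitopes near a protein terminus are shifted so that the fragment still has 50 residues, spacers are uniform random residues standing in for random 150-aa Swiss-Prot sequence, and all function names are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def extend_to_length(protein, start, end, target=50):
    """Extend protein[start:end] symmetrically to `target` residues.

    Near a terminus the window is shifted inward (assumption), and a protein
    shorter than `target` is returned whole.
    """
    extra = target - (end - start)
    left = max(0, start - extra // 2)
    right = min(len(protein), left + target)
    left = max(0, right - target)  # shift back if the C terminus was hit
    return protein[left:right]

def random_spacer(rng, length=150):
    """Stand-in spacer of uniform random residues (not real Swiss-Prot data)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))

def build_polyprotein(fragments, spacers, rng):
    """Shuffle the fragments and follow each one with a spacer sequence."""
    order = fragments[:]
    rng.shuffle(order)
    parts = []
    for fragment, spacer in zip(order, spacers):
        parts.extend([fragment, spacer])
    return "".join(parts)
```

A seeded `random.Random` instance passed as `rng` keeps the concatenation order reproducible.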
B-cell Epitope/Non-epitope Annotation of Individual Chlamydia spp. Proteins-All residues of 18 immunodominant proteins of Chlamydia spp. were annotated in the Chl_18Prot dataset as Pos (positive, epitope), Neg (negative, non-epitope), or NT (not tested, unknown epitope status). The annotation is based on the reactivity of 16-20-aa peptide antigens with murine and/or bovine sera. For peptide datasets, 10-aa-spaced peptides of these proteins were used.
Computation of Amino Acid Residue Scores for Physicochemical, Structural, and Evolutionary Protein Properties-Website-based freeware algorithms/scales (10-15, 17-20, 22-24, 29-55) for protein properties were used to calculate individual residue scores for aa sequences of individual proteins or polyproteins. Moving window scores were assigned to the center residue of the particular window. When required, missing scores for N- and C-terminal residues were inserted using scores of the adjacent residues. If algorithms/scales did not provide an output score for internal residues/windows, the minimum score of this protein was inserted. The polymorphism score was calculated by inverting the multiple sequence conservation score of the AACon algorithm in the Jalview freeware (35).
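The moving-window and gap-filling rules above can be sketched as follows. This is a minimal illustration, assuming an odd window size and an unweighted mean; the individual web tools may weight window positions differently, and the function names are hypothetical.

```python
def window_scores(residue_values, window=9):
    """Assign the mean of each full window to its center residue.

    Positions where a full window does not fit are left as None.
    Assumes an odd window size and an unweighted mean.
    """
    half = window // 2
    n = len(residue_values)
    scores = [None] * n
    for i in range(half, n - half):
        scores[i] = sum(residue_values[i - half:i + half + 1]) / window
    return scores

def fill_missing(scores):
    """Fill terminal gaps with the adjacent score and internal gaps with the minimum."""
    known = [s for s in scores if s is not None]
    if not known:
        return scores[:]
    first = next(i for i, s in enumerate(scores) if s is not None)
    last = max(i for i, s in enumerate(scores) if s is not None)
    filled = []
    for i, s in enumerate(scores):
        if s is not None:
            filled.append(s)
        elif i < first:
            filled.append(scores[first])   # N-terminal gap
        elif i > last:
            filled.append(scores[last])    # C-terminal gap
        else:
            filled.append(min(known))      # internal gap
    return filled
```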

Comparison of Receiver Operating Characteristic (ROC) Curves of Protein Property Scales for B-cell Epitope Prediction-Bimodal epitope/non-epitope classification was achieved by F test classification based on the linear predictor variable in discriminant analysis with the software package JMP Pro 11 (SAS Institute Inc., Cary, NC). This software was also used to construct ROC curves and to calculate the area under the ROC curve (AUC) for ranking of protein property scales for B-cell epitope prediction. Data formatting was performed in Microsoft Office Excel 2013, and all additional statistical analyses were performed with the Statistica 7.1 software package (Statsoft, Tulsa, OK). Differences between means of peptide reactivities and/or background were analyzed by one-tailed paired Student's t test, and p values ≤ 0.05 were considered significant. The significance of differences between B-cell epitope prediction scales was tested by one-tailed paired Student's t tests of AUCs of ROC analyses. If multiple independent test datasets were available, the mean AUC values of these scales for these datasets were compared. If multiple proteins of a single dataset were available, the mean AUC values for these proteins were compared. If ROC curves for different B-cell epitope prediction scales were analyzed for a single dataset, specificity was sampled in 0.05 increments for sensitivities from 0.05 to 0.85 and in 0.02 increments from 0.90 to 0.98, and the specificity means were compared, or the mean accuracy values at 40, 60, 80, 90, and 95% sensitivity were compared.
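For readers who want to reproduce the AUC ranking without JMP, the area under the ROC curve can be computed directly from residue scores and epitope labels. The sketch below uses the rank-sum (Mann-Whitney) identity, in which the AUC equals the probability that a randomly chosen epitope residue scores higher than a randomly chosen non-epitope residue. It is a generic illustration, not the JMP procedure used in this study, and the function name is hypothetical.

```python
def roc_auc(scores, labels):
    """AUC from the Mann-Whitney identity; ties between classes count as 0.5.

    labels: 1 for epitope residues, 0 for non-epitope residues.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one epitope and one non-epitope score")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of residues; for large datasets a rank-based implementation would be preferable.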

Antibody Binding Increases with Length of Peptide Antigens-Within a peptide B-cell epitope, 15-22 residues are typically structurally involved in antibody binding (3-7). Sivalingam and Shepherd (6) reasoned that clustering or random distribution of the structural residues would determine the length of peptide antigens required for antibody binding. In this study, we tested length-dependent peptide antigen reactivity for previously identified epitope regions of the chlamydial outer membrane protein A, OmpA (2). Peptide antigens of 7-12 aa invariably produced lower ELISA signals than longer ones (Fig. 1A). Occasionally, extensive elongation of peptide antigens may mask structural residues and reduce the signal relative to a slightly shorter peptide antigen of optimum length (Fig. 1A; 32- versus 24-aa Cpe peptides).
To quantify the effect of peptide length on antibody binding, peptide antigens of different lengths from 17 epitope regions of OmpA and inclusion membrane A (IncA) proteins of Chlamydia spp. were tested (Fig. 1B). Compared with short 7-12-aa peptides, intermediate 16-aa peptides produced 1.8-fold ELISA signal intensity, and long 24-40-aa peptides produced 4.1-fold signal intensity (p value <10⁻², one-tailed Student's t test, relative log-transformed signal). Importantly, the main 14.3-fold reactivity increase was achieved by elongation of the 13 lowest reactive short peptides from 7-12 to 24-40 aa (p value <10⁻⁴), whereas elongation of the 13 lowest reactive intermediate 16-aa peptides to 24-40 aa produced a moderate 3.3-fold increase (p value <10⁻⁴).
To identify optimum peptide antigen lengths, we tested the central 16- and 24-40-aa peptide antigens of 55 unique epitopes on 28 Chlamydia spp. proteins with pooled mouse sera (Fig. 1C). As observed before, ~20% of elongated peptides produced a reduced signal, presumably due to epitope masking. However, long 24-40-aa peptides produced on average a 2.1-fold higher signal than the corresponding 16-aa peptides (p value <10⁻⁴). The reactivity of the 28 lowest reactive 16-aa peptides increased 9.1-fold for the respective long 24-40-aa peptides (p value <10⁻⁴). To confirm the host independence of length-dependent peptide reactivity, another set of chlamydial peptides yielded equivalent results with bovine sera (Fig. 1D).
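Because the length comparisons were tested on log-transformed signals, one natural way to summarize paired short and long peptides on that scale is a geometric mean fold change. The sketch below assumes this summary and strictly positive signals; the text does not state how the reported fold values were averaged, so both the choice of statistic and the function name are assumptions.

```python
import math

def geometric_mean_fold(short_signals, long_signals):
    """Geometric mean of paired long/short signal ratios.

    Equivalent to averaging the log10 ratios and back-transforming.
    Assumes all signals are positive and the two lists are paired.
    """
    log_ratios = [math.log10(long_ / short)
                  for short, long_ in zip(short_signals, long_signals)]
    return 10 ** (sum(log_ratios) / len(log_ratios))
```

For two peptide pairs with 2-fold and 8-fold increases, this returns a 4-fold mean increase, whereas the arithmetic mean of the ratios would be 5-fold.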
Evaluation of B-cell Epitope Prediction Algorithms Confounded by Over-representation of Short False-negative Epitopes in Public Datasets-Many investigators typically use short peptide antigens of 4-11 aa for epitope mapping, with results added to public reference databases that are used for the development of B-cell epitope prediction algorithms/scales (10-24). These datasets may therefore be biased toward short epitopes and contain many false-negative epitope determinations due to the marginal antibody binding of short peptides. This may explain the poor, close to random, epitope prediction accuracy (1) that most epitope prediction scales show in practical application, even if they scored highly in evaluation with public datasets. We hypothesized that removing short epitope/non-epitope sequences from public datasets would allow correct performance ranking of B-cell epitope prediction scales. To test this hypothesis, we used the "Lbtope_Variable_non_redundant" dataset with 8,011 B-cell epitopes and 10,868 non-epitopes, retrieved by Singh et al. (24) from experimentally validated epitopes as well as non-epitopes from the Immune Epitope Database (IEDB). Importantly, 5-10-aa non-epitopes constitute ~50% and 5-16-aa non-epitopes constitute ~80% of all non-epitopes deposited in the parent IEDB (24). Among the 6-11-, 12-15-, 16-20-, and 21-30-aa sequences, the Lbtope_Variable_non_redundant dataset contains 2.12×, 1.49×, 0.93×, and 0.54× as many non-epitopes as epitopes. These data indicate that short non-epitope sequences are over-represented in the public knowledge base. For the analysis shown in Fig. 2, epitopes and non-epitopes were grouped by length, and all sequences of each length group were randomly concatenated into a single virtual protein. Hydrophilicity, a parameter used in B-cell epitope prediction, was predicted for all consecutive non-overlapping 20-aa peptide windows of each concatenate by use of the Parker ProtScale (11) in ExPASy (29). Results in Fig. 2A show that in the 12-15-, 16-20-, and 21-30-aa length concatenates, the hydrophilicity scores of epitope and non-epitope virtual proteins differ highly significantly (p value <10⁻⁶, Student's t test). In contrast, the hydrophilicity of epitope and non-epitope virtual proteins is not different for the 6-11-aa length concatenates (p value = 0.052). Thus, the hydrophilicity scale discriminates between epitopes and non-epitopes for peptides longer than 11 aa but not for shorter peptides.
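The length grouping and windowing used in this analysis can be illustrated with a short sketch. It bins sequences by the four length classes named above, computes the ratio of non-epitopes to epitopes per class, and cuts a concatenate into consecutive non-overlapping 20-aa windows. Function names are hypothetical, and discarding an incomplete final window is an assumption.

```python
LENGTH_BINS = [(6, 11), (12, 15), (16, 20), (21, 30)]

def length_bin(sequence):
    """Return the (low, high) length class of a sequence, or None if outside all bins."""
    n = len(sequence)
    for low, high in LENGTH_BINS:
        if low <= n <= high:
            return (low, high)
    return None

def non_epitope_ratio(epitopes, non_epitopes):
    """Ratio of non-epitope to epitope counts per length class (None if no epitopes)."""
    ratios = {}
    for length_class in LENGTH_BINS:
        n_pos = sum(1 for s in epitopes if length_bin(s) == length_class)
        n_neg = sum(1 for s in non_epitopes if length_bin(s) == length_class)
        ratios[length_class] = n_neg / n_pos if n_pos else None
    return ratios

def non_overlapping_windows(sequence, size=20):
    """Consecutive non-overlapping windows; an incomplete final window is dropped."""
    return [sequence[i:i + size] for i in range(0, len(sequence) - size + 1, size)]
```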
Similarly, with values of ~0.50 for the AUC of the ROC curve, additional scales in Fig. 2B show random distribution of epitope versus non-epitope prediction for the 6-11-aa concatenates. In contrast, these scales show highly significantly increased prediction accuracy for concatenates of peptides longer than 11 amino acids, indicating significant discrimination between epitopes and non-epitopes (Fig. 2B). Analysis of the "Lbtope_Confirm" subset, composed of epitopes/non-epitopes that were at least twice independently experimentally validated, confirmed the result of the Lbtope_Variable_non_redundant dataset (Fig. 2C). An additional finding is that the 16-20-aa non-epitopes do not have a significantly higher score than the random Swiss-Prot peptides, although the 6-11- and 12-15-aa non-epitopes do (Fig. 2C). This suggests that the Lbtope_Confirm dataset may have a high frequency of incorrect identification of short 6-15-aa peptides, particularly 6-11-aa peptides, as non-epitopes.

FIGURE 2. B-cell epitope prediction score and performance in dependence on peptide length. A, hydrophilicity scores of epitopes and non-epitopes are grouped by length in the Lbtope_Variable_non_redundant dataset (24). Hydrophilicity (Parker) (11) scores were obtained by using default settings in the ProtScale tools of the ExPASy server (29). Length-dependent hydrophilicity (±95% CI) of epitopes and non-epitopes is shown in green and red, respectively, and the p values for differences are shown in green. B, epitope length-dependent prediction performance (area under receiver operating characteristic curve) of different prediction scales in the Lbtope_Variable_non_redundant dataset. ***, p value <10⁻⁶ for comparison of any scale to any other scale of 6-11-aa epitopes versus longer epitopes. Hydrophobicity (Miyazawa, 30), a ProtScale (29) for hydrophobicity; Bepipred, a hidden Markov model combined with the Parker hydrophilicity scale (19); IUPred-L, an algorithm for protein disorder tendency (31). C, all 6-20-aa epitopes and non-epitopes of the Lbtope_Confirm dataset (24) grouped into 6-11-, 12-15-, and 16-20-aa peptides are compared with Swiss-Prot 12-, 14-, and 18-aa peptides of the Bcpreds BCP12, BCP14, and BCP18 datasets (20). The p value for hydrophilicity score differences between epitopes and non-epitopes is shown in green and between non-epitopes and Swiss-Prot random peptides in blue. All epitopes have higher hydrophilicity scores than Swiss-Prot random peptides (p value ≤ 10⁻⁵).

Evaluation Accuracy for B-cell Epitope Prediction Depends on Epitope/Non-epitope Discrimination in Test Datasets-Ideal algorithms/scales for prediction of B-cell epitopes should discriminate known epitopes from experimentally validated non-epitopes and identify epitopes within complete source proteins and proteomes. Since prediction accuracy should ideally be validated with multiple datasets, we evaluated the prediction performance by use of four positive datasets of experimentally validated B-cell epitopes and five negative datasets of experimentally validated non-epitopes or of random peptides from proteomes (Table 1). All 16-20-aa epitope/non-epitope sequences of the datasets were centered within their 50-aa source protein sequences, and these 50-aa sequences were randomly concatenated into a single virtual protein, interspersed with random 150-aa sequences from the Swiss-Prot database (supplemental Table S1 and Appendix). For evaluation of B-cell epitope prediction, we used the original score of each algorithm with default settings for each amino acid residue; thus, each epitope/non-epitope sequence received individual scores for the 20 central residues. Correct or incorrect epitope prediction of all residues was evaluated by AUC in ROC analysis.
In Table 1, the column for each scale indicates epitope versus non-epitope discrimination (AUC in ROC analysis) of the scale for each compared combination of positive and negative datasets. The average AUC column, next to the rightmost column, indicates epitope/non-epitope discrimination averaged over all tested scales and therefore ranks the combined discrimination in both positive and negative datasets. The four positive datasets can be ranked by their AUC in comparison with the negative Swiss-Prot or Ctr-Proteome datasets, clearly showing significantly higher discrimination for both chlamydial datasets than for the fBcpreds and Lbtope_Confirm datasets (p value ≤0.02, one-tailed paired Student's t test).
The average AUC row, next to the bottom row in Table 1, indicates epitope/non-epitope discrimination averaged over the 12 tested pairs of datasets and shows that disorder tendency discriminated best of all algorithms tested (AUC = 0.75 for IUPred-L [31]), highly significantly better than Bepipred (AUC = 0.70), the next-best algorithm in the 12 AUC comparisons of positive and negative datasets (p value <10⁻³, one-tailed paired Student's t test). Since the machine learning Lbtope algorithm was trained on the Lbtope_Confirm dataset, it performed extremely well in this dataset (AUC = 0.97, 0.84, and 0.81), but very poorly, close to random, in all other datasets (average AUC = 0.57). In contrast, the protein disorder scale IUPred-L consistently discriminated best (Table 1), with AUCs depending on the datasets (0.58-0.88).

B-cell epitope prediction accuracy (AUC of ROC curves) in dependence on the evaluation dataset
a Datasets with experimentally identified epitopes/non-epitopes or random peptides are shown (see supplemental Table S1). b AUC data are shown for the best performing scale of each category as defined for all scales tested (see supplemental Tables S2 and S3). c Antigenicity scale (17), IEDB tool for antibody epitope prediction. d Support vector machine model (SVM) trained on the Lbtope_Confirm dataset (24). e Surface accessibility scale (13), IEDB tool for epitope prediction. f β-turn scale (32), IEDB tool for epitope prediction. g Average of seven propensity scales for epitope prediction (15). h Flexibility scale (12), IEDB tool for epitope prediction. i Quality of scales or datasets was ranked by AUC, with rank number determined by 1 for the highest AUC and addition of 1 for each 0.01 AUC reduction; antigenicity and Lbtope scales were excluded from ranking because antigenicity is a negative predictor and Lbtope was trained on the analysis dataset. j Not applicable for quality ranking because the datasets served to train the Lbtope support vector machine model. k Highest AUC in the compared dataset.
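The quality ranking used for Table 1 (rank 1 for the highest AUC, plus 1 for each 0.01 AUC reduction) can be written as a one-line rule. The sketch below assumes that AUC differences are rounded to the nearest 0.01 before ranking; the function name is hypothetical, and the example values are the average AUCs quoted in the text.

```python
def auc_rank(auc_values):
    """Rank scales by AUC: 1 for the best, +1 for each 0.01 below the best.

    auc_values maps a scale name to its AUC. Differences are rounded to the
    nearest 0.01 (assumption) to absorb floating-point noise.
    """
    best = max(auc_values.values())
    return {name: 1 + round((best - auc) * 100)
            for name, auc in auc_values.items()}

ranks = auc_rank({"IUPred-L": 0.75, "Bepipred": 0.70})
```

With these two averages, IUPred-L receives rank 1 and Bepipred rank 6.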
Wide Amino Acid Context and Standardized Scoring Maximize B-cell Epitope Prediction Accuracy-Most amino acid propensity and B-cell prediction scales score the context-dependent epitope likelihood for each amino acid residue by averaging the adjacent ±4 residues (10-19). In contrast, protein disorder prediction operates in a wider sequence context (31). To simulate the wider sequence context, we asked whether scores for long peptides improved prediction accuracy of narrow-context scales and, if so, how such an improvement depended on peptide length. In Table 2, we determined B-cell epitope prediction accuracy for single scores for the central 1-aa epitope/non-epitope residue and contrasted it to prediction by single average scores for the central 10-, 15-, 20-, 25-, or 30-aa residues. The mean AUC values of all 12 comparisons of four to five non-epitope datasets (as in Table 1) are shown in Table 2. The results indicate optimal B-cell epitope prediction for 20-30-aa peptide scores. Long peptide scoring improves prediction substantially for narrow-context scales such as hydrophilicity, hydrophobicity, flexibility, or Bepipred but not for wide-context protein disorder tendency scales such as IUPred-L and VSL2B.

Table 2 shows prediction of epitopes/non-epitopes embedded in virtual concatenated polyproteins. In this approach, differences between the widely divergent average scores of source proteins cannot be offset by standardization (i.e. set to mean = 0 and S.D. = 1). To eliminate B-cell prediction bias induced by selection of epitope source proteins, we analyzed scores standardized for each protein of the 18 chlamydial protein datasets (Chl_18Prot; supplemental Table S1). Table 3 shows the results for comparison of experimentally validated epitopes of these proteins to experimentally validated non-epitopes (Pos versus Neg) or the total remaining protein (Pos versus Neg + NT). Standardization substantially improves B-cell epitope prediction accuracy for disorder and solvent accessibility scales, but not for individual amino acid propensity scales such as hydrophilicity or hydrophobicity. Similar to results for concatenated polyproteins (Table 2), standardized scores of long peptides improve performance of narrow-context but not of wide-context scales (Table 3).
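Per-protein standardization and long-peptide scoring can be sketched as follows. The illustration assumes the population standard deviation for the z-score and truncation of the averaging window at protein termini; neither detail is stated in the text, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def standardize(scores):
    """Rescale the residue scores of one protein to mean 0 and S.D. 1.

    Uses the population S.D. (assumption); a constant profile maps to zeros.
    """
    mu = mean(scores)
    sd = pstdev(scores)
    if sd == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sd for s in scores]

def central_average(scores, center, width=25):
    """Mean over up to `width` residues centered on index `center`.

    The window is truncated at the protein termini (assumption).
    """
    half = width // 2
    low = max(0, center - half)
    high = min(len(scores), center + half + 1)
    return mean(scores[low:high])
```

Applying `standardize` to each protein separately before averaging removes the differences in mean score between source proteins that the concatenated polyproteins cannot correct for.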
The polymorphism scale for the 18 chlamydial protein dataset was derived by inverting the Jalview AACon conservation score (35) calculated from multiple sequence alignments of these proteins with the available homologous Chlamydia sequences. Thus, it is completely independent of individual amino acid properties and quantifies only evolutionary sequence change at each residue. Standardization of polymorphism scores in Table 3 improves prediction because chlamydial proteins have widely divergent rates of evolution (2). Importantly, averaging over 25-aa residues again provides maximum prediction accuracy, suggesting that wide-context properties in general are the best predictors of B-cell epitopes from their primary amino acid sequences.
Protein Disorder Most Accurately Predicts B-cell Epitopes-For epitope/non-epitope discrimination in the ROC curve in Fig. 3A, sensitivity at given specificity (or specificity at given sensitivity) of the IUPred-L disorder or the combined scale is higher than that of Bepipred (19) or LBTope (24) (p value from the maximum (bold red font) are shown in red font. b A single score of the central residue was considered for each peptide in the datasets shown in Table 1. c Scores of the central 10/15/20/25/30aa residues were averaged to a single peptide score. d Comparison of 1-aa (single residue) versus 10-aa peptide scoring. e Comparison of 10-aa versus 25-aa peptide scoring. F/f AUC significantly higher for 10-aa peptide scoring than for 1-aa scoring ( F , 10 Ϫ6 Յ p value Ͻ 10 Ϫ3 ; 10 Ϫ3 Յ p value Յ 10 Ϫ2 ; one tailed paired Student's t-test with 12 AUC values). G/g AUC significantly higher for 25-aa peptide scoring than for 10-aa peptide scoring ( G , 10 Ϫ4 Յ p value Ͻ10 Ϫ3 ; 10 -3 Յ P-value Յ 0.023).
<10⁻⁴, one-tailed paired Student's t test). Similarly, when epitopes were discriminated from the complete remaining proteins in Fig. 3B, the IUPred-L scale performed significantly better than Bepipred or LBTope. In final testing of the overall prediction approach applied to the 18 individual proteins of the Chl_18Prot dataset, the IUPred-L scale also best discriminated individual epitope residues from the whole remaining protein (average AUC of IUPred-L = 0.91, minimum = 0.74, maximum = 1.00, S.D. = 0.08; Table 4).

Standardization of individual protein scores improves B-cell epitope prediction
References 11, 19, 24, 30, 31, and 33–35 are cited in the table. Pos, positive (epitopes); Neg, negative (non-epitopes); NT, not tested (epitope or non-epitope status is unknown). Pos versus Neg indicates that epitopes were compared with non-epitopes; Pos versus Neg+NT indicates that epitopes were compared with the total remaining protein. Average AUC values that differ by <0.01 from the maximum (bold red font) are shown in red font.
a The Chl_18Prot dataset was analyzed.
b Solvent accessibility (ASA_Spine-X) is residue solvent accessibility (34); polymorphism is sequence divergence in multiple sequence alignment, calculated by inverting the conservation score of AACon in the Jalview freeware (35).
c Original non-standardized score for the central 1-aa residue in the peptides. These scores were obtained with individual protein sequences as input.
d Original scores were standardized (mean = 0 and S.D. = 1) for each of the 18 chlamydial proteins, and the standardized score for the central 1-aa residue in the peptides is shown.
e Difference in AUC values between standardized and non-standardized scores.
f Sensitivity at a given specificity is significantly higher in ROC curves for standardized versus non-standardized scores (f, 10⁻⁶ ≤ p value ≤ 0.01; one-tailed paired Student's t test).
g Peptide scores were calculated using the average of standardized scores for the central 5-, 9-, 17-, 25-, 33-, 41-, or 49-aa residues.
h Difference in AUC values between standardized scores of 25- and 9-aa peptides.
i Sensitivity at a given specificity is significantly higher in ROC curves for 25-aa versus 9-aa standardized scores (i, 10⁻⁶ ≤ p value ≤ 0.01; one-tailed paired Student's t test).

In Tables 1 and 2, most epitope/non-epitope sequences were derived from public datasets of variable and largely unknown discrimination accuracy. For maximum accuracy, we therefore selected the 18 chlamydial protein datasets (Chl_18Prot; supplemental Table S1) with extensively validated epitopes as well as non-epitopes on each protein, all identified in a single investigation (2). For the Chl_18Prot dataset, 151 standardized primary scales for B-cell epitope prediction were evaluated (supplemental Tables S2 and S3). To improve B-cell epitope prediction, investigators frequently combine scales (16). To test this concept, we evaluated 126 combined scales that were derived by linear combination of 2–14 standardized primary scales (Fig. 4 and supplemental Tables S2 and S3). In Fig. 4, we asked whether the combined scales, derived from 25-aa moving averages of the primary scales, improve B-cell epitope prediction. Results show that the combination of scales only incrementally improves B-cell epitope prediction (Fig. 4). The best combination of the primary scales provides only a 3.8% improvement of prediction accuracy over IUPred-L protein disorder in five tests at 40, 60, 80, 90, and 95% sensitivities (Fig. 4C, p value <0.049, paired Student's t test, with five accuracy values). Collectively, the dominant conclusion is that the main improvement for B-cell epitope prediction comes from using the optimal IUPred-L primary scale (Table 3, supplemental Tables S2 and S3, and Fig. 4).
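The two operations underlying these comparisons, per-protein standardization of a scale and linear combination of standardized scales, can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names are hypothetical, equal weights are assumed for the combination, the population standard deviation is assumed for standardization, and the AUC is computed as the rank-based probability that an epitope residue outscores a non-epitope residue.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Rescale one protein's per-residue scores to mean 0 and S.D. 1.

    Assumption: population S.D.; the source does not state which estimator was used.
    """
    mu, sd = mean(scores), pstdev(scores)
    if sd == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sd for s in scores]


def combine_scales(scales, weights=None):
    """Linear combination of standardized per-residue scales.

    Assumption: equal weights (arithmetic mean) unless weights are given.
    """
    if weights is None:
        weights = [1.0 / len(scales)] * len(scales)
    standardized = [standardize(s) for s in scales]
    n_residues = len(scales[0])
    return [sum(w * z[i] for w, z in zip(weights, standardized))
            for i in range(n_residues)]


def auc(pos, neg):
    """Probability that a random epitope score exceeds a random non-epitope score
    (ties count one half); equivalent to the area under the ROC curve."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because standardization is applied per protein, scores from proteins with different baseline levels become comparable before they are pooled or averaged.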
Underperformance of Machine-learning B-cell Prediction Algorithms-In evaluation of B-cell epitope prediction algorithms, scores of most physicochemical, structural, and evolutionary protein properties are higher than those of machine learning algorithms (Fig. 5). In addition, the discrimination power of all scales is higher when epitopes are tested against the remaining protein than against experimentally validated non-epitopes (Table 3 and Fig. 5A). As a consequence, the prediction performance against the remaining total protein sequences is also consistently higher for all scales. An explanation for this counterintuitive observation is that non-epitopes had initially been selected as candidate epitopes by high scores in prediction scales (Fig. 2) but failed to react with antibodies. The higher scores for tested non-epitopes thus introduced a pre-selection bias that makes evaluation of B-cell epitope prediction scales more difficult. Fig. 5B compares B-cell epitope prediction scales that were among the best combinations of the primary scales in our study with several publicly available algorithms/scales that almost uniformly perform poorly. This poor discriminatory power of machine learning algorithms most likely results from suboptimal training datasets with an over-representation of short non-epitopes. For example, LBTope was trained on 80% short 6–16-aa confirmed non-epitopes. In contrast, Bcpreds was trained by use of random Swiss-Prot peptides as non-epitopes, equal in length to confirmed epitopes, and it performed better than LBTope (0.06–0.10 AUC value difference between Bcpreds and LBTope; Fig. 5B). Among the published combined B-cell epitope prediction scales, only Bepipred showed acceptable performance, better than the accurate Parker hydrophilicity scale (0.06–0.09 ΔAUC compared with Parker hydrophilicity; see Table 3). Bepipred nevertheless requires long peptide scores for optimal performance (0.07–0.10 ΔAUC between 25-aa peptide and default scoring; Table 3), and it is not a pure machine learning algorithm because it combines a protein property scale, Parker hydrophilicity, with a hidden Markov model (19).

Dominant Properties of B-cell Epitope Regions-Our evaluation of the discriminatory power of B-cell prediction algorithms in the extensively experimentally confirmed Chl_18Prot dataset allowed us to deduce some critical global properties that define natural B-cell epitope regions. Clearly, the dominant property is the propensity for a disordered state of amino acids in B-cell epitopes. This property is linearly correlated with hydrophilicity (inverted Miyazawa hydrophobicity scale (30); R² = 0.66, p value <10⁻⁶; linear regression analysis of the 25-aa peptide scores centered around each residue of the Chl_18Prot dataset), flexibility (Karplus and Schulz (12); R² = 0.56, p value <10⁻⁶), solvent accessibility (Spine-X (34); R² = 0.49, p value <10⁻⁶), evolutionary mutation rate (R² = 0.43, p value <10⁻⁶), coils in secondary structure (PSIPRED (52); R² = 0.42, p value <10⁻⁶), and β-turns (Levitt (54); R² = 0.40, p value <10⁻⁶). Thus, due to multi-collinearity, the multifaceted property of protein disorder tendency synthesizes all of these properties into a single descriptor (Fig. 5). The physicochemical, structural, and evolutionary properties of B-cell epitope regions discriminate them sufficiently to translate into significant differences in amino acid composition from the remaining total proteins. B-cell epitopes are enriched for proline, followed by glutamic and aspartic acids, asparagine, threonine, alanine, and serine (Fig. 5C). Epitopes are also relatively depleted of leucine, isoleucine, tryptophan, phenylalanine, tyrosine, and cysteine.
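For readers who wish to reproduce this type of comparison on their own data, the coefficient of determination between two per-residue scales can be computed as the squared Pearson correlation. The following Python sketch is illustrative only; the function name is hypothetical, and the inputs are assumed to be two equally long lists of 25-aa peptide scores, one value per residue.

```python
def r_squared(x, y):
    """Squared Pearson correlation of two equally long score lists.

    For a simple linear regression of y on x this equals the R² of the fit.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length (>= 2)")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant score list has no defined correlation")
    return (sxy * sxy) / (sxx * syy)
```

High R² values among several scales indicate that they carry overlapping information, which is the multi-collinearity referred to above.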
Table 3). Plots of epitope-positive rate versus false-positive rate for the 18 chlamydial protein dataset are shown. A, prediction of epitopes from confirmed non-epitopes (25-aa epitopes/non-epitopes spaced 10 aa). The combined scale represents the arithmetic mean of two disorder scales, IUPred-L (31) and VSL2B (33), and one solvent accessibility scale, Accessible Surface Area, Spine-X (34). B, prediction of epitopes from the total remaining proteins (non-epitope plus non-tested regions). In both datasets (A and B), the combined scale and the single disorder (IUPred-L) scale performed best (highest sensitivity at given specificity or vice versa), significantly better than Bepipred or LBTope (one-tailed paired Student's t test, p value <10⁻⁴).

TABLE 4 Epitope prediction accuracy (AUC) averaged for individual proteins of the 18-chlamydial protein dataset a
a Original scores obtained with default options for the algorithm/scale were smoothed by a sliding window method in which the score for each residue was averaged for the adjacent ±12 residues (25-aa moving window). Smoothed scores of residues were standardized for each of the 18 chlamydial proteins, and discrimination of epitope residues from remaining total residues was tested for each of the 18 proteins individually.
b Coils (Spine-X) indicate the coils predicted in secondary structure (36).

Proposed B-cell Epitope Prediction-As a result of the preceding analyses, an easily implemented approach for accurate B-cell epitope prediction has emerged that should be useful for investigators in many fields of antibody research. Fig. 6 demonstrates the application of the previous findings for B-cell epitope prediction in an actual example for which we generated epitope scanning data of the complete chlamydial protein IncA. In Fig. 6A, the default IUPred-L and VSL2B disorder scores of the IncA protein are plotted against IncA residue number. Solvent accessibility and hydrophilicity are shown in Fig. 6, B and C. In Fig. 6D, noise was reduced by smoothing the scores as 25-aa moving averages, and comparison was improved by standardizing the data. Comparison of these IncA epitope prediction plots with actual IncA peptide reactivity in Fig. 6E clearly shows that IUPred-L predicted scores best match experimental observations and confirms the superior B-cell epitope discriminatory power of protein disorder tendency as calculated by IUPred-L. Fig. 6E displays optimal prediction approaches by the combined scores of scales shown in Fig. 6D, and smoothed and original default IUPred-L scores. While combined scores have marginally better discriminatory power (Fig. 5), for practical purposes we consider the accuracy of IUPred-L sufficient. Also, given the wide-context nature of protein disorder scales, scores are sufficiently stable to even render smoothing unnecessary, allowing direct use of default plots obtained from the IUPred-L webserver for B-cell epitope prediction. Therefore, 16–30-aa peptide antigens for laboratory testing can be selected directly from peak disorder regions of the IUPred-L plot.
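The smoothing, standardization, and peak selection steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the procedure used to produce Fig. 6: the function names are hypothetical, the moving window is simply truncated at the protein termini, and the cutoff of 1.0 standardized units that defines a "peak" is an arbitrary choice, because no numerical cutoff is prescribed here. The input is assumed to be a list of per-residue disorder scores, for example as exported from a disorder predictor.

```python
from statistics import mean, pstdev


def smooth(scores, half_window=12):
    """Average each residue with its neighbours within +/- half_window
    (25-aa window for half_window=12). Assumption: truncated at the termini."""
    n = len(scores)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out


def standardize(scores):
    """Rescale per-residue scores of one protein to mean 0 and S.D. 1."""
    mu, sd = mean(scores), pstdev(scores)
    return [0.0] * len(scores) if sd == 0 else [(s - mu) / sd for s in scores]


def candidate_peptides(z, threshold=1.0, min_len=16, max_len=30):
    """Return 1-based inclusive (start, end) peptides covering each run of
    residues scoring above `threshold` (hypothetical cutoff). Peptide length
    is the run length clamped to min_len..max_len, centred on the run."""
    n = len(z)
    peptides, i = [], 0
    while i < n:
        if z[i] <= threshold:
            i += 1
            continue
        j = i
        while j < n and z[j] > threshold:
            j += 1                      # the peak run is z[i:j]
        length = min(max(j - i, min_len), max_len, n)
        centre = (i + j - 1) // 2
        start = max(0, min(centre - length // 2, n - length))
        peptides.append((start + 1, start + length))
        i = j
    return peptides
```

A typical call would be `candidate_peptides(standardize(smooth(raw_scores)))`; for a wide-context scale such as IUPred-L, the text above suggests that the smoothing step may be omitted.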

Discussion
The results of this study suggest strategies for B-cell epitope identification that deviate from current approaches that many investigators use. Our approach initially identifies protein regions that harbor B-cell epitopes rather than immediately focusing on identifying peptide antigens of specified length. B-cell epitope regions can be predicted with high accuracy simply by selection of the peak regions from the IUPred-L disorder plot (31) of a protein antigen (Fig. 6E). Next, these high probability epitope regions should be confirmed with 16–30-aa-long peptide antigens using pooled antisera. Fine mapping of highly reactive regions with overlapping 16-aa peptides, using the individually reactive antisera of the pool, identifies regions with several functional aa residues embedded among structural epitope residues (6). Further reduction in peptide antigen length entails mapping with very short 6–12-aa peptides. Success at this stage relies on stochastic identification (Fig. 2) of closely spaced randomly distributed functional residues that maximally contribute to antibody binding. Antibody binding of such short peptides is, however, typically low (Fig. 1), most likely because antigens of less than 16 aa will not bind to the complete CDR of an antibody (3–7).
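The fine-mapping step with overlapping 16-aa peptides amounts to tiling a reactive region with fixed-length windows. The Python sketch below illustrates this; the function name is hypothetical, and the offset of 4 residues between consecutive peptides is an assumed value chosen for illustration, as the offset is not specified here.

```python
def overlapping_peptides(sequence, length=16, step=4):
    """Tile `sequence` with peptides of `length` residues, each shifted by
    `step` residues (assumed offset). A final peptide is added if needed so
    that the C-terminal residues of the region are always covered."""
    if len(sequence) <= length:
        return [sequence]
    last_start = len(sequence) - length
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [sequence[s:s + length] for s in starts]
```

For a 30-aa region, the default settings yield five 16-aa peptides whose overlaps ensure that every residue is presented in at least one peptide.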
This approach is derived from the conclusive evidence that short 7-12-aa peptide antigens of confirmed Chlamydia spp. epitopes bind antibodies poorly (Fig. 1), and therefore many of these epitopes would be falsely classified as non-epitopes if they were identified by short peptide mapping. The poor reactivity of short peptide antigens combined with data in Fig. 2 strongly suggest that many of the short non-epitopes in public B-cell epitope datasets are likely to be actual epitopes. Most investigators who develop B-cell epitope prediction algorithms/scales draw training and test datasets from public databases such as IEDB. These reference datasets are suboptimal due to over-
untrained protein property scales, such as protein disorder tendency, that were developed for different reasons nevertheless predict B-cell epitopes with higher accuracy than specifically developed B-cell epitope prediction scales/machine learning algorithms (supplemental Tables S2 and S3 and Fig. 5B).
A fundamental conundrum in B-cell epitope prediction is the conceptual and methodological approach that leads to the eventual identification of a B-cell epitope. Vastly preferable is the use of x-ray crystallography-solved three-dimensional structures of antigen-antibody complexes. Such data define precisely the actual determinants of a protein antigen that specifically contact an antibody, in essence the set of protein residues that are buried under a cognate antibody in the antibody-antigen complex (3–7). However, only 26–107 non-identical three-dimensional structures of antigen-antibody complexes have been assembled by different investigators from the Protein Data Bank crystallographic database (3–7, 56–62). Such data were used for training and development of several B-cell epitope prediction methods such as CEP, DiscoTope, Rapberger's method, Ellipro, PEPITO, and Epitopia (56–62). The major shortcoming is the requirement for the three-dimensional structure of the protein antigen. In practice, this limitation is currently insurmountable because we do not know the three-dimensional structure of most proteins.

In practice, B-cell epitopes are commonly determined by use of peptide antigens and their ability to capture antibodies. This approach does not identify which residues of the peptide are in binding contact with antibody CDR residues and which actually contribute to the antigen-antibody complex formation. Nonetheless, antibody-reactive peptide sequences, particularly those identified by systematic mapping with overlapping peptides, are commonly referred to as B-cell epitopes (16, 63–65). This terminology is justified because even non-binding residues are specifically required to provide the structural context for binding residues; thus, the linear peptide sequence is still an indispensable, if not complete, characterization of a B-cell epitope. It is important, however, to understand that epitope prediction from linear peptide sequences will weigh the total combined contributions of binding (functional) and spacer (structural) amino acids to an epitope. Nevertheless, B-cell epitope prediction from the primary amino acid sequence of a protein is a valid and, for practical purposes, highly desirable approach. In addition, tens of thousands of B-cell epitope/non-epitope sequences have been deposited in IEDB (24). Thus, the sequence-based B-cell epitope datasets provide a viable basis for training and development of B-cell epitope prediction algorithms (10–22).
The profound conundrum for epitope prediction by use of linear peptide sequence-based methods, however, is the fact that more than 90% of all B-cell epitopes are not linear, i.e. composed of immediately neighboring binding residues, but discontinuous. In almost all B-cell epitopes, the typical 2–5 dominant binding residues will be discontinuously arranged randomly in the linear epitope sequence (3, 6, 7). Nevertheless, in the majority of epitopes these binding residues are still closely spaced. For instance, Sivalingam and Shepherd (6) show that 30-aa peptides will encompass the functional residues of 75% of all B-cell epitopes. Thus, increased lengths of peptide antigens will increase the probability of capturing more of the residues of any epitope that are required for high affinity antibody binding (Fig. 1). In addition, long peptides may increase the probability of capturing different antibody clones that may bind the same epitope region differently (65). For instance, C. trachomatis OmpA serovar-specific peptide serology has used 6–10-aa peptides, with inconsistent results (66–70). In our study, we observed strong but completely serovar-specific antibody reactivity by use of ≥16-aa peptide antigens (2). Importantly, inclusion of conserved adjacent residues shared among chlamydial species, in addition to the 7–10 central polymorphic serovar-determinant OmpA residues, was required for strong, yet specific, antibody binding (2).
Conceptually, a peptide antigen captures antibodies if it can fold to complement the binding region of the cognate antibody (65). Because of such structural constraints, the length of peptide antigens may also negatively influence antibody binding. For instance, if the few randomly spaced dominant binding residues are obstructed by structural constraints such as misfolding, masking by non-epitope residues, or peptide aggregation (65), antibody binding may be compromised. Our study clearly shows that moderate elongation of peptide antigens strongly enhances antibody binding, while more extensive elongation reduces antibody binding again in 20% of B-cell epitopes, presumably by masking epitope residues (Fig. 1). The implication of this fact is that an optimal sequence length exists that most reliably discriminates between true epitopes and non-epitopes and that sequences of that length should be used to generate datasets for the development of B-cell prediction methods.
A protein surface can be thought of as a continuous landscape of epitopic regions, and any region of this landscape may be identified as an epitope under specific conditions (56, 63–65). For instance, Singh et al. (24) reported that all non-epitopes in the LBTope_Confirm dataset have been reported as "non-epitopes" in at least two studies. Yet 8.3% of these non-epitopes are reported as "epitopes" in the fBcpreds dataset (21). Thus, binary classification of antigen regions into epitopes or non-epitopes is problematic because not all epitopes of most antigens are known, and defining B-cell epitopes and non-epitopes is a challenging task due to the variability in epitope discovery assays (71) and the stochastic antibody responses to protein antigens (9) and their epitopes (2). Muller et al. (71) found almost the entire histone 2A protein antigenic when they forced highest B-cell stimulation and antibody reactivity by excessive use of adjuvants and high antigen doses. In contrast, raising antisera in our study by experimental infection rather than by forced immunization very likely resulted in much lower adjuvantation and lower antigenic stimulus by physiologically processed native protein antigens (2). Thus, antibodies likely were generated mainly against exposed antibody-binding regions of highly expressed proteins. In addition, targeting known immunodominant proteins by the use of antisera pooled from multiple individuals maximized correct epitope/non-epitope discrimination by offsetting the inherent stochasticity of antibody formation in individuals and by minimizing false-negative results. We observed a clear trend that certain protein regions are a more frequent source of B-cell epitopes than others, and we think that our study identified the distinctive properties of such preferentially antibody-recognized regions.
Kringelum et al. (7) determined by x-ray crystallography that hydrophobic amino acids of epitopes are located closest to the antibody, and charged amino acids most distant, but that the amino acid composition of equally surface-exposed non-epitopes did not differ significantly from epitopes. However, the amino acid composition of epitopes deviated significantly from the whole protein (7). We compared properties of B-cell epitope regions with experimentally confirmed non-epitope regions or the remaining protein regions, but we do not know about surface exposure. Similarly, we report that many protein properties of B-cell epitope regions differ substantially from the total remaining proteins (Fig. 5), making these properties candidates for B-cell epitope prediction. Accessibility of the antigen by cognate B-cell receptors or antibodies is the central concept in molecular recognition of the epitope by the paratope, and thus highly surface-exposed hydrophilic/charged epitope residues will first interact with the antibody (Fig. 5C). Although hydrophobic amino acids except for alanine, the smallest one, are under-represented in epitopes, those that are present may "provide the glue" in the final stabilization of the antigen-antibody complex by hydrophobic interaction. All non-covalent antigen-antibody interactions are thought to be driven by shape complementarities in the complex formation (57). Thus, paratopes may interact preferentially with flexible regions of an antigen rather than with highly structured regions (72). Precisely because of relaxed structural constraints, such protein regions should accommodate higher amino acid substitution rates, favoring under immunoselective pressure the emergence of escape mutants.
Important antigenic regions of viral and bacterial proteins have been identified as disordered regions of these protein antigens (72). However, x-ray crystallography studies have not specifically reported the localization of B-cell epitopes in disordered protein regions (3–7, 56–65). The most likely explanation for this discrepancy with our results is the fact that the Protein Data Bank database is biased toward proteins of common interest that are easy to produce and crystallize. Many expressed proteins cannot be crystallized, and among the main factors for this failure is the presence of even small numbers (1–10 aa) of disordered residues that are well known to have deleterious effects on crystallization (76, 77). For convenient determination of three-dimensional structures, disordered protein regions are removed from expressed proteins (78). As a result, disordered proteins or protein regions are rare in the Protein Data Bank database compared with whole proteomes (79–81). Moreover, crystal packing is thought to force certain disordered regions to become ordered (31), resulting in incorrect characterization of disordered protein residues. In addition, disordered segments crystallized together with binding antibodies are usually classified as ordered structure in the antigen-antibody complex, despite their lack of ordered structure in the unbound state. Thus, datasets generated by crystallography may inherently under-represent B-cell epitopes with high disorder tendency.
Protein disorder tendency has also not been proposed for B-cell epitope prediction from primary amino acid sequences, although many protein property scales, particularly aa propensity scales, have been tested and recommended for B-cell epitope prediction (16, 63–65). In our study, the IUPred-L disorder scale has the highest epitope discriminatory power in all datasets. We explain this discrepancy by the typical experimental approach with which investigators test sequence-based epitope prediction methods as follows: wide-context disorder properties of proteins will not be correctly determined by solely analyzing the typically short peptide sequences of databases. To achieve correct results, we elongated test peptides with source protein sequences and embedded them in a wider context of random Swiss-Prot sequences (supplemental Table S1 and supplemental Appendix).
As a norm in investigations addressing protein disorder, protein residues are binary-classified as either "ordered" or "disordered." In contrast, disorder prediction algorithms quantify the probability of protein disorder, and binary classification converts the prediction scores by using an arbitrary cutoff at a predetermined threshold. By these criteria, many epitopes would not classify as disordered. However, relative to the moving average score of the whole source protein, B-cell epitopes consistently score highest for protein disorder tendency. For actual B-cell epitope prediction, the IUPred-L protein disorder scale consistently performs best (87% specificity at 80% sensitivity, 86% accuracy; Fig. 4). However, if a 25-aa moving average score is used, several other protein property scales such as hydrophilicity (Parker), hydrophobicity (Miyazawa), solvent accessibility (Spine-X), or Bepipred perform similarly. In fact, scoring by narrow-context scales for long 20–30-aa peptides reflects protein disorder tendency, just as the Globplot-2 algorithm predicts protein disorder tendency by a wide-context hydrophilicity score (38). It is noteworthy that even the best combination of top performing scales does not substantially increase prediction performance (Fig. 4), due to multi-collinearity of these scales. The best performing combined scale (Figs. 4 and 5 and supplemental Tables S2 and S3), derived from smoothed and standardized 25-aa peptide scores of three primary scales, improves prediction accuracy only marginally (90–92% specificity at 80% sensitivity, 88–90% accuracy; Fig. 4).
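Performance figures of the form "specificity at 80% sensitivity" depend on how a continuous score is converted into a binary call. The Python sketch below shows one way to derive them; it is an illustration under assumptions, not the evaluation code of this study. The function name is hypothetical, the cutoff is taken as the highest score that still classifies the requested fraction of epitopes as positive, and accuracy is assumed to be the fraction of all sequences classified correctly.

```python
import math


def specificity_at_sensitivity(pos, neg, sensitivity=0.80):
    """Return (cutoff, achieved sensitivity, specificity, accuracy).

    pos: scores of confirmed epitopes; neg: scores of non-epitopes.
    A sequence is called an epitope if its score is >= cutoff.
    """
    ranked = sorted(pos, reverse=True)
    # Number of epitopes that must score at or above the cutoff; the small
    # epsilon guards against floating-point overshoot in the product.
    k = max(1, math.ceil(sensitivity * len(ranked) - 1e-9))
    cutoff = ranked[k - 1]
    tp = sum(p >= cutoff for p in pos)
    tn = sum(n < cutoff for n in neg)
    return (cutoff,
            tp / len(pos),
            tn / len(neg),
            (tp + tn) / (len(pos) + len(neg)))
```

Because accuracy pools both classes, it approaches the specificity when non-epitopes greatly outnumber epitopes, as is the case when epitopes are compared with the total remaining protein.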
Our data show that wide-context disorder scores or long 20–30-aa peptide scores of narrow-context scales are optimal for B-cell epitope prediction (Tables 2 and 3), consistent with the higher antibody binding of 16–30-aa peptide antigens (Fig. 1). Compared with highly structured protein regions, disordered regions may have several functional advantages for efficient interactions with partner molecules (26, 27), such as the capacity of initiating binding by long range electrostatic interactions, high flexibility, binding plasticity and speed, minimal steric restrictions in binding, and the ability to form very stable intertwined complexes (26, 27, 82–87). Hence, our investigation merges theoretical advances in protein biophysics with very practical aspects of protein interaction, the identification of peptide sequences best suited for recognition by CDRs of antibodies.
Author Contributions-K. S. R. and B. K. planned the experiments; K. S. R. and E. U. C. performed the experiments; K. S. R. and B. K. analyzed the data; K. S. R., B. K., and K. S. contributed reagents and essential material; and K. S. R. and B. K. wrote the paper.