New analysis pipeline for high-throughput domain-peptide affinity experiments improves SH2 interaction data

Protein domain interactions with short linear peptides, such as Src homology 2 (SH2) domain interactions with phosphotyrosine-containing peptide motifs (pTyr), are ubiquitous and important to many biochemical processes of the cell. The desire to map and quantify these interactions has resulted in the development of high-throughput (HTP) quantitative measurement techniques, such as microarray or fluorescence polarization assays. For example, in the last 15 years, experiments have progressed from measuring single interactions to covering 500,000 of the 5.5 million possible SH2-pTyr interactions in the human proteome. However, high variability in affinity measurements and disagreements about positive interactions between published datasets led us to re-evaluate the analysis methods and raw data of published SH2-pTyr HTP experiments. We identified several opportunities to improve the identification of positive and negative interactions and the accuracy of affinity measurements. We implemented model fitting techniques that are more statistically appropriate for the non-linear SH2-pTyr interaction data. We developed a novel method to account for protein concentration errors due to impurities and degradation, and to address protein inactivity and aggregation. Our revised analysis increases the accuracy of reported affinities, reduces the false negative rate, and increases the amount of useful data through the addition of reliable true negative results. We demonstrate improved classification of binding vs. non-binding when using machine learning techniques, suggesting improved coherence in the reanalyzed datasets. We present revised SH2-pTyr affinity results, and propose a new analysis pipeline for future HTP measurements of domain-peptide interactions.


Introduction
Protein domain interactions with short linear peptides are found in many biochemical processes of the cell, and play a central role in cell physiology and communication. For example, SH2 domains are central to pTyr signaling networks, which control cell development, migration, and apoptosis (1). The 120 human SH2 domains are considered "readers", since they read the presence of tyrosine phosphorylation by binding specifically to certain phosphorylated amino acid sequences. Approximately half of the binding energy of the SH2-pTyr interaction is due to an invariant arginine which forms a salt bridge with the ligand pTyr. The remainder of the binding energy results from interactions between the SH2 domain binding pocket and the residues flanking the central pTyr residue (2)(3)(4), resulting in the specificity of SH2 domain interactions critical to pTyr-mediated signaling (5). Measurements of all SH2 binding affinities for target peptides would greatly aid in the decryption of domain specificity and advance understanding of the cell signaling networks that control human physiology. However, the total potential number of interactions is immense: the 46,000 tyrosines currently known to be phosphorylated in the human proteome (6) could each interact with any of the 120 human SH2 domains, resulting in over 5.5 million possible SH2-pTyr interactions.
Recent developments have expanded the measurement coverage of human SH2-pTyr interactions. Eight high-throughput (HTP) studies have been performed to measure SH2 domain interactions with specific phosphopeptide sequences (7)(8)(9)(10)(11)(12)(13)(14) (Table 1) using either microarrays or fluorescence polarization (FP). The six studies that quantitatively measured affinity represent ~90,000 pairs of domain-peptide interactions, but these measurements cover less than 2% of possible interactions. In response, computational approaches have attempted to predict as-yet-unmeasured interactions using the published interaction data. These methods span the range from thermodynamic models which predict interaction strength using existing structure and binding measurements (15)(16)(17) to supervised machine learning models using patterns in peptide sequences and quantitative binding data to predict binding (14,18). However, no computational method has used the available affinity data in its entirety. We therefore wished to leverage all available binding affinity data in a supervised learning approach to expand our knowledge of SH2-pTyr interaction space.
Unfortunately, in the process of reviewing published HTP data, we found surprising disagreement between publications about which domain-peptide pairs interacted. Even for the limited number of interactions on which they agreed, the publications reported vastly different affinities for interacting pairs. We identified two issues common to all of the data sets that could be responsible for the discrepancies: errors affecting protein concentration, and improper use of statistical methods affecting modeling results.
First, we found potential sources of error in protein concentration that could affect reported affinity values. Protein was minimally purified (via nickel chromatography), and protein concentration was measured by absorbance. No study used positive controls to determine the degree of protein functionality before measuring affinity. Thus, protein preparations of varying purity, functionality, and non-monomeric content were used for affinity measurements. Impure or degraded protein causes overestimation of protein concentration relative to the amount of active protein in a sample. These protein concentration errors propagate directly to errors in affinity, because affinity is derived from concentration and activity.
Second, we found errors in model fitting and in the statistical methods used to evaluate model fits, which could have a significant impact on the reported affinities. All of the affinity studies used the receptor occupancy model and the coefficient of determination (r²) as a measure of how well the model fits the data. For linear models, one can interpret a value of r² between 0 and 1 as the fraction of total variance explained by the fit. However, when applied to non-linear models (like the receptor occupancy model used in each of these studies to derive affinity), r² cannot be interpreted as the fraction of variance explained, and has been conclusively shown to be a poor indicator of fitness (19). Although this fact has long been established in the statistical literature (20-26), r² is still commonly used to evaluate non-linear models in pharmaceutical and biomedical publications despite being an ineffective and misleading metric. In these publications, the use of r² effectively biased results toward true positive interactions at the expense of making many false negative calls, and excluded many replicate measurements from incorporation into the reported affinity values.
Therefore, due to both inaccuracies in quantitative results and the significant potential for large numbers of false negative results, we had serious concerns about using the published affinities in machine learning. To overcome these issues, we decided to retrieve and reanalyze available raw data in order to systematically improve classification and affinity accuracy for SH2-pTyr interactions. To accomplish this we: 1) refined model fitting techniques, 2) implemented fitting multiple models to each measurement, 3) used a statistically accurate method for model selection, 4) developed methods to identify and remove non-functional protein from the results, and 5) introduced a novel method to address the effects of protein concentration errors on reported affinity.
Our revised analysis improves affinity accuracy, improves sensitivity by reducing the false negative rate, and results in a dramatic increase in useful data due to the addition of thousands of true negative results. Evaluation of the revised affinities shows improved learning accuracy within an active learning model, suggesting that there is improved coherency in the features of the revised dataset.

Evaluation of published affinity data and acquisition of raw data
In the process of evaluating published high-throughput data, we found significant disagreement between data sets. We evaluated all publications using HTP methods to measure SH2 domain interactions with specific peptide sequences. The publications containing SH2 affinity data can be grouped into three distinct data groups (Table 1). The first data group consists of the group of studies published by the MacBeath lab from 2006 to 2009 (7,9,27) which contain a body of predominantly non-overlapping protein microarray (PM) experiments. The second data group consists of a large study published by the MacBeath lab in 2013 (10) with a set of new PM measurements using the protocol published in 2010 (28). The third data group consists of two non-overlapping sets of fluorescence polarization (FP) experiments published in 2012 and 2014 by the Jones lab (13,14). Because the other experiments (11,12) only measured interaction and not affinity, they were not considered for this analysis.
In order to determine agreement between data sets, we examined both qualitative and quantitative results. First, we examined the correlation between domain-peptide affinity measurements which overlapped between any two data groups (Fig. 1, top row). We found surprisingly low correlation between affinity measurements (with a maximum correlation of r = 0.367). Next we asked if the different data groups identified the same positive interactions between domain-peptide pairs, even if they did not agree on the affinity measurements. Here, we found significant disagreement over which domain-peptide pairs were found to interact (Fig. 1, bottom row). Of 347 positive domain-peptide interactions identified by one or more groups, fewer than 16% (55/347) were found to interact in all three data groups. No two experiments were able to agree on more than 29% of the positive interactions. The differences in interaction identification were spread randomly among SH2 domains and peptides, with no single SH2 domain, peptide, or peptide family being overrepresented (Fig. S1). These findings demonstrate significant quantitative and qualitative differences between published data from different labs, and even disagreement between publications from the same lab.
Although there are significant differences between the techniques of PM and FP experimental methods, the differences between positive interactors did not group by technique (Fig. 1) and different techniques had similar numbers of common positive interactions. Although very low, the best correlation (r = 0.367) was between a PM experiment and an FP experiment from the same lab. All three data groups used similar protein and peptide production and purification methods, absorbance for determination of protein concentration, the receptor occupancy model for determining affinity, and similar methods of evaluating model fits based on the coefficient of determination (r²).
We concluded that we would need to look further into the methods and raw data to evaluate the differences between published data sets, or even to evaluate the quality of any single set of published data. Acquisition of raw data from published studies was surprisingly difficult. No publication included raw data, only supplemental tables with post-processed values for affinity which are insufficient for replication of published results. Furthermore, we discovered that most raw data underlying the published analysis has been lost by the original authors and is no longer available from any party (Table 1). Fortunately, we were able to retrieve raw data from the Jones 2012-14 data group (personal communication from Richard Jones, Ron Hause, and Ken Leung).

Raw SH2 interaction data and revised analysis
We began by examining the raw data from the Jones 2012-14 data group to evaluate the quality and completeness of the data, and to review the methods used to process the raw data into its published form. Although some raw data was missing in comparison to the original publication, by limiting our revised analysis to interactions of single SH2 domains with phosphopeptides from the ErbB family (EGFR, ERBB2, ERBB3, ERBB4), as well as KIT, MET, and GAB1, the available raw data covered approximately 99.6% of the reported measurements.
Evaluation of the implementation of the receptor occupancy model. The raw data for each measured interaction consisted of fluorescence polarization measurements of an SH2 domain in solution with a phosphopeptide at equilibrium at 12 concentrations. In the original publication, the raw data was then used to derive an equilibrium dissociation constant (K_d) by fitting the receptor occupancy model (developed by Clark in 1926 using the law of mass action (29)). As applied to the fluorescence polarization data, the model takes the form:

F_obs = F_max · [SH2] / (K_d + [SH2])     (equation 1)

where F_obs is the observed FP signal at each assayed protein concentration [SH2] of the SH2 domain (measured in millipolarization units (mP)), and F_max represents the FP at saturation (see also Fig. S2). The affinity (K_d) and saturation limit (F_max) are fitted parameters of the model. It is important to note that this model depends on several critical assumptions: that the reaction is reversible; that the ligand only exists in a bound and unbound form; that all receptor molecules are equivalent; that the biological response is proportional to occupied receptors; and that the system is at equilibrium. We hypothesized that the specific methods used to implement the receptor occupancy model in the original publications might have affected the accuracy of the originally published results. We examined three aspects of the implementation of this model. First, we reviewed the method of subtraction of background fluorescence and found that it introduces random errors in affinity results. Second, we evaluated whether the receptor occupancy model could reliably fit a non-binding sample. When we found it could not, we implemented an alternate model and a model selection procedure in order to more reliably identify negative interactions. Finally, we examined the effect of dropping outlier measurements on model fitting results, and implemented an alternative method to determine model fitness: the signal-to-noise ratio (SNR).
Background subtraction causes errors in model fits and is replaced by fitting an offset. In the original analysis, the authors used a plate-wise background subtraction method, where the median baseline control value was recorded from plate measurements and subtracted from the polarization signal observed at each data point (13). When plates had excessive variation in baseline control values, the authors excluded these results from further analysis. However, in examining many measurements by eye, we found that the background values seemed uncorrelated with the signal values (Fig. S3).
A critical feature of the receptor occupancy model is that the saturation curve passes through the origin (because the point of zero signal is also the point of zero concentration). Background subtraction thus forces the zero-signal point to a concentration other than zero, resulting in higher residual error, which introduces errors in derived affinity (Fig. S4). These errors increase or decrease affinity (depending on whether the background is high or low), and are non-linear in affinity. Since the relationship of the background to the signal was seemingly random, and the error factors are non-linear, background subtraction injected random error into affinity calculations. More than 54% of the replicate measurements exhibited problematic background levels (Fig. S4, bottom row). Thus, we rejected the background subtraction method in favor of fitting the model along with an offset value (F_0):

F_obs = F_max · [SH2] / (K_d + [SH2]) + F_0     (equation 2)

Example fits using the receptor occupancy model with offset can be seen in Fig. S5.
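As an illustration, the offset fit can be sketched numerically: for any fixed K_d the receptor occupancy model with offset is linear in F_max and F_0, so those two parameters can be solved by linear least squares while K_d is scanned over a grid. This is a minimal sketch in Python on simulated data; the original analysis used its own fitting implementation, and all parameter values here are hypothetical.

```python
import numpy as np

def fit_occupancy_offset(conc, f_obs, kd_grid=None):
    """Fit F_obs = F_max*[P]/(K_d + [P]) + F_0.

    For any fixed K_d the model is linear in F_max and F_0, so those are
    solved by least squares while K_d is scanned over a log-spaced grid.
    """
    if kd_grid is None:
        kd_grid = np.logspace(-3, 2, 600)          # 0.001-100 uM
    best = None
    for kd in kd_grid:
        x = conc / (kd + conc)                     # fractional occupancy
        A = np.column_stack([x, np.ones_like(x)])
        (f_max, f_0), *_ = np.linalg.lstsq(A, f_obs, rcond=None)
        rss = np.sum((f_max * x + f_0 - f_obs) ** 2)
        if best is None or rss < best["rss"]:
            best = {"rss": rss, "kd": kd, "f_max": f_max, "f_0": f_0}
    return best

# Simulated replicate: K_d = 0.5 uM, F_max = 200 mP, offset F_0 = 30 mP,
# measured at 12 protein concentrations as in the experimental design
rng = np.random.default_rng(0)
conc = np.logspace(-2, 1, 12)                      # uM
f_obs = 200 * conc / (0.5 + conc) + 30 + rng.normal(0, 2, 12)

fit = fit_occupancy_offset(conc, f_obs)
```

Fitting the offset rather than subtracting a background keeps the curve anchored to the data's own baseline, so the zero-concentration intercept is estimated instead of forced.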
The receptor occupancy model fails to accurately identify non-binding measurements; introducing a linear model. Although the receptor occupancy model is theoretically capable of fitting a typical binding saturation curve as well as a 'flat' curve representative of non-binding interactions, we found that it fails to accurately fit non-binding interactions in practice (Fig. S6), resulting in artefactual model fits with unreasonable affinity and saturation values.
Because negative interactions resemble low-slope or zero-slope lines with superimposed random noise, we hypothesized that a linear model would more reliably fit these 'non-binders' and resolve fit artifacts. The linear model is:

F_obs = m · [SH2] + F_0     (equation 4)

where F_0 represents an offset value and m is a constant representing the slope of the fitted line (Fig. S6, red fits). Out of 37,378 replicate measurements, 31,861 were best fit by the linear model. Of these, 29,778 were initially classified as non-binders. We also found a group of replicate measurements (~6%) which were best fit by a linear model but with a steep positive slope. A linearly increasing fluorescent signal with no indication of saturation violates the assumptions of the receptor occupancy model, and is more likely to represent a form of protein aggregation, peptide aggregation, or some other form of non-specific binding. Thus, to preserve the quality of the non-binding calls, a conservative slope cutoff of 5 mP/µM was implemented, above which replicates were identified as aggregators and removed from further consideration. Of the 31,861 replicates best fit by the linear model, 2083 were initially classified as aggregators.
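A minimal sketch of the linear fit and the slope-based classification, using the 5 mP/µM cutoff stated above on simulated data:

```python
import numpy as np

SLOPE_CUTOFF = 5.0   # mP/uM; steeper linear fits are called aggregators

def fit_linear_offset(conc, f_obs):
    """Fit F_obs = m*[P] + F_0 by ordinary least squares."""
    m, f_0 = np.polyfit(conc, f_obs, 1)
    return {"m": m, "f_0": f_0}

def classify_linear_fit(slope, cutoff=SLOPE_CUTOFF):
    """Low-slope linear fits are non-binders; steep linear rises suggest
    aggregation or other non-specific signal."""
    return "non-binder" if slope <= cutoff else "aggregator"

conc = np.logspace(-2, 1, 12)                      # uM
rng = np.random.default_rng(1)
flat = 30 + rng.normal(0, 2, 12)                   # noise around an offset
steep = 30 + 20 * conc                             # 20 mP/uM linear rise

flat_call = classify_linear_fit(fit_linear_offset(conc, flat)["m"])
steep_call = classify_linear_fit(fit_linear_offset(conc, steep)["m"])
```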
Fitting multiple models requires a model selection process. When more than one model can be used to fit the data, a method of model selection must be implemented to determine which model most accurately represents the data while balancing against adding additional parameters, which can lead to overfitting. In order to determine whether a measurement is best described by a receptor occupancy model or a linear model, we used the Akaike Information Criterion (AIC). In contrast to the coefficient of determination (r²), AIC is a model selection metric which is appropriate for use with non-linear models (19,30), is robust even with high-noise data, and employs a regularization technique to avoid overfitting by penalizing models with more parameters (the receptor occupancy model has 3 parameters; the linear model has only 2 parameters). In our implementation we used a bias-corrected form of the metric, AICc, in order to account for only having 12 data points per saturation curve. A lower AICc score indicates a better fit. Examples of model fitting can be seen in Fig. S7.
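The AICc comparison can be sketched from the residual sum of squares of each fit. This sketch uses the common least-squares form of AIC, in which the residual variance counts as an additional estimated parameter (so k = 4 for the receptor occupancy model with offset and k = 3 for the linear model with offset); the original implementation may differ in these conventions.

```python
import math

def aicc(rss, n, k):
    """Bias-corrected AIC for a least-squares fit with n data points and
    k estimated parameters (counting the residual variance)."""
    aic = n * math.log(rss / n) + 2 * k
    return aic + (2 * k * (k + 1)) / (n - k - 1)

def select_model(rss_occupancy, rss_linear, n=12):
    """Pick the model with the lower AICc score."""
    a_occ = aicc(rss_occupancy, n, k=4)   # K_d, F_max, F_0, variance
    a_lin = aicc(rss_linear, n, k=3)      # m, F_0, variance
    return "receptor occupancy" if a_occ < a_lin else "linear"
```

Note that when the two residual sums are close, the penalty terms favor the linear model because it has fewer parameters, which is the intended guard against overfitting.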
Evaluation of model fitness. In order to determine how well the data was represented by a model, we used the signal-to-noise ratio (SNR) as a metric of model fitness. The SNR metric sums the magnitude of the residual errors of the fit to the model (a form of noise), and weighs this sum against the overall size of the fluorescent signal measured. It is calculated as

SNR = ( Σ_{i=1..n} F_obs,i ) / ( Σ_{i=1..n} |R_i| )

where n is the number of data points, R_i is the residual value of the i-th data point, and F_obs,i is the observed fluorescence (in mP units).
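A sketch of the SNR computation, assuming the ratio of summed observed signal to summed absolute residuals described above (data values are hypothetical):

```python
import numpy as np

def snr(f_obs, f_fit):
    """Signal-to-noise ratio of a model fit: total observed signal
    divided by the summed magnitude of the residuals."""
    return np.sum(np.abs(f_obs)) / np.sum(np.abs(f_obs - f_fit))

# A tight fit (residuals ~1 mP on a strong saturation signal)...
f_obs_strong = np.array([20.0, 50.0, 90.0, 120.0, 140.0, 150.0])
good_fit = f_obs_strong + np.array([1.0, -1.0, 0.5, -0.5, 1.0, -1.0])
good_snr = snr(f_obs_strong, good_fit)       # far above 1

# ...versus a weak, noisy signal that the model misses badly
f_obs_weak = np.array([5.0, -8.0, 12.0, -3.0, 9.0, -6.0])
bad_fit = np.array([0.0, 2.0, 4.0, 6.0, 8.0, 10.0])
poor_snr = snr(f_obs_weak, bad_fit)          # below 1
```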
At an SNR ≥ 1, the measured signal is larger than the sum of all errors of the fit, which in practice represents a good quality fit. We chose a ratio of 1 as the limit of a good fit based on extensive visual inspection of the fits (see Fig. S8 and Fig. S9).

Outlier removal biases model fitting and selection. In the original publications, the authors utilized an iterative outlier removal process. For each set of 12 data points in a replicate measurement, individual points were identified as outliers using a statistical model. The outlier was removed and the fit was re-evaluated. Up to three points were removed per replicate measurement. For measurements where more than three data points were identified as outliers, the replicate measurement was removed from further consideration. For an ideal binding saturation experiment attempting to identify K_d, the concentrations tested should span either side of K_d, and the highest and lowest measured concentrations should establish the plateaus seen on semi-log saturation plots (see Fig. S5). Based on the concentrations selected for this experiment, the ideal range for quantification is an affinity (K_d) in the range of 0.05 µM to 0.5 µM. For interactions with a K_d > 1.0 µM, the upper plateau of the semi-log saturation curve no longer has any coverage (Fig. S5, row 2). Interactions with K_d > 5 µM have no data points at all above K_d (Fig. S5, rows 3 and 4). This suggests that every data point is critical for accuracy, particularly points above K_d; thus we chose to use all data points, to avoid introducing additional error and to allow the SNR metric to gauge the quality of fit.
Summary of revised analysis method for replicate measurements. Following a systematic review of each decision made in evaluating a measurement in HTP affinity studies, we developed an improved analysis pipeline (Fig. 2). For each replicate measurement, we fit two models: a receptor occupancy model with offset (equation 2) and a linear model with offset (equation 4). Fits were evaluated with AICc: the model with the lower score was chosen as the best fit. Replicates that were fit best by the linear model and had a slope of less than or equal to 5 mP/µM were classified as negative interactions, or 'non-binders'. Linear fits with a slope greater than 5 mP/µM were classified as aggregators and removed from consideration. A replicate that was fit best by the receptor occupancy model was then evaluated for signal-to-noise ratio (SNR). If the SNR was greater than one, the replicate was classified as a positive interaction, or 'binder'. Out of 37,378 replicate measurements, we identified 7.4% (2753) as binders, 79.7% (29,778) as non-binders, 7.4% (2764) as low-SNR fits, and 5.6% (2083) as aggregators (Fig. 3, left side).
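The decision logic of the pipeline reduces to a small function over per-replicate fit statistics; a sketch assuming the two fits and their AICc scores, slope, and SNR have already been computed (cutoff values are those stated above):

```python
def classify_replicate(aicc_occupancy, aicc_linear, slope, snr,
                       slope_cutoff=5.0, snr_cutoff=1.0):
    """Revised pipeline call for one replicate measurement."""
    if aicc_linear < aicc_occupancy:
        # Linear model fits best: flat lines are non-binders,
        # steep lines suggest aggregation
        return "non-binder" if slope <= slope_cutoff else "aggregator"
    # Receptor occupancy fits best: require adequate signal-to-noise
    return "binder" if snr >= snr_cutoff else "low-SNR"
```

The four possible return values correspond to the four replicate categories tallied above (binders, non-binders, low-SNR fits, and aggregators).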

High variation at the replicate level is likely caused by protein concentration errors
The original publication reported a single affinity (K_d) value for each domain-peptide pair, which was the average of multiple replicate domain-peptide measurements. However, we found patterns of high variation in affinity between replicates that suggested a significant problem with either the experimental design or the experimental method. We hypothesized that a single variable, errors in protein concentration, could be responsible for the high variance.
In reviewing data quality, we identified a large number of domain-peptide interactions demonstrating high variance in affinity among replicates (for example, K_d values ranging from below 0.5 µM to over 20 µM for replicates of a single domain-peptide interaction). In order to determine the source and character of the variance, we inspected the replicates of individual domain-peptide interactions as a group (for a representative example of all measurements from one such replicate group, see Fig. S10). Despite high variance between replicates, each replicate measurement had a high quality fit and low residual error, as expected from meeting an SNR > 1.
To explore this further, we visualized and quantified variance (Fig. 4) for all domain-peptide interactions. Although variance tends to increase as K_d increases, variance greater than 10 µM is found across a large fraction of all measurements, independent of affinity. How could high-quality individual replicate measurements result in such varied affinities for a single domain-peptide pair? We hypothesized that protein concentration error (arising from differences in protein preparations such as impurities, degradation, and inactivity) could directly propagate to errors in modeled affinity values while still producing high-quality individual replicate saturation curves.
To test this hypothesis, we first examined the theoretical effects of protein concentration error on affinity. We demonstrate that concentration errors directly manifest as errors in affinity, and that errors from impurity or degradation systematically manifest as an artificially high K_d (lower affinity). Next, we examined the methods and data for sources of purification errors, partial degradation, and complete protein inactivity, and identified evidence of all three.
Finally, we developed a method to control for these sources of protein concentration error and produce affinities with higher accuracy using the existing raw data.
Protein concentration errors propagate directly as errors in derived K_d values. Although binding affinity is a molecular property (affinity is the strength of interaction between a single protein molecule and a single peptide), accurate derivation and calculation of affinity by most methods depends on the accuracy of concentration measurements for the tested protein. In the case of the receptor occupancy model used here, affinity is a derived function of concentration and FP response. Because impurities or degraded protein represent an error between the assumed concentration and the active concentration of a protein, we hypothesized this would propagate to errors in affinity.
We examined the theoretical effect of concentration errors on measured affinity (Fig. 5). Errors in protein concentration due to impurities or degradation cause an overestimation of the true concentration of active protein. Overestimation errors in protein concentration cause errors in K_d, always resulting in a higher K_d (lower affinity) than the true value. This error is linearly proportional to the error in concentration.
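This propagation can be checked algebraically: if only a fraction α of the assumed protein is active, the observed curve is exactly a receptor occupancy curve with apparent K_d equal to K_d/α, so the error in K_d scales linearly with the concentration error. A short numerical check with hypothetical parameter values:

```python
import numpy as np

kd_true, f_max = 0.5, 200.0        # true K_d (uM) and saturation (mP)
alpha = 0.5                        # only 50% of the protein is active

conc_assumed = np.logspace(-2, 1, 12)      # concentration on paper, uM
conc_active = alpha * conc_assumed         # concentration actually binding
f_obs = f_max * conc_active / (kd_true + conc_active)

# Interpreting the signal against the assumed (overestimated)
# concentrations yields an apparent K_d of kd_true / alpha:
kd_apparent = kd_true / alpha              # 1.0 uM, twice the true K_d
f_pred = f_max * conc_assumed / (kd_apparent + conc_assumed)
```

Because `f_obs` and `f_pred` coincide exactly, a fit against the assumed concentrations cannot distinguish a weaker binder from a partially inactive sample: the curve remains a perfect saturation curve, only with an inflated K_d.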
Thus, protein concentration errors due to batch impurities or degradation can manifest as a range of K_d values in replicate measurements made from different batches of protein, all of which would be equal to or higher than the true K_d, while simultaneously coming from high-quality, low-noise replicate fits. This exact phenomenon has also been demonstrated experimentally (31).
Evidence for protein concentration errors due to protein degradation or impurity. The original publications used His6-tagged recombinant SH2 domain protein production methods, and used nickel chromatography as the sole protein purification method. In theory these methods can provide purities of up to 95% (32). However, in practice the results can vary significantly, and can be affected by the amino acid content, non-specific binding, purification conditions, and the type of affinity matrix used (32). Our experience performing these purifications in the lab suggests that differences in purity between different protein preparations are likely to be present. Because the method used to determine protein concentration was absorbance at 280 nm, only total protein content is measured, independent of purity or activity.
If the variance in affinity came from a random (non-systematic) source, we would expect to find no patterns of variance in time. In contrast, if the variance came from batch-related protein degradation or impurities, we might see alternating patterns in affinity over time as different batches are used. For example, if a high-purity protein sample were used on run 1 and a low-purity protein sample were used on run 2, we would expect consistently higher affinities on run 1 and consistently lower affinities on run 2. Or, if a partially degraded protein sample was exhausted mid-run and replaced with a fresh sample, we could see a sudden surge of higher affinity results in the middle of a run when compared to other runs. Similar patterns could arise from batch-to-batch variations in purity affecting the accuracy of the expected concentration.
We examined the data for evidence of these patterns. Since we do not have ground-truth information about the batch or activity of each protein sample, these patterns must be inferred from the data. Although these patterns are difficult to spot due to the nature of the experimental design, we find examples of non-random, run-dependent variations in affinity in the data (Fig. S11). These patterns are not compatible with a random source of variance, and are compatible with either degradation or protein impurity causing errors in protein concentration.
Evidence for complete non-functionality of protein domains. Because we found patterns consistent with partial degradation, we examined the data for patterns of complete protein degradation. Complete degradation, or completely non-functional protein, would be indistinguishable from a non-binding measurement for a single replicate, potentially resulting in a false negative. A control experiment to determine protein functionality would normally be required to delineate these two cases. However, we hypothesized that non-functional protein would manifest within the data as long runs of non-binding results across many replicates, but would demonstrate contradictory evidence of binding on other runs when the protein was not degraded. We found patterns consistent with non-functional protein (Fig. S12). Non-functional protein domains were identified and removed from consideration (see Methods, and Fig. S13 and Fig. S14).
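The idea can be sketched as a simple heuristic: flag a run for a given domain when every replicate on that run is a non-binder while the same domain shows binding on other runs. This is an illustrative simplification with hypothetical data; the actual criteria used are described in Methods and Figs. S13 and S14.

```python
from collections import defaultdict

def flag_nonfunctional_runs(calls):
    """Flag (domain, run) batches where the protein may have been
    completely non-functional: every replicate on the run is a
    non-binder, yet the same domain binds on at least one other run.
    `calls` maps (domain, run) -> list of per-replicate calls.
    """
    binds_anywhere = defaultdict(bool)
    for (domain, _run), replicate_calls in calls.items():
        binds_anywhere[domain] |= "binder" in replicate_calls
    return {
        (domain, run)
        for (domain, run), replicate_calls in calls.items()
        if binds_anywhere[domain]
        and all(c == "non-binder" for c in replicate_calls)
    }

# Hypothetical calls: GRB2 binds on run 1 but is uniformly dark on run 2,
# suggesting the run-2 batch was non-functional; CRK never binds anywhere,
# so its non-binding calls are kept as candidate true negatives.
calls = {
    ("GRB2", 1): ["binder", "non-binder", "binder"],
    ("GRB2", 2): ["non-binder", "non-binder", "non-binder"],
    ("CRK", 1): ["non-binder", "non-binder"],
}
flagged = flag_nonfunctional_runs(calls)
```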
By removing replicates where there is evidence that the protein was non-functional, we avoid the potential for false negatives from this ambiguous data, and greatly improve the pool of true negative calls. Removal of non-functional protein has a significant impact on the numbers of measurements at the replicate level: non-functional replicates made up 37.6% of all replicates (Fig. 3, right side). The large number of runs showing patterns of completely non-functional protein contributes to the overall evidence that protein degradation is present and is a source of variance in the data.

Method for handling replicates with high variance due to protein concentration errors: reporting the minimum instead of the mean
Two key issues arise when considering how to handle replicate measurements when impurities or degradation are suspected to be a primary source of variance. First, without knowing the exact amount of protein concentration error in any one sample, how can this error be controlled for? Second, what is the correct procedure for handling replicates when variation is primarily due to concentration errors and not random sample variation? We propose a simple but novel solution to both questions: reporting the minimum rather than the mean of the replicate measurements yields the most accurate reported measurement.
Impurities and degradation can be partially controlled for by reporting the minimum replicate K_d. Given some unknown amount of protein concentration error due to degradation or impurities, the active concentration of protein will always be equal to or lower than the measured concentration. As we demonstrated above, this means that the true affinity of the protein will always be equal to or greater than the measured affinity. Put in terms of K_d: the true K_d will always be equal to or lower than the minimum measured K_d. Thus, the minimum K_d reflects the closest measured value to the true affinity.
Furthermore, reporting the minimum measured K_d also addresses the variance problem. If the measurements were true replicates, reflecting random noise and experimental error, taking the mean of multiple replicates would be the appropriate procedure, because the sample mean would represent the highest-likelihood estimate of the true population value of affinity. However, if the variation is caused by protein concentration errors, taking the mean of multiple measurements would not reflect the true affinity. Rather, it would inadvertently increase the reported K_d value by some unpredictable amount, which depends on the number of samples and the magnitude of their degradation. In addition, since the mean is particularly affected by outliers, even one severely degraded sample would significantly increase the mean reported K_d value, resulting in a reported affinity with high error. Therefore, odd though it may seem from a statistical perspective, taking the minimum K_d is the most accurate way to handle variation in replicates where overestimation of protein concentration represents the primary source of variation.
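Applying the rule is straightforward; with hypothetical replicate K_d values for a single domain-peptide pair in which several batches were degraded:

```python
import numpy as np

# Hypothetical replicate K_d values (uM) for one domain-peptide pair;
# degraded or impure batches inflate K_d above the true value
replicate_kds = np.array([0.52, 0.55, 1.8, 4.1, 21.0])

mean_kd = replicate_kds.mean()   # ~5.6 uM: dragged up by degraded batches
min_kd = replicate_kds.min()     # 0.52 uM: closest measurement to true K_d
```

Here the two tight low values likely come from the freshest preparations, while a single severely degraded batch (21.0 µM) inflates the mean by roughly an order of magnitude relative to the minimum.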

Revised affinity results and comparison to the original published results
In the results from our revised analysis, 1518 positive (binding) interactions were identified, along with 7038 negative (non-binding) interactions. These ~7000 true negative results represent a significant increase in information over the original publication, in which no true negative interactions were reported. For 3200 interactions, the data were inconclusive or problematic, and no conclusions about their affinity could be drawn. Of those, 2753 potential domain-peptide interactions remain unevaluated due to non-functional protein. Final affinity values were plotted for all peptide-domain interactions as a heat map (Fig. 6), and summarized by category of interaction and changes in calls (Fig. 7). Despite similar numbers of positive interactions between the original and revised results (1519 vs. 1518), the identities of the domain-peptide pairs comprising the positive interactions changed significantly (Fig. 7). More than 17% of the original positive interaction calls changed to either non-interactions or rejected results due to data quality issues. In the final model, 166 interactions originally called positive in the published results are found to be true negative interactions. These changes are primarily due to the ability to avoid fit artifacts and false-positive results, a consequence of using multiple models to fit the data. Similarly large changes were found in the originally published negative interactions, where 273 formerly rejected interactions are now classified as true positive interactions. These recovered results are primarily due to using offset fits instead of background subtraction, and using an appropriate quality metric to determine which model fits best. Changes in calls by class are visualized in Fig. 7, while the identities and magnitude of the domain-peptide pairs with changed calls are visualized in Fig. S15. Results from the original publication are visualized in Fig. S16.
Furthermore, even though 1245 domain-peptide pairs were found to bind in both the original publication and our revised analysis, the quantitative affinity of those binders changed significantly in the revised analysis (Fig. 8). Note that although the minimum of each replicate group was selected as most accurately reflecting the true affinity, our revised affinity values are not all lower than the original publication's. This is primarily due to significant changes at the replicate level: some original replicates were removed from consideration by changes in the fitting process, and a number of new replicates were included in each replicate set.

Independent evaluation of revised analysis: measuring improved consistency via active learning
We wanted to evaluate our revised analysis against the original results. This is difficult to do directly because the original samples are no longer available. However, one way to evaluate the data is to use machine learning methods to ascertain whether the revised dataset has better internal consistency and predictive power than the original dataset. Lacking a biological reference, this approach seemed fitting, as we originally wished to harness SH2 domain binding measurements in machine learning frameworks to extrapolate from the relatively small number of available measurements.
To do this, we implemented active search, a machine learning approach that is highly amenable to biochemistry problems such as this one. Active learning (also known as optimal experimental design or active data acquisition) is a machine learning paradigm in which available data are used to select the next best experiments to maximize a specific objective. Active search is a realization of this framework where the objective is to recover as many members of a rare, valuable class as possible. In this case, where only 13.9% of the original dataset represents positive interactions between an SH2 domain and a phosphopeptide (18.2% in the revised dataset), the objective of the search algorithm was to prioritize each sequentially selected interaction to maximize the total number of positive interactions discovered. We implemented the efficient nonmyopic search (ENS) algorithm (33) with the goal of maximizing the total positive experiments identified in an allocated search of 100 queries. The algorithm was seeded randomly with one example positive before search progressed, and the search was repeated 50 times.
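ENS plans nonmyopically over the remaining query budget; as an illustration of the general idea only, the sketch below runs a simpler one-step greedy active search with a smoothed nearest-neighbor probability model on synthetic data (the dataset, features, k, and smoothing are stand-ins, not the paper's; the real search used DPPS peptide features):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the interaction dataset: feature vectors with a
# rare, clustered positive class
X = rng.normal(size=(500, 3))
y = (X[:, 0] + X[:, 1] > 1.5).astype(int)

def knn_prob(i, labeled, k=5):
    """Estimate P(positive) for point i from its k nearest labeled points."""
    idx = np.fromiter(labeled, dtype=int)
    d = np.linalg.norm(X[idx] - X[i], axis=1)
    nearest = idx[np.argsort(d)[:k]]
    return (y[nearest].sum() + 0.1) / (len(nearest) + 0.2)   # smoothed

# Greedy (one-step) active search: always query the unlabeled point with
# the highest estimated probability of being positive.  ENS additionally
# plans ahead over the remaining budget.
labeled = {int(np.flatnonzero(y)[0])}        # seed with one known positive
budget = 100
for _ in range(budget):
    unlabeled = [i for i in range(len(X)) if i not in labeled]
    pick = max(unlabeled, key=lambda i: knn_prob(i, labeled))
    labeled.add(pick)

found = int(sum(y[i] for i in labeled))
print(f"positives found: {found} in {budget} queries "
      f"(random expectation ~{budget * y.mean():.0f})")
```

Because the positives cluster in feature space, even this myopic policy recovers far more positives than random sampling, which is the behavior the ENS experiments quantify on the real datasets.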
ENS showed improved average performance and higher consistency on our revised dataset. First, ENS worked effectively on both the original and revised datasets, identifying far more positives by the 100th query than expected by random chance (Fig. 9). This suggests that phosphopeptide sequences do encode information about whether an SH2 domain will recognize them in a binding interaction. Second, ENS performance on the revised dataset was higher than on the original dataset on average, finding 45.3 positives vs. 33.3 positives (p-value of 4e-12). Third, ENS performance was significantly more variable on the original dataset than on the revised dataset, ranging between 9 and 62 positives over 50 trials (average 33.3), compared to a range of 38 to 67 (average 45.3) for the revised dataset. In the worst of the 50 trials, search on the original dataset underperformed random chance by 50%, whereas the worst trial on the revised dataset still outperformed random chance by two-fold. Thus, the improved average performance and lower variability on our revised results suggest improved coherency in our revised analysis over the original published results.

Discussion
Here, we performed a revised analysis of raw data from SH2 domain affinity experiments. We presented a new analysis framework which improves on the model fitting and evaluation methods of previous work. We report high-confidence true positive interactions, add thousands of true negative interactions, and remove false negative results caused by inactive protein. We also report the minimum replicate measurement instead of the mean, an appropriate approach when protein concentration overestimation errors are the largest source of variance in the data.
Although raw data from only two experiments was available for detailed analysis, we were fortunate that it consisted of a large quantity of measurements from FP, a well-established solution-based experimental system commonly used for analytical biochemical assays. All in vitro experimental methods have limitations when attempting to understand behavior in vivo, but early high-throughput experiments used arrays with additional limitations and biases toward higher-affinity interactions (13). Those experiments had either the peptide (11,12) or the protein (7-10) mounted on a surface, making them less preferable than a method where both molecules are measured in solution. So despite the limited availability of raw data, the data available is likely to be the best type for further analysis.
We saw very high variance in affinity within replicate measurements in this data. On its face, such high variation suggests a significant problem with either the experimental design or the experimental method. At a minimum, it suggests that another, uncontrolled variable is being measured instead of the desired variable being tested. In the worst case, the remedy requires identifying and controlling for the source of variation, and redoing the experimental measurements. Even the authors of the original publication argued that the "greatest source of variability in the FP assay...is batch-specific differences in protein functionality." (13) However, we have shown that the patterns found in the data are consistent with protein concentration errors, and that the likely sources of error (purification and degradation) result in overestimation of protein concentration. Because these types of errors all result in unknown amounts of active protein concentration overestimation, reporting the minimum replicate Kd for each domain-peptide pair yields the value closest to the true affinity of the protein.
In this analysis we implemented a simple metric of quality, SNR, which weights the total fit error by the size of the signal. The SNR metric was effective at eliminating suspect fits while rejecting very few high-quality measurements (Fig. S9). Nevertheless, this metric may not be appropriate for all types of data, particularly data with a large prevalence of single-point outliers. We extensively explored alternate metrics, including confidence intervals (CIs). Bootstrapped CIs, established by parametric bootstrapping via residual resampling, can add more information than a single fit result because they provide a range of certainty for a given measurement. However, we found that this method had significant limitations on this data and performed worse than the SNR metric. In this data, bootstrapped CIs are even more vulnerable to errors from outliers, are limited by small sample sizes (only 12 residuals per measurement), and suffer from heteroscedasticity of residuals (causing the high variability in low-concentration data points to be assigned to high-concentration data points), ultimately resulting in unrealistic intervals for affinity.
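A metric of this kind is straightforward to compute; the sketch below assumes the form "fitted signal amplitude over root-mean-square residual," which is an illustrative assumption and not necessarily the exact definition of equation 3:

```python
import numpy as np

def snr(fp_observed, fp_fitted, fmax_fitted):
    """Signal-to-noise ratio of a fit: the fitted signal amplitude (Fmax)
    divided by the RMS residual.  ASSUMED form, standing in for the
    paper's equation 3."""
    resid = np.asarray(fp_observed) - np.asarray(fp_fitted)
    return fmax_fitted / np.sqrt(np.mean(resid ** 2))

# A fit whose residuals are small relative to its amplitude scores high ...
fit = np.array([50.0, 120.0, 180.0, 210.0])
print(snr(fit + 1.0, fit, 200.0))   # constant 1 mP residual -> SNR = 200
# ... while a low-amplitude fit of the same noise falls below the cutoff
print(snr(fit + 1.0, fit, 0.5))     # SNR = 0.5 < 1 -> rejected
```

Dividing amplitude by residual error is what makes the metric robust for large-signal fits while filtering out the low-amplitude artifact fits described in the supporting figures.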
Several analysis methods implemented in the original publications served as sources of randomizing error, and may suggest a reason for the failure to agree with other published SH2 interaction experiments. First, background subtraction caused an unpredictable increase or decrease of affinity due to forced errors in model fitting. The magnitude of the error depended on whether the published background was higher or lower than optimum, and on the affinity of the interaction being measured. Even small deviations could result in significant errors. Another seemingly innocuous choice -averaging multiple replicates containing degraded protein -is likely to be a significant source of error in the originally published results from this experiment. Taking the mean of multiple replicates is a standard practice, but serves to randomize reported values when protein concentration overestimation is the primary source of variation.
Other high-throughput SH2 domain-peptide experiments share many critical methods with the data reviewed here. In all published experiments measuring affinity, protein was minimally filtered after production. The limited purification is likely to result in errors in protein concentration measurements due to inactive protein contaminants. Furthermore, in none of the experiments was protein assessed for activity before being measured. This has two critical consequences: the inability to separate true non-binding results from false negatives due to non-functional protein, and additional errors in active protein concentration with respect to the measured protein concentration. Even if protein concentration errors were due solely to purification, they could be the cause of the significant discrepancies between published numerical results. Furthermore, incorrect use of statistical methods to evaluate models was common to all published work, particularly the improper use of the coefficient of determination (r²) to determine the quality of fit of a non-linear model, and the use of only a single model to fit the data. These choices result in a high false negative rate, and mask the high variance in replicates that our revised analysis revealed. Our results suggest that, if the raw data were available, some of these issues could be corrected in other experiments. However, due to the lack of correlation between any published high-throughput SH2 domain datasets, and the likelihood that similar issues plague all similar datasets, we recommend against the use of these previously published datasets in future research or models of SH2 domain behavior. We further recommend that all derivative work be carefully reviewed for accuracy.
We want to address the best uses of the revised affinity results we present, as well as the limits of the current analysis. The negative interactions we report represent a significant improvement over theoretical methods of simulating negative interactions (18), as they are based on real measurements rather than statistical assumptions.
Furthermore, the negative interactions we report are controlled for false negative results from non-functional protein, something no other SH2 domain data can claim. Thus, our revised results have significant potential to improve the quality of models built on categorical (binary) binding data. The limitation of the quantitative data we report is that the highest-affinity measured value may not be the true affinity if a fully functional protein was never measured. Nevertheless, the highest measured affinity still represents the measured value closest to the true value. However, not all variation in the data is consistent with our hypothesis of protein concentration error, and some variation may represent other, unknown sources of variation which we have not controlled for. For example, one key assumption of the receptor occupancy model requires measuring the reaction at equilibrium. Since no data is provided to prove that the 20-minute incubation time given to all samples was sufficient to bring all reactions to equilibrium, it is possible that some variation is due to measurements made in non-equilibrium conditions.
It is concerning that an entire body of published work has developed from this class of problematic results. These experiments have had a wide-reaching effect in many areas of SH2 domain research: the data has been used to draw specific conclusions about SH2 domain biology such as identification of EGFR recruitment targets (34), to explain quantitative differences in RTK signaling (9), and as evidence to understand the promiscuity of EGFR tail binding (35). In addition, this work has been used to guide experimental design by filtering potential binding proteins by affinity (36), to reconcile confusing experimental results (37), and to guide new experimental hypothesis testing (38). It has played a role in cancer research as context to understand kinase dependencies in cancer (39), and as evidence of HER3 and PI3K connections as relevant to PTEN loss in cancer (40). It has influenced evolutionary analysis (41), been used to design mechanistic EGFR models (42,43), and has been used in algorithms for domain binding predictions (14-18, 44).
Finally, we would like to discuss best practices for future data gathering and reporting. HTP studies have great value, and provide a vast quantity of often never-before-measured data. These methods have been useful for a wide variety of domain-motif interactions, for example SH3-polyproline interactions (45,46), PDZ domains interacting with C-terminal tails (47)(48)(49), and major histocompatibility complex (MHC) interactions with peptides (50,51). However, errors in these studies propagate just as rapidly into the research results of other investigators. This suggests that an even higher than normal standard of care is necessary when evaluating such publications. A set of best practices for HTP methods should be established in the community. We recommend that all raw data from high-throughput experiments be published, along with all code used to process that data. This would make the initial data far more valuable for future research, much like the raw arrays stored in the Gene Expression Omnibus (GEO), or the raw experimental measurements stored along with the protein structure in the Protein Data Bank (PDB). To this end, we have provided the original raw data and our full revised data (including intermediate steps) on Figshare (DOI: https://doi.org/10.6084/m9.figshare.11482686.v1), and provided the code for the analysis pipeline on GitHub (https://github.com/NaegleLab/SH2fp) so that future evaluation can be more easily accomplished by other researchers. Although portions of our code are highly specific to the format of these datasets, the code is written in a modular fashion that can be easily repurposed in other studies. We also recommend that methods for quantifying activity should be a best practice in studies quantitatively measuring protein. Alternatively, methods which do not depend so heavily on accurate protein concentration should be preferred.
One such concentration-independent method of measuring interaction affinity was recently developed by the Stormo lab (52). In that method, a two-color competitive fluorescence anisotropy assay measures the relative affinity of two interactions in solution. By measuring the interaction against two peptides at once from the same pool of protein, the concentration of the protein and the proportion of active protein are the same in both interactions. When the ratio is calculated, the concentration and activity terms drop out of the calculation of affinity. Although this method only provides relative affinity, if one could carefully establish absolute affinity for a single peptide (or panel of peptides), absolute affinity could be extended to all interactions. Another recent experiment also uses competitive fluorescence anisotropy, but measures a competitive titration curve in a single well with an agarose gradient (53). Diffusion forms a spatiotemporal gradient for the interaction, so one can produce a full titration curve in each well of a multi-well plate, measuring both affinity and active protein concentration simultaneously.
Regardless of the specific method, it should be a best practice to account for or control for the concentration of active protein within the measurement of total protein concentration.

Methods
Raw Data. Upon receipt of the Jones 2012-14 raw data, we examined the data for consistency and completeness. We found that the data did not cover all interactions described in the original publication. However, by limiting our revised analysis to interactions of single SH2 domains with phosphopeptides from the ErbB family, as well as KIT, MET, and GAB1, we were able to limit the effect of missing raw data. Within this scope, only a handful of individual replicate interactions were missing (approximately 138 replicate-level measurements out of over 37,000) and were limited to 3 domain-peptide pairs. Within this revised scope, the available raw data covered approximately 99.6% of the originally available raw data.
The raw data for each measured interaction consisted of fluorescence polarization measurements of an SH2 domain in solution with a phosphopeptide at 12 concentrations. The measurements were arranged on 384-well plates: 32 different SH2 domains at each of 12 concentrations, all measured against a single peptide per plate. Protein concentrations represented 12 two-fold (50%) serial dilutions starting from either 10 µM or 5 µM protein.
Model Fitting, Model Selection, and Replicate-Level Calls. For each replicate measurement, we fit two models: the linear model (equation 4) and the receptor occupancy model (equation 2). Model fits were evaluated with the bias corrected Akaike Information Criterion (AICc), and the model with the lower AICc score was selected (19).
The Akaike Information Criterion (AIC) was calculated as

AIC = 2p - 2 ln(L),

where p is the number of parameters in the model and ln(L) is the maximum log-likelihood of the model. For a non-linear fit with normally distributed errors, ln(L) is calculated as

ln(L) = -(N/2) [ln(2π) + ln(RSS/N) + 1],  with RSS = Σ x_i²,

where x_1, ..., x_N are the residuals from the non-linear least squares fit and N is the number of residuals. The bias-corrected form of AIC, referred to as AICc, is a variant which corrects for small sample sizes, e.g. when one has fewer than 30 data points. AICc is calculated as

AICc = AIC + 2p(p + 1)/(n - p - 1),

where n is the sample size and p is the number of parameters in the model (19). Each replicate had a sample size of 12. The receptor occupancy model had three parameters (affinity (Kd), saturation level (Fmax), and offset (F0)), while the linear model had two parameters (slope (m) and background offset (F0)).
Replicates that were fit best by the linear model with a slope of less than or equal to 5mP/µM were categorized as negative interactions, or 'non-binders'. Linear fits with a slope greater than 5mP/µM were categorized as aggregators. Replicates that were fit best by the receptor occupancy model were subsequently evaluated for signal to noise ratio (SNR, equation 3). If the SNR was greater than one, the replicate was categorized as a positive interaction or 'binder', otherwise, it was rejected as a low-SNR fit and removed from consideration.
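The fitting, selection, and classification steps above can be sketched as follows (a sketch, not the published pipeline; the SNR formula, fitted amplitude over RMS residual, is an assumption standing in for equation 3):

```python
import numpy as np
from scipy.optimize import curve_fit

def receptor_occupancy(c, kd, fmax, f0):
    """Equation 2: one-to-one binding saturation curve."""
    return f0 + fmax * c / (kd + c)

def linear(c, m, f0):
    """Equation 4: linear (non-binding or aggregating) response."""
    return f0 + m * c

def aicc(residuals, p):
    """Bias-corrected AIC for a least-squares fit with normal errors."""
    n = len(residuals)
    rss = float(np.sum(np.square(residuals)))
    log_l = -n / 2 * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    return 2 * p - 2 * log_l + 2 * p * (p + 1) / (n - p - 1)

def classify_replicate(conc, fp, slope_cutoff=5.0, snr_cutoff=1.0):
    """Fit both models, select by AICc, then apply the decision rules."""
    p_ro, _ = curve_fit(receptor_occupancy, conc, fp,
                        p0=[1.0, max(fp), min(fp)], maxfev=5000)
    p_lin, _ = curve_fit(linear, conc, fp)
    r_ro = fp - receptor_occupancy(conc, *p_ro)
    r_lin = fp - linear(conc, *p_lin)
    if aicc(r_lin, 2) <= aicc(r_ro, 3):
        # Linear wins: slope <= 5 mP/uM -> non-binder, else aggregator
        return "non-binder" if p_lin[0] <= slope_cutoff else "aggregator"
    # Receptor occupancy wins: keep only fits with sufficient SNR
    snr = p_ro[1] / np.sqrt(np.mean(r_ro ** 2))   # assumed SNR form
    return "binder" if snr > snr_cutoff else "rejected (low SNR)"
```

Comparing both models by AICc, rather than fitting the receptor occupancy model alone, is what prevents the rapidly-saturating fit artifacts on non-binding data described in the supporting figures.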
Identifying Non-Functional Protein. Once all individual fits were complete, runs were examined for non-functional protein. If an entire run lacked even one positive binding interaction, and those same interactions measured positive on another run, the non-binder, aggregator, and low-SNR calls on that run were changed to non-functional protein and removed from consideration.
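This rule reduces to a grouped filter over replicate-level calls; the sketch below uses a hypothetical data frame and simplifies the rule to "flag any run of a domain with zero positive calls," which also covers the conservative case (described in Fig. S12) where a domain is never positive on any run:

```python
import pandas as pd

# Hypothetical replicate-level calls (domains, runs, peptides illustrative)
calls = pd.DataFrame({
    "domain":  ["GRB2", "GRB2", "GRB2", "GRB2", "BMX", "BMX"],
    "run":     [1, 1, 2, 2, 1, 2],
    "peptide": ["p1", "p2", "p1", "p2", "p1", "p1"],
    "call":    ["non-binder", "non-binder", "binder", "non-binder",
                "non-binder", "non-binder"],
})

def flag_nonfunctional(calls):
    """Mark every (domain, run) group with zero positive calls as
    non-functional protein; runs containing at least one positive call
    act as a positive control and are left untouched."""
    run_has_pos = (calls["call"].eq("binder")
                   .groupby([calls["domain"], calls["run"]])
                   .transform(lambda s: s.any()))
    out = calls.copy()
    out.loc[~run_has_pos, "call"] = "non-functional"
    return out

print(flag_nonfunctional(calls))
```

In this toy example, GRB2 run 1 (no positives, but positives on run 2) and both BMX runs (never positive) are flagged, while GRB2 run 2 keeps its original calls.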
Replicate Handling for Domain-Peptide Measurements. For each domain-peptide pair, only replicates that were marked as binders with sufficiently high signal-to-noise ratio (SNR) were considered. For a given domain-peptide pair, the minimum numeric value of Kd (representing the strongest affinity) was reported as the final Kd for that domain-peptide pair.
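This replicate-handling rule is a grouped minimum; a sketch with hypothetical values (domain and peptide names and Kd values are illustrative only):

```python
import pandas as pd

# Hypothetical binder-call replicates (Kd in uM; names illustrative only)
reps = pd.DataFrame({
    "domain":  ["GRB2", "GRB2", "GRB2", "PIK3R1", "PIK3R1"],
    "peptide": ["pep1", "pep1", "pep1", "pep1", "pep1"],
    "kd_um":   [3.1, 6.4, 12.0, 0.8, 1.5],
})

# Report the minimum Kd per domain-peptide pair: under concentration
# overestimation, the lowest-Kd replicate is closest to the true affinity
final = reps.groupby(["domain", "peptide"])["kd_um"].min()
print(final)
```

Here the reported Kd values would be 3.1 µM and 0.8 µM, the strongest affinity observed in each replicate group.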
Active search. The probability model was a nearest-neighbor model, with nearest neighbors defined by the distance

d_nn(x, y) = Σ_{i=1..n} d_e(dpps(x_i), dpps(y_i)),

where d_nn is the distance used to define nearest neighbors, d_e is the Euclidean distance, n is the number of amino acids in the peptide (here n = 9), and dpps(x_i) is the DPPS feature vector of the i-th amino acid in peptide x.

Figure caption (simulating the effect of degraded protein): Measurements in the second column are the same data as the first column, but plotted as semi-log plots with a logarithmic concentration axis. In rows 2 and 3, the results of 50% and 75% degradation are shown. To simulate the effect of degraded protein, we plotted true activity from the ideal curve (row 1) against the erroneous assumed concentration due to degradation (rows 2 and 3). For example, to simulate 50% degradation (row 2), the true FP response for 5 µM (from row 1) is plotted at the 10 µM position (on row 2). This procedure is repeated at each concentration. When affinity is derived from these degraded protein measurements, the result is an inaccurate Kd, higher than the true value. Although the concentration error from degraded protein causes a non-linear change in FP, the error in Kd is linear and proportional to the concentration error. For example, if the true active protein concentration is ½ of the assumed concentration (as in row 2), the measured affinity is ½ of the correct value (meaning the Kd is 2x the true value). If the true active protein concentration is ¼ of the assumed concentration (as in row 3), the measured affinity is ¼ of the correct value (the Kd is 4x the true value). Therefore, errors from overestimation of protein concentration always result in a higher measured Kd than the true value.

Figure caption (comparison of revised and original affinities): Despite choosing the minimum measured value for Kd, our revised data often reports higher Kd results than the original publication (i.e. results below the diagonal). This is due to different categorization and filtering procedures, which result in significant additions to, and removals from, the individual measurements in each set of replicates for a domain-peptide pair.
It is interesting to note that correlation does not improve at higher affinity (lower Kd), despite the fact that the chosen raw measurement range is tailored for highest accuracy for Kd < 1.0 µM. This suggests that the differences between our revised results and the original results are independent of the accuracy of the original measurements, and more likely due to the need to correctly handle variation caused by protein concentration errors.

New analysis pipeline for high-throughput domain-peptide affinity experiments improves SH2 interaction data
Tom Ronan 1 , Roman Garnett 2 , and Kristen Naegle 3

Supporting Information
Final Revised Affinity Data.xlsx - contains the interaction affinities between domains and phosphopeptides based on our revised analysis.

Figure caption (the receptor occupancy binding curve): The binding curve shows increasing saturation with increasing fractional occupancy, with full saturation achieved at Fmax. The affinity (quantified by the Kd) can be derived by fitting a curve to the data, but can also be derived graphically as seen above. At equilibrium, the Kd is equal to the concentration at which ½ of the receptor is occupied (½ Fmax). In the semi-log curve, where the concentration axis is in log10 scale, the Kd can be identified easily because it corresponds to the inflection point at the center of the s-shaped curve.

Figure caption (background levels; panels: High Background, Low Background, Background Coincides with Offset): The background FP level should represent the level below which measurements are random noise. For some measurements, background values were higher than many of the individual low-concentration measurements (first row). In these cases, the high quality of the data and the fits seems to contradict the limits imposed by the reported background. In other cases, the background values were significantly lower than the signal level of the lowest-concentration measurements (second row). In some cases, the published background value matched the offset value from the model fit (third row). Because of these discrepancies in background values and the use of the background subtraction method in the original publication, we decided to evaluate the background data and quantify the effect of background subtraction on the accuracy of the model fits (see Fig. S4).

Figure caption (quantifying background subtraction error): A background 10% higher than optimal (top row, green) causes a change in the fitted curve (top row, green dashed line), which results in a derived Kd of 3.59 µM (a +179.5% error). A background 10% lower than optimal (top row, orange) causes a change in the fitted curve (top row, orange dashed line), which results in a derived Kd of 1.13 µM (a -40.5% error). Background subtraction always results in an increase in residual error (middle row, left), which results in an error in the fitted affinity parameter (middle row, right). Background errors are non-linear with affinity, and result in larger errors at lower affinities (middle row, left and right). A background error of ±5% results in a -21%/+26% error in Kd at 1.0 µM, and a -52%/+195% error in Kd at 10.0 µM. Even a 2% background error results in a -9%/+10% error in Kd at 1.0 µM, and a -27%/+43% error in Kd at 10.0 µM. Over 25% (652/2569) of all replicate measurements that demonstrate positive interactions have background errors greater than ±5%, and over 54% (1408/2569) have background errors greater than ±2%.
Thus, background subtraction causes errors in affinity which can increase or decrease the apparent affinity, depending on whether the background is high or low, and which scale non-linearly with affinity.
Since the relationship of the background is seemingly random, and the error factors are non-linear, background subtraction acts as a significant source of random error. Based on these findings, we rejected the background subtraction method in favor of a fitted offset (Equation 2). Background error is expressed as a percent of the saturation value, and is defined as the difference between background and true F0 divided by the difference between true F0 and Fmax (saturation).

Figure caption (concentration range coverage): For an ideal binding saturation experiment attempting to quantify Kd, the concentrations tested should span above and below Kd, and the highest and lowest measured concentrations should establish the plateaus seen on semi-log saturation plots (all rows, second column). For interactions with 0.1 µM Kd (first row), on the semi-log plots, data points are evenly distributed on either side of the inflection point, and establish the lower plateau of no signal and the upper plateau of saturation. For interactions with a Kd of 1.0 µM (second row), the upper plateau of the semi-log saturation curve no longer has any coverage from the data. For interactions with Kd > 5 µM (rows 3 and 4), no data points are found above Kd, which significantly increases potential inaccuracies in model fitting. Thus, the concentration ranges chosen make this experiment best suited to identify affinity in the 0.05 µM to 0.5 µM range. Since data in the original publication was reported up to 20 µM, results with low affinities (higher Kd values) are likely to be less accurate. In addition, this pattern of coverage suggests that every data point is critical for accuracy, particularly for concentrations above Kd.

Figure caption (fit artifacts): Upon close examination (row 1; see second column for a zoomed view of the same measurement), noise in individual data points can be more clearly visualized. Using a least-squares algorithm to fit the receptor occupancy model can result in fit artifacts for non-binding data.
The fitting errors follow several patterns. In the first, noise in the data is over-fit, resulting in a rapidly saturating curve rather than a straight horizontal line (row 1, blue dashed line). This saturation curve fits the data poorly and often has a low saturation value (on the order of 5 mP units). Ironically, this artifact results in miscategorization as a binder with a high affinity, rather than the true result reflecting a failure to interact. In a similar type of fit artifact (row 2), all but one data point is considered to be at saturation, while a single point sets the rest of the saturation curve, resulting in an artificially low Kd. In the third pattern (row 3), when non-specific binding or aggregation is present, data can also present as a line with a high slope showing no signs of saturation. A saturation value of this size cannot result from the one-to-one interaction assumption of the receptor occupancy model, and clearly represents a fit artifact, likely aggregation or other non-specific binding. AICc scores are calculated and compared between models; the model with the lowest AICc score is selected as the best fit. If a linear fit is chosen, and the slope is less than 5 mP/µM, the interaction is classified as a non-binding interaction.

Figure caption (replicate variation): Each individual measurement represents a high-quality fit to the receptor occupancy model, yet the resulting affinities vary from ~3 µM to ~12 µM. It is clear from the quality of each measurement that the variation is not due to noisy data or fitting artifacts. Rather, each measurement seems to be a high-quality result of different affinity behavior. Although we don't have an exact time for each measurement (and the same peptides were typically measured only once per run), we do have a pseudo-time substitute in the data.
On each run, the peptides were measured in approximately the same order, and hundreds of peptides were measured in each run, which allows us to see patterns of protein affinity over time and across peptides from run to run. For example, in PIK3R2-N (upper left panel) we see that Run 3 replicates almost always have lower Kd values (higher affinities) than replicates from other days. This pattern of run-to-run variation suggests that the protein samples tested in Runs 1 and 2 may have been degraded, or may have come from different protein batches with varying levels of impurities. For RASA1-N (right panel), no single day dominated the highest affinity until plate 174, after which the highest-affinity replicates all come from Run 1. A protein sample exhausted mid-run and replaced with a fresh sample could manifest as a surge of increased affinity in the middle of a run of lower affinity, such as seen in Run 1. Not all protein data shows such clear patterns. For SH2D2A (lower left panel), significant variation appears during Runs 1 and 3 with no run showing the lowest Kd, but Run 2 shows consistently higher Kd values, as well as many failures to detect interactions (blank Kd values), suggesting degradation or impurities in the protein from Run 2. The patterns for SH2D2A are not as clearly consistent with a simple concentration error hypothesis, and may be indicative of additional sources of variation.
Fig. S12: Non-Functional Protein Identification - Examples. In order to examine the data for patterns of non-functional protein, we plotted affinity by domain and by run. The results before (left column) and after (right column) non-functional protein identification (NFPI) are shown for three protein domains. A lack of even one positive interaction on an entire run is suggestive of non-functional protein. When other runs of the same protein show positive interactions, the runs with no positive interactions are considered to be non-functional, and all measurements in these runs for the protein are removed from consideration. For example, with GRB2 (row 1), runs 2 through 4 showed some positive interactions. On run 1, however, no measurements indicated positive interactions. The lack of even one positive interaction in run 1 suggests that the protein was completely degraded or non-functional. The presence of positive interactions in the other runs acts as a positive control. Run 1 is then marked in blue in the right panel for GRB2, and removed from consideration. A different case of non-functional protein can be seen with BMX (row 2). For BMX, no positive interactions were found on any run. Although it is a formal possibility that BMX simply binds none of these peptides, we have no information that the protein was ever active, thus we conservatively identify all runs as non-functional. For PIK3R1-C, no measurements on the fourth run were positive interactions, while other runs contain positives, thus run 4 was categorized as non-functional. A binder is identified by a green cell, a non-binder by a white cell, non-functional protein by a blue cell, and a non-measured interaction by a gray cell.