Unique metabolite preferences of the drug transporters OAT1 and OAT3 analyzed by machine learning

The multispecific organic anion transporters, OAT1 (SLC22A6) and OAT3 (SLC22A8), the main kidney elimination pathways for many common drugs, are often considered to have largely-redundant roles. However, whereas examination of metabolomics data from Oat-knockout mice (Oat1 and Oat3KO) revealed considerable overlap, over a hundred metabolites were increased in the plasma of one or the other of these knockout mice. Many of these relatively unique metabolites are components of distinct biochemical and signaling pathways, including those involving amino acids, lipids, bile acids, and uremic toxins. Cheminformatics, together with a “logical” statistical and machine learning-based approach, identified a number of molecular features distinguishing these unique endogenous substrates. Compared with OAT1, OAT3 tends to interact with more complex substrates possessing more rings and chiral centers. An independent “brute force” approach, analyzing all possible combinations of molecular features, supported the logical approach. Together, the results suggest the potential molecular basis by which OAT1 and OAT3 modulate distinct metabolic and signaling pathways in vivo. As suggested by the Remote Sensing and Signaling Theory, the analysis provides a potential mechanism by which “multispecific” kidney proximal tubule transporters exert distinct physiological effects. Furthermore, a strong metabolite-based machine-learning classifier was able to successfully predict unique OAT1 versus OAT3 drugs; this suggests the feasibility of drug design based on knockout metabolomics of drug transporters. The approach can be applied to other SLC and ATP-binding cassette drug transporters to define their nonredundant physiological roles and for analyzing the potential impact of drug–metabolite interactions.

OAT1 and OAT3, which are both expressed on the basolateral aspect of renal proximal tubule epithelial cells, represent the key rate-limiting step for the transport and removal of protein-bound organic anion molecules from the blood and, as such, are essential for the elimination of a wide array of small organic molecule drugs, toxins, metabolites, signaling molecules, uremic toxins, and natural products (5, 17). Although OAT1 and OAT3 are generally considered the main organic anion "drug" transporters in the kidney, as well as other tissues (6, 7, 18-20), their physiological role in transporting endogenous metabolites (21-25) and in mediating inter-organ crosstalk (the "Remote Sensing and Signaling Theory") is rapidly becoming apparent (5, 9, 26-31).
Nevertheless, much remains to be elucidated about the mechanisms driving the interaction of SLC22 transporters, like OAT1 and OAT3, with their endogenous (nonxenobiotic) substrates or ligands, particularly with regard to the specificity of the transporters. With respect to pharmaceutical drugs, it is often assumed, based on overlapping lists of drug ligands (3, 4, 6, 7), that OAT1 and OAT3 have similar substrate specificity. In the absence of protein crystal structures in multiple functional states, much research has been performed using ligand-based analyses of the substrates (13, 32-36). However, it has generally not been easy, using earlier data and methods, to clearly distinguish drugs that bind OAT1 from those that bind OAT3, with the caveat that some OAT3 ligands appear to have a cationic character (33, 37).
The drug-transporter interaction datasets, apart from being based on in vitro assays done by different labs (6, 7), are unavoidably limited in scope. For instance, certain classes of FDA-approved drugs transported by OATs (e.g. β-lactam antibiotics and NSAIDs) are over-represented. In addition, the drug-interaction dataset relies heavily on data derived from transport inhibition assays rather than actual substrate transport data, which tend to be more limited (33, 38). Thus, the usefulness of the drug dataset is likely to be less than ideal if one is interested in identifying the molecular properties that might uniquely target a ligand for interaction with a particular transporter (e.g. OAT1 or OAT3). An experimental evaluation of an unbiased selection of transporter substrates is a preferred input for a training set.
Recent metabolomics analyses of the plasma of OAT1 and OAT3 knockout mice (Oat1KO and Oat3KO) have considerably expanded the list of potential endogenous OAT ligands to over 200 (28,34,36,39). Such studies are based on the idea that the absence (deletion) of an uptake/influx transporter, such as OAT1 or OAT3, from the basolateral membrane of the proximal tubule cell in the knockout kidney should result in the inability to remove endogenous substrates that are exclusively dependent upon the absent transporter for clearance from the blood. Consequently, this should lead to the accumulation of endogenous metabolite ligands of the deleted transporter in the plasma of each knockout animal (28,34,36,39). Comparison of such metabolomics data between the two knockout animals should theoretically allow for the elucidation of potentially unique endogenous ligands for each of these two SLC transporters. This is, in fact, the case with almost 100 endogenous metabolites being found that "uniquely" accumulate in the plasma of the Oat1KO (but not in the Oat3KO), whereas another nearly 50 metabolites uniquely accumulate in the plasma of the Oat3KO mice (but not the Oat1KO) (28,34,36,39).
Although some comparisons of the physicochemical features of endogenous metabolites interacting with these two multispecific transporters have been performed (28) and identified a few general features, a thorough computational and systematic analysis based on specific molecular properties/features that predispose endogenous metabolites to interact with OAT1 or OAT3 has yet to be carried out. This is important from both a physiological as well as a pharmaceutical perspective. Physiologically, OAT1 and OAT3 are major pathways through which the proximal tubule of the kidney interfaces with the rest of the body through the transport of numerous small organic molecules that function in normal metabolism, signaling, regulation of redox state, and the uremic metabolism of chronic kidney disease. If OAT1 and OAT3 have different structural preferences for metabolites and signaling molecules, this implies different effects on systemic metabolism, which has been implied by genome-scale metabolic reconstructions (36,40,41). Moreover, the possibility of differentially regulating OATs and other multispecific transporters that transport different sets of metabolites in various organs would also suggest, as described in the Remote Sensing and Signaling Theory, a mechanism for a relatively small number of transporters to exert major effects on metabolism in normal and disease settings (42).
In this study, cheminformatics and machine-learning techniques were combined to analyze the quantitative molecular properties of unique metabolites accumulating in vivo in the Oat1KO mice versus unique metabolites in the Oat3KO, with the goal of identifying a set of molecular properties (features) that enables clear and concise distinction of the physiologically relevant molecules resulting from loss of either of the two transporters.
The analyses performed here provide a deeper understanding of the important physicochemical features driving endogenous ligand selectivity for OAT1 or OAT3 in vivo. Importantly, this analysis also connects OAT1 versus OAT3 unique endogenous metabolite physicochemical features to unique pathways fundamental to fatty acid, bile acid, amino acid, and peptide metabolism and signaling. The approaches employed here, combining multispecific drug transporter knockout metabolomics, cheminformatics, and machine learning to define molecular properties of endogenous ligands, can be applied to many other SLC and ABC drug transporters. This will not only help define the physiological roles of drug transporters but should also provide a new basis for metabolite-based drug design, tissue targeting of drugs, and analyzing drug-metabolite interactions. This is supported by our evaluation of the ability of the metabolite-based machine-learning model to predict OAT1- or OAT3-selective drugs.

Results
Overview of strategy (Fig. 1)
The goal of this study was to identify the molecular features/characteristics determining ligand-substrate interaction with either OAT1 or OAT3; thus, we focused our attention on those metabolites that accumulate in the plasma of one of the knockouts but not the other (i.e. increased in the plasma of the Oat1KO but not the Oat3KO, or vice versa). In this way, 138 metabolites that appear to be unique for the Oat1KO (90 metabolites) or unique for the Oat3KO (48 metabolites) were identified (Figs. 1 and 2; Table S1). Although some overlap exists, OAT1-unique and OAT3-unique metabolites were unexpectedly found to have relatively unique effects on some biochemical pathways (Fig. S1). For example, ~8% of the unique OAT1 metabolites were found to be components of the pathway involved in the metabolism of γ-glutamyl amino acids (although none of the unique OAT3 metabolites were involved in this pathway), whereas the metabolism of the primary and secondary bile acids comprises a substantial fraction of the OAT3-unique metabolites (although none of the unique OAT1 metabolites were involved in primary or secondary bile acid metabolism) (Fig. S1). Other OAT1-unique pathways included those for the degradation of valine, leucine, and isoleucine, as well as the metabolism of acylcarnitine and acylglycine fatty acids, which were found to be exclusive to OAT1-unique metabolites (Fig. S1). Other OAT3-unique pathways included the metabolism of arginine and proline (Fig. S1).
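The selection rule described above amounts to a set difference between the two lists of elevated metabolites. The following is a minimal sketch, assuming each knockout's elevated metabolites are available as a list of identifiers. The function name and the placeholder identifiers `m1` to `m4` are hypothetical; the real calls depend on the criteria given under "Experimental procedures".

```python
def unique_metabolites(elevated_oat1ko, elevated_oat3ko):
    """Split elevated metabolites into knockout-unique and shared groups."""
    oat1 = set(elevated_oat1ko)
    oat3 = set(elevated_oat3ko)
    return {
        "Oat1KO_unique": oat1 - oat3,  # elevated only in the Oat1KO
        "Oat3KO_unique": oat3 - oat1,  # elevated only in the Oat3KO
        "shared": oat1 & oat3,         # elevated in both knockouts
    }


# Placeholder identifiers, not real metabolite calls.
demo = unique_metabolites(["m1", "m2", "m3"], ["m3", "m4"])
```

Only the two "unique" groups would feed the classification analysis; the shared group is set aside.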

Cheminformatics approach to defining molecular properties of Oat1KO-unique metabolites and Oat3KO-unique metabolites
To define the molecular properties of the metabolites that would help distinguish unique metabolites of Oat1KO from those of Oat3KO, more than 60 physicochemical molecular properties (Fig. 1; Table S2) were collected or calculated and then analyzed with a focus on their potential utility in interpreting the results from a biochemical and physiological perspective, for example, in developing a biologically interpretable Decision Tree or other type of classification of the relevant molecular properties.
The goal of these initial analyses was to narrow down this extensive list of molecular properties to a more manageable number that could be: 1) readily understood in the context of metabolite structures, and 2) interpreted in the context of known transporter biology, and which were, individually or in combination, capable of helping to distinguish Oat1KO unique metabolites from those unique to the Oat3KO. Molecular properties that seemed of particular interest from this viewpoint were visualized in multiple ways (e.g. scatterplot, histograms, distribution plots, and violin plots) (Fig. 3). For example, the distribution plot for the number of rings indicated that the Oat3KO-specific metabolites were greatly over-represented among the molecules with more than two rings (Fig. 3 and Table 1). Although taurocholate is one of the classical transported substrates of OAT3 (43), and bile acids had among the greatest fold-changes among all metabolites in the Oat3KO, it is possible they may be overemphasized here due to over-representation on the targeted metabolomics platform and also due to the criteria used to identify unique metabolites (see "Experimental procedures").

Importance of a subset of molecular properties based on statistical and information metrics
Molecular properties were ranked according to information gain and contribution to model performance. The results indicated the likely high importance of such molecular properties as the number of rings (nof_Rings), the number of chiral centers (nof_Chirals), and the complexity in separating Oat1KO unique metabolites from Oat3KO unique metabolites present in this dataset (Fig. 4). Nevertheless, in different measures (e.g. FreeViz) (44), other molecular properties such as polar surface area (PSA) over the molecular area (PSA/area) appeared important (Fig. 4). The inclusion of such features turned out to be important for machine-learning methods like random forests and decision trees (see below).
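To make the ranking criterion concrete, the sketch below computes the information gain of one numeric property for the OAT1 versus OAT3 label. It assumes a single user-chosen threshold, which is a simplification; Orange applies its own discretization, and the function names and toy values here are hypothetical and synthetic.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting the labels at `values <= threshold`."""
    low = [y for x, y in zip(values, labels) if x <= threshold]
    high = [y for x, y in zip(values, labels) if x > threshold]
    total = len(labels)
    conditional = sum(len(part) / total * entropy(part) for part in (low, high) if part)
    return entropy(labels) - conditional


# Synthetic example: a ring count that separates the classes, and one that does not.
labels = ["OAT1"] * 4 + ["OAT3"] * 4
informative = [0, 0, 1, 1, 3, 4, 4, 5]
uninformative = [0, 3, 0, 3, 0, 3, 0, 3]
gain_informative = information_gain(informative, labels, threshold=2)
gain_uninformative = information_gain(uninformative, labels, threshold=2)
```

In this toy case the informative property recovers the full 1 bit of class entropy and the uninformative one recovers none, which is the contrast a normalized information-gain ranking is meant to expose.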
Although the relative importance changed in the iterative machine-learning process that followed (described below), at this point in the analyses the key molecular properties included the following: 1) the number of chiral centers (nof_Chirals); 2) the number of rings (nof_Rings); 3) the polar surface area relative to the total area (PSA/area); 4) an index of molecular complexity (Complexity); 5) a measure of molecular size (e.g. molArea, molVolume); 6) a measure of charge; and 7) the number of carbons with sp3 hybridization (c_sp3) (which is sometimes considered indicative of a molecule's three-dimensionality). By principal components analysis, three principal components were found to contribute 80% or more of the variance (Fig. 4). Later, other molecular properties not included in this original set of seven features, many of which are associated with modifications catalyzed by drug-metabolizing enzymes, were also found to be important for classification and substituted for several of the aforementioned molecular properties. These are discussed below.
Figure 1. Schematic of workflow for "logical" identification of molecular properties for discrimination of unique OAT1 and OAT3 metabolites. A, organic anions (OAs) ultimately excreted by the kidney are cleared from the blood by the SLC transporters, OAT1 and OAT3, located in the basolateral membranes of proximal tubule cells of the kidney. Deletion of either of these transporters results in the plasma accumulation of OAs. Serum, obtained from WT and OatKOs, was subjected to untargeted, global LC-MS metabolomics analyses. B, resulting metabolomics data were used to identify Oat1 and Oat3 metabolites uniquely accumulating in each knockout mouse. Cheminformatics methods were used to identify over 60 molecular properties/features of the metabolites. C, data visualization and statistical analysis in Orange and Python libraries (Pandas, Matplotlib, Seaborn, and SQLAlchemy) of over 60 molecular properties for Oat1 and Oat3 unique metabolites were used to logically narrow down to a set of seven molecular properties to be used for machine-learning approaches to identify a set of features that classifies metabolites as uniquely Oat1 or uniquely Oat3.

Machine-learning analysis
Supervised machine-learning methods can be used to identify features that are important for classification. In this case, the cheminformatics analysis of metabolites uniquely accumulating in the Oat1KO and the Oat3KO initially resulted in >60 molecular properties (features) for each metabolite. Through the various visualizations and metrics already described, this set of features was "logically" narrowed down to <10 features, sets of which were then analyzed using the machine-learning methods described below to arrive at a set of seven features that were capable of classifying a metabolite as either OAT1 or OAT3 with about 75-80% accuracy.
Within the Orange environment, a number of machine-learning approaches were applied, including Decision Tree, Random Forest, nearest neighbor classifiers, Naive Bayes, as well as Logistic Regression and Neural Network models (Fig. 5; Table 2). Confusion Matrices were created (Fig. S2), and miscategorized instances were evaluated. Some of the machine-learning results were also "double-checked" by direct coding using the Python Machine Learning Library SciKit-Learn (44-48).
In Table 2, the classification accuracy, area under the curve (AUC), and other scores from the various Orange machine-learning models (leave-one-out method) are shown for one of the highest-scoring sets of seven molecular properties: nof_Rings; nof_Chirals; PSA/Area; nof_Sulfates; nof_OH; C_RO; and Complexity. Among the 11 metabolites misclassified by one Random Forest classification run (using the seven final features), two had been flagged in exploratory data analysis with the larger set of features as outliers by the Orange SVM-based Outlier widgets (Table S3). The confusion matrices for several machine-learning models also revealed different sets of misclassified metabolites, despite often having similar scores on classification metrics, and those misclassified sets of metabolites had a somewhat different overlap with outliers. Thus, although we have mainly focused on Random Forest classification in what follows, these considerations emphasize the need to examine the Confusion Matrix for each model.
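The leave-one-out scoring and the confusion-matrix check described above can be sketched as follows. To stay dependency-free, this example swaps in a 1-nearest-neighbor classifier rather than a Random Forest, so it illustrates the evaluation procedure and not the model reported in Table 2. The function names are hypothetical, and the two-feature rows (a ring count and a chiral-center count) are synthetic.

```python
from collections import Counter


def nearest_label(training, query):
    """Label of the training row closest to `query` (squared Euclidean distance)."""
    closest = min(training, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)))
    return closest[1]


def leave_one_out(data):
    """Hold out each instance in turn; return accuracy and a confusion Counter."""
    confusion = Counter()
    for i, (features, truth) in enumerate(data):
        training = data[:i] + data[i + 1:]
        confusion[(truth, nearest_label(training, features))] += 1
    correct = sum(n for (truth, predicted), n in confusion.items() if truth == predicted)
    return correct / len(data), confusion


# Synthetic (nof_Rings, nof_Chirals) rows; the last OAT3 row sits near the OAT1 cluster.
data = [
    ((0, 0), "OAT1"), ((1, 0), "OAT1"), ((0, 1), "OAT1"), ((1, 1), "OAT1"),
    ((4, 6), "OAT3"), ((5, 7), "OAT3"), ((4, 8), "OAT3"), ((3, 2), "OAT3"),
]
accuracy, confusion = leave_one_out(data)
```

Here the accuracy alone (7 of 8 correct) hides which instance failed; the `("OAT3", "OAT1")` cell of the confusion counter shows that the error is an OAT3 row predicted as OAT1, which is the kind of detail the text recommends inspecting for every model.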
Figure 3 (legend, continued). From these plots, one can see that OAT1-unique metabolites are generally smaller molecules with less complexity, fewer chiral centers, and fewer ringed structures, whereas the OAT3-unique metabolites generally are larger and more complex molecules with more chiral centers and more ringed structures. See Table S1 for the full list of metabolites and Table S2 for the full list of molecular properties.
Table 1. List of metabolites with 3 or more rings (indicating a preponderance of unique Oat3 metabolites)
Although Random Forest classification generally gave the best score (>0.8 classification accuracy), Decision Tree classification often came close (Table 2; Fig. S3), and it was generally the easiest to interpret in terms of those molecular properties favoring classification of unique Oat1KO metabolites versus unique Oat3KO metabolites. As expected, with sampling (i.e. different sets of 48 out of 90 OAT1 instances), different trees were generated, some of which were complicated. One of the more straightforward yet representative trees is shown in Fig. 6.
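The sampling mentioned above (different sets of 48 of the 90 OAT1 instances, matched to the 48 OAT3 instances) can be sketched as a seeded random draw. This is an illustration under assumptions: the function name and the placeholder identifiers are hypothetical, and the actual sampling was done within the tools described in the text.

```python
import random


def balanced_subsample(majority, minority, seed):
    """Draw as many majority-class rows as there are minority-class rows."""
    rng = random.Random(seed)  # seeded so that each draw is reproducible
    return rng.sample(majority, len(minority)) + list(minority)


# Placeholder instances standing in for the 90 OAT1 and 48 OAT3 unique metabolites.
oat1 = [(f"oat1_{i}", "OAT1") for i in range(90)]
oat3 = [(f"oat3_{i}", "OAT3") for i in range(48)]
training_sets = [balanced_subsample(oat1, oat3, seed) for seed in range(3)]
```

Each training set holds 96 instances, and because the OAT1 half differs between draws, a tree fitted to each set can split on different properties, consistent with the variety of trees noted above.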

Importance of including phase I and phase II modifications by drug-metabolizing enzymes as molecular properties
Many of the metabolites accumulating in the two knockouts included presumptive modifications by phase I and phase II drug-metabolizing enzymes (DMEs), as well as other enzymes involved in sulfation, hydroxylation, acetylation, GSH conjugation, and phosphorylation (49, 50). Phase I DMEs, which include members of the cytochrome P450 family, are important for redox modifications and hydroxylation. Phase II DMEs include acetyltransferases, glucuronosyltransferases, sulfotransferases, and other enzymes. Although these phase I and phase II DMEs are generally viewed as important in drug inactivation and/or making drugs more soluble (by adding polar groups) and thus more easily eliminated into the bile or urine, they also create many active endogenous metabolites and signaling molecules (49, 50). For example, the phase II sulfation of indoxyl creates indoxyl sulfate, which can bind nuclear receptors and modulate kinases as well as functioning as an endogenous toxin (9, 29, 51). Hence, evaluating these DME modifications has the potential to yield additional biological relevance to the analysis. Although considering these molecular properties only slightly improved the classification accuracy over what is described above, the numbers of hydroxyl groups (nof_OH) and sulfates (nof_SO3H) were important for some models; this was consistent with their importance in measures of information gain and in FreeViz analysis (Fig. 7). Other modifications appeared less important based on data visualizations, as well as statistical and information metrics, and were not considered further.

Comparison with a "brute force" Random Forest approach to identifying key molecular properties (features) for OAT1 versus OAT3 classification
Figure 4. A, bar graph of the ranking of molecular properties according to information gain in order to tease out importance of some molecular properties. The information gain for each molecular feature was normalized to the feature displaying the greatest gain (nof_Rings). B, principal component analysis reveals that first three components account for ~90% of the variance. C, FreeViz visualization of various molecular properties and their importance in separating out OAT1 versus OAT3. In the FreeViz graphical representation, the magnitude of each vector indicates the relative importance of each molecular feature as determined by the FreeViz algorithm, whereas the direction of each vector indicates the relative preference of that feature for OAT1 (blue background) or OAT3 (red background). In the representation, the blue circles depict the OAT1-unique metabolites, and the red crosses depict the OAT3-unique metabolites. The size of each symbol corresponds to the number of rings. For example, the large red crosses in the upper right-hand portion of the representation are OAT3-unique metabolites that have a large number of rings.
Using the Python machine-learning package SciKit-Learn (Fig. 5) (47), a brute force approach was used to identify seven molecular properties that, as features in the machine-learning analysis for classification by Random Forests, gave the highest classification accuracies (Fig. 8). As with the more "logical approach" using measures of feature importance and other techniques already described, multiple combinations of seven features gave comparably high classification accuracies (Fig. 8). (Using the Decision Tree algorithm in SciKit-Learn, the classification accuracy was generally slightly lower than Random Forests for the same features (data not shown).) Based on this brute force analysis, two points are to be emphasized. 1) Nearly all the features chosen using the aforementioned logical approach (nof_Rings, nof_Chirals, PSA/Area, nof_SO3H, nof_OH, C_R0, and Complexity) could be found consistently, with Complexity appearing least often in the brute-force approach. 2) The aforementioned set of logically-chosen features (nof_Rings, nof_Chirals, PSA/Area, nof_SO3H, nof_OH, C_R0, and Complexity) scored nearly as well (within ~2% classification accuracy) as the highest-scoring random combinations of seven features (Fig. 8). This lends high confidence that the set of seven features finally settled on logically (a set that captures multiple molecular properties known to be functionally important for OAT1 versus OAT3 interaction in experimental in vivo and in vitro biological systems and is thus more biologically interpretable) is about as good in terms of classification accuracy as the best-scoring set of seven features when a very large number of combinations are tested.
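The brute-force idea described above, scoring every candidate feature subset and ranking the subsets by accuracy, can be sketched on a toy scale. Several assumptions apply: the scorer is a leave-one-out 1-nearest-neighbor classifier rather than the SciKit-Learn Random Forest used in the study, subsets have size 2 rather than 7, the subsets are enumerated exhaustively (with more than 60 properties one would instead sample random combinations), and the table, the `noise_a`/`noise_b` columns, and the function names are synthetic and hypothetical.

```python
from itertools import combinations


def loo_accuracy(rows, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    hits = 0
    for i, query in enumerate(rows):
        others = [j for j in range(len(rows)) if j != i]
        nearest = min(others, key=lambda j: sum((a - b) ** 2 for a, b in zip(rows[j], query)))
        hits += labels[nearest] == labels[i]
    return hits / len(rows)


def rank_feature_subsets(table, labels, names, k):
    """Score every k-feature subset and return (subset, accuracy) pairs, best first."""
    scores = {}
    for subset in combinations(names, k):
        columns = [names.index(name) for name in subset]
        rows = [[row[c] for c in columns] for row in table]
        scores[subset] = loo_accuracy(rows, labels)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


names = ["nof_Rings", "nof_Chirals", "noise_a", "noise_b"]
table = [
    [0, 0, 5, 1], [1, 1, 2, 7], [0, 1, 9, 3], [1, 0, 4, 8],  # synthetic OAT1 rows
    [4, 6, 5, 2], [5, 7, 1, 7], [4, 8, 9, 4], [5, 6, 3, 9],  # synthetic OAT3 rows
]
labels = ["OAT1"] * 4 + ["OAT3"] * 4
ranking = rank_feature_subsets(table, labels, names, k=2)
```

In this toy table the structural pair (`nof_Rings`, `nof_Chirals`) classifies every held-out row correctly, whereas the pair of noise columns classifies none, so a logically chosen subset and the top of the brute-force ranking coincide, mirroring the comparison made in the text.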

High-performing OAT1 versus OAT3 Random Forest classification model predicts OAT1- versus OAT3-interacting drugs with high accuracy
There has been much discussion in the literature about the relationship between small organic molecule drugs and metabolites (9, 32, 52, 53). Thus, we sought to determine the effectiveness of the Random Forest model (using the set of seven molecular properties chosen using the logical approach), which classified unique OAT1 versus OAT3 metabolites with >80% accuracy, for the classification of drugs known to preferentially interact with OAT1 or OAT3 (Fig. 9). We therefore initially analyzed a set of 55 such drugs, 21 of which are known to preferentially interact by >2-fold with OAT1 in cell-based in vitro assays and 34 of which are known to interact with OAT3.
We hypothesized that the Random Forest model, trained using the unique metabolite dataset, would predict OAT1- versus OAT3-interacting drugs significantly better than random. The ability of the (metabolite-based) model to correctly classify OAT1 drugs was only around 55-60%, whereas the ability of the model to correctly classify OAT3 drugs as such was ~70%.
However, when we examined which drugs were misclassified, we found that they were mainly ones that had a >2-fold but less than 5-fold preferential affinity for either OAT1 or OAT3 (Table S4). When we considered only drugs with a ≥5-fold preferential affinity for either OAT1 or OAT3, the data set was reduced to 33 instances, but importantly, the predictions of the unique metabolite-based Random Forest or Decision Tree models improved considerably; they were 75% accurate for unique OAT1 drugs and >80% accurate for unique OAT3 drugs (Fig. 9).
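The effect of tightening the selectivity cutoff can be sketched as a filter followed by a per-class accuracy count. The record layout, the function name, and the six placeholder drugs below are hypothetical and synthetic; they are not the drugs of Table S4 and do not reproduce the reported percentages.

```python
def per_class_accuracy(records, min_fold):
    """Keep drugs at or above a fold-preference cutoff and score each class separately."""
    kept = [r for r in records if r["fold"] >= min_fold]
    accuracy = {}
    for cls in sorted({r["true"] for r in kept}):
        rows = [r for r in kept if r["true"] == cls]
        accuracy[cls] = sum(r["pred"] == cls for r in rows) / len(rows)
    return accuracy, len(kept)


# Synthetic records: true transporter, model prediction, and fold preference.
records = [
    {"drug": "drug_a", "true": "OAT1", "pred": "OAT1", "fold": 8.0},
    {"drug": "drug_b", "true": "OAT1", "pred": "OAT3", "fold": 2.5},
    {"drug": "drug_c", "true": "OAT3", "pred": "OAT3", "fold": 6.0},
    {"drug": "drug_d", "true": "OAT3", "pred": "OAT1", "fold": 3.0},
    {"drug": "drug_e", "true": "OAT3", "pred": "OAT3", "fold": 12.0},
    {"drug": "drug_f", "true": "OAT1", "pred": "OAT1", "fold": 3.0},
]
all_drugs, n_all = per_class_accuracy(records, min_fold=2.0)
selective, n_selective = per_class_accuracy(records, min_fold=5.0)
```

In this toy set the misclassified drugs are the weakly selective ones, so raising the cutoff shrinks the set while raising per-class accuracy, the same trade-off between sample size and label quality described above.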

Discussion
Our results are of both physiological and pharmaceutical interest. OAT1 and OAT3 are well-established as in vitro drug and toxin transporters (5, 42). Some of these in vitro findings have been supported in vivo in the Oat1KO and the Oat3KO mice or in ex vivo assays utilizing knockout tissues. For example, the role of one or both of the OATs has been demonstrated in the handling of diuretics, β-lactam antibiotics, and antivirals, as well as environmental toxins such as organic mercury conjugates and aristolochic acid (23, 54-62). In addition, a growing number of studies have pointed to the physiological roles of OAT1 and OAT3, including urate homeostasis, creatinine, endogenous signaling and metabolism, uremic toxin and uremic solute handling, and blood pressure regulation (14, 27, 28, 34, 36, 39-41, 54, 58, 63-69).
It is often held that OAT1 and OAT3 have largely overlapping substrates, and thus the two transporters have been sometimes suggested to be "redundant" when expressed in the same cell type, such as the kidney proximal tubule cell or the choroid plexus cell (5, 9). This idea is mostly based on in vitro assays of binding or transport in cells overexpressing OAT1 and OAT3; furthermore, most of the substrates tested have been drugs (7). However, our extensive metabolomics analysis of the Oat1KO and Oat3KO mice has revealed very different sets of metabolites, i.e. aerobic metabolism intermediates, signaling molecules, vitamins, gut microbiome products, fatty acids, bile acids, and odorants, accumulating in the plasma of the two knockout mice due to the lack of renal OAT-mediated transport from the blood to the urine (28, 34, 36, 39, 41, 43, 63). In our study of metabolites uniquely accumulating in vivo due to the inability to produce functional OAT1 or OAT3 transporter, differences were initially analyzed by statistical and informational metrics and then using machine-learning tools. A previous study that included a few of the approaches used here (although with machine-learning algorithms not based in the Python SciKit-Learn package) compared the unique FDA-approved drug substrates of OAT1 and OAT3 (33). Consistent with prevailing "textbook" views, this previous study revealed considerable similarity in the molecular features (properties) of pharmaceutical drugs interacting with OAT1 and OAT3, although one molecular property was found to separate OAT1- from OAT3-interacting drugs: a greater propensity for OAT3 to interact with cationic drugs over OAT1, a fact supported by in vitro analysis of certain drugs (33).
However, it is possible that in vitro studies do not fully reflect in vivo reality. It is currently not feasible to analyze the behaviors of hundreds of drugs in the Oat1KO and Oat3KO animals, and in many cases, the behavior of FDA-approved drugs may not reflect the in vivo behaviors of metabolites, signaling molecules, antioxidants, and other known endogenous ligands of OAT1 and OAT3. Moreover, the OAT-transported FDA-approved drug dataset may over-represent certain classes of medically-effective molecules (e.g. NSAIDs) and/or be nonrepresentative of the relevant chemical space because of factors related to drug design and patenting, the approval process, and commercial issues.
In contrast, the use of metabolomics data offers a comparatively-efficient approach to the problem without the aforementioned biases that might accompany drug datasets. Because analysis of the endogenous substrates/ligands should allow one to identify the "true" molecular features driving ligand interaction for OAT1 or OAT3, and metabolites that accumulate in the plasma of mice bearing deletions in either OAT1 or OAT3 likely represent endogenous substrates/ligands of the transporters (unlike molecules in common drug datasets), the utilization and analysis of in vivo metabolomics data appear to be a reasonable approach to characterize the unique molecular properties/features driving in vivo interaction of the metabolites with the two transporters. Nevertheless, to obtain a large number of relatively unique metabolites to perform a machine-learning analysis, we had to combine results from several studies; these targeted and untargeted studies were done at different times, using different platforms or methods, and calls were sometimes made based on different criteria. Although this is not ideal, some of these issues are unavoidable with drug transport data as well, where the pooled data come from multiple labs using somewhat different methods.
Figure 7. Importance of phase I and phase II modifications by drug-metabolizing enzymes as molecular properties. Bar graph showing various DME modifications ranked according to information gain (normalized to the DME modification displaying the greatest information gain (nof_OH)). The number of hydroxyls (nof_OH) and number of sulfates (nof_SO3H) are clearly the two highest-ranked attributes contributing to the classification.
Nevertheless, we note that among the Oat1KO and Oat3KO unique metabolites, many molecules of similar structure to these in vivo molecules have been tested in in vitro binding assays or transport assays (7,42). For example, taurocholate is considered a prototypical in vitro-transported substrate of OAT3.

Machine-learning considerations and potential pharmaceutical relevance of findings
As we have shown, our combined cheminformatics and machine-learning approach gave very good classification accuracy for unique Oat1KO metabolites versus unique Oat3KO metabolites. We were able to ultimately identify a limited set of seven critical molecular properties/features that proved quite useful in classifying in vivo endogenous metabolites as substrates/ligands of OAT1 or OAT3 with high accuracy.
Figure 9. Prediction of unique OAT1 and OAT3 drugs. A, ability of a strong machine-learning model for classification of unique Oat1 and unique Oat3 metabolites to predict unique Oat1 and unique Oat3 small molecule drugs was analyzed. B-D, seven molecular features that were found to perform well in the classification of metabolites as either OAT1-unique or OAT3-unique were tested for their ability to correctly classify drugs known to interact with either OAT1 or OAT3 with reasonable specificity in random forest and decision tree approaches. Random forest (B) and decision tree (C) classification models were used to predict OAT1 and OAT3 drugs with ≥5-fold (B and C) or ≤2-fold (D) preferential affinities for OAT1 or OAT3. Consideration of drugs with a ≥5-fold preferential affinity for either OAT1 or OAT3 resulted in predictions of the unique metabolite-based random forest or decision tree models with ≥75% accuracy for unique OAT1 and OAT3 drugs, whereas ≤2-fold preferential affinity had around 50% accuracy.
Statistical and informational metrics, as well as data visualizations, allowed us to reduce the >60 molecular properties to <10 molecular properties that seem to be key in determining the transporter specificity of the ligand. For example, it was possible to reduce the number of molecular properties because of high correlations (e.g. r > 0.9) with other molecular properties. Our goal was to reduce the set of features to 10% or less of the number of total instances. Thus, from a set of several highly-correlated molecular properties, generally only one was chosen.
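The correlation-based reduction described above (keeping only one property from each highly correlated group) can be sketched with a greedy filter. This is one plausible reading of the procedure, not the authors' exact method: the first-seen property of a correlated group is the one retained, the function names are hypothetical, the property values are synthetic, and every column is assumed to be non-constant so that the correlation is defined.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def prune_correlated(features, threshold=0.9):
    """Keep a property only if |r| with every already-kept property is <= threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Synthetic values: molVolume tracks molArea closely, nof_OH does not.
features = {
    "molArea": [1, 2, 3, 4, 5],
    "molVolume": [2, 4, 6, 8, 10.5],
    "nof_OH": [1, 0, 2, 0, 1],
}
kept = prune_correlated(features)
```

Because the outcome of a greedy filter depends on the order in which properties are visited, listing the most interpretable properties first is one way to bias the result toward a biologically readable feature set.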

In the process of defining the key features, we also found that simple feature engineering, such as the use of PSA/molecular area instead of PSA, often led to improved classification accuracy using the machine-learning tools. Database queries were sometimes helpful for examining the biochemical nature of metabolites that were being captured by one or more molecular properties. For example, a search for molecules with higher numbers of rings yielded primarily Oat3KO metabolites; among these were flavonoids, bile acids, or conjugated sex steroids (Table 1).
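The feature-engineering step mentioned above is a simple normalization: dividing the polar surface area by the total molecular area yields a size-independent measure of polarity. A minimal sketch, assuming each metabolite is a dictionary holding `PSA` and `molArea` in the same units; the function name, the record layout, and the numbers are hypothetical.

```python
def add_psa_over_area(record):
    """Return a copy of the record with the engineered PSA/Area ratio added."""
    if record["molArea"] <= 0:
        raise ValueError("molArea must be positive")
    enriched = dict(record)  # leave the input record unchanged
    enriched["PSA/Area"] = record["PSA"] / record["molArea"]
    return enriched


# Synthetic values for illustration only.
example = add_psa_over_area({"name": "example_metabolite", "PSA": 90.0, "molArea": 300.0})
```

A small, highly polar molecule and a large molecule carrying the same polar groups share a similar raw PSA but differ in this ratio, which may explain why the ratio separated the two classes better than PSA alone.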
The combined data analysis indicated that Oat3KO metabolites had more rings, more chiral centers, and greater complexity than Oat1KO metabolites (Fig. 3; Table 1). Similar analysis of the Oat1KO metabolites supports the view that Oat1 substrates have a higher PSA/area and more carbons outside of ring structures. Furthermore, we evaluated the importance of phase I and phase II DME modifications of metabolites as separate features (e.g. hydroxyl groups and sulfate groups) for machine-learning classification and found that, at least in certain machine-learning models, one or both of these molecular properties affect classification accuracy; this provides additional biological context. Thus, at least for Oat1 and Oat3, the endogenous substrates should not be considered only from the viewpoint of their interaction with transporters but also in the context of modifications by DMEs.
Decision Tree classification and k-nearest neighbor classification were often nearly as good as Random Forest classification. Moreover, an independent brute-force approach was also used to identify sets of seven molecular properties that could be used to classify the metabolites according to their preferred transporter. The classification accuracies of the best sets (out of >10^5 analyzed) were similar to those obtained with the logically selected molecular properties as described above. Together, the results support the view that metabolites uniquely accumulating in vivo in the Oat1 and Oat3 knockouts can, indeed, be separated using a limited set of relatively independent molecular properties that seems to capture the diversity of unique metabolites accumulating in the two Oat knockouts.
We also asked whether the Random Forest classification model predicting unique OAT1 and OAT3 metabolites would be useful for classification of drugs known to preferentially interact with OAT1 and OAT3. There is a body of literature supporting the view that drugs might act like metabolites with similar structural features during transport (9, 32-34, 52, 53). In the case of the OATs, which, to our knowledge, are the drug transporters for which the most in vivo metabolite data are currently available, this appears to be the case, at least for relatively unique drugs known to interact with either OAT1 or OAT3 with a 5-fold or greater difference in affinity measures. For this set of drugs, ≥75% were correctly classified as either OAT1 or OAT3. Nevertheless, the number of drugs that fulfilled this criterion was only 33, or roughly one-third the size of the unique metabolite data set used to generate the Random Forest classification model. Thus, although the ratio of features to instances in the unique metabolite-based model was 1:14, the ratio of features to instances for the drug predictions was 1:5. It is therefore possible that the predictions are somewhat overdetermined by the seven feature-based metabolite model. That said, with a set of drugs that had a less than 2-fold difference in affinity between OAT1 and OAT3, indicating a similar selectivity for OAT1 and OAT3, the model was unable to classify them, essentially making random calls. These results are what would be expected for a good model. Additional confidence comes from the fact that the Random Forest classification model that was successful in making drug calls was built from unique in vivo metabolites accumulating in knockout mice, whereas the unique OAT1 or unique OAT3 drugs were identified based on in vitro transport assays. Of course, the metabolites are endogenous molecules, whereas the drugs are generally FDA-approved chemical compounds produced by pharmaceutical companies. So, the metabolite and drug datasets are independent from more than one viewpoint.
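The feature-to-instance ratios quoted above can be checked with a few lines of arithmetic. The sketch below assumes that the metabolite model used the 96 metabolites (48 unique Oat1KO and 48 unique Oat3KO) described under "Experimental procedures" and the 33 drugs mentioned in the text; the variable names are illustrative only.

```python
# Worked arithmetic for the feature-to-instance ratios (illustrative names).
n_features = 7        # molecular properties in the metabolite-based model
n_metabolites = 96    # assumed: 48 unique Oat1KO + 48 unique Oat3KO metabolites
n_drugs = 33          # drugs with a 5-fold or greater affinity difference

ratios = {}
for label, n_instances in [("metabolites", n_metabolites), ("drugs", n_drugs)]:
    per_feature = n_instances / n_features
    ratios[label] = round(per_feature)
    print(f"{label}: about {per_feature:.1f} instances per feature, i.e. 1:{ratios[label]}")
```

With these counts the metabolite set gives roughly 13.7 instances per feature (1:14) and the drug set roughly 4.7 (1:5), matching the ratios stated in the text.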
Identification of the similarities and differences in molecular properties between unique Oat1KO metabolites and Oat3KO metabolites can be helpful, particularly in light of tissue expression patterns, for predicting the likely route of distribution or elimination for a new compound, especially if it is similar in structure to metabolites. Furthermore, phase I and/or phase II DME modifications (i.e. hydroxylation and sulfation) appeared important in determining whether a particular endogenous molecule is handled by OAT1 or OAT3; this kind of information may be helpful in determining the fate of drugs (OAT1 versus OAT3 elimination) that undergo DME modifications. It follows that this type of analysis should help with the design of drugs that, because they have properties similar to particular sets of endogenous metabolites, can be targeted to various tissues through the drug transporters expressed there or directed toward elimination via a specific transporter.

Physiological relevance
Although more optimal drug design, tissue targeting of drugs, and a better understanding of drug-metabolite interactions are obvious practical applications of our results, the preceding extensive discussion on the machine-learning approach and the pharmaceutical relevance of the results does not take away from the potential physiological importance of our work.
The most significant biological result of our study, in the context of previously published knockout metabolomics and metabolic reconstructions of OAT1- or OAT3-regulated biochemical pathways (9, 42), is the general debunking of the view that OAT1 and OAT3 are redundant when expressed in the same tissue. This is clearly not true from a physiological standpoint and suggests that differential expression of OAT1 and OAT3 in tissues like the pancreas, retina, and blood-brain barrier, and also during kidney development and regeneration, means different regulation of local and/or systemic metabolism and signaling. Our results connect physicochemical features of endogenous metabolites uniquely changed in vivo in the OAT1 versus OAT3 knockouts to unique metabolic and signaling pathways regulating specific sets of fatty acids, bile acids, amino acids, and peptides.
The data presented here, based on in vivo metabolomics data, provide clear evidence to support the view that while there is overlap in the types of (bio)chemical structures interacting with OAT1 and OAT3, that overlap is not as great as believed previously; OAT1 and OAT3 handle distinct sets of metabolites involved in different biochemical and signaling pathways. These results, taken together with the other aforementioned studies, challenge existing summaries of the roles of OAT1 and OAT3 presented in numerous pharmacology and toxicology textbooks and review articles. We have been able to leverage the fact that hundreds of metabolites have been found to accumulate in the two Oat knockouts, presumably reflecting both unique, as well as common, physiological roles of OAT1 and OAT3. Assuming similar evidence regarding the true endogenous physiological substrates is found for other multispecific ABC and SLC transporters still generally thought to primarily function as transporters of drugs, it could cumulatively call into question the very notion of drug transporters (9).

Animals
All experimental protocols were approved by the University of California San Diego Institutional Animal Care and Use Committee (IACUC), and the animals were handled in accordance with the Institutional Guidelines on the Use of Live Animals for Research. Adult (n = 5) WT and Oat1KO male mice were housed separately under a 12-h light/dark cycle and were provided access to food (standard diet) and water ad libitum.

Metabolomic analysis, compound identification, quantification, data curation, and statistics for the Oat1KO
Individual, unpooled serum samples obtained from adult male wildtype (WT; control) and Oat1KO mice were stored at −80°C and shipped on dry ice to Metabolon (Durham, NC) for preparation and metabolomic profiling analysis as described previously (28, 63).
In this analysis of the Oat1KO, a total of 731 metabolites of known identity were detected and identified utilizing a reference library of chemical standard entries using software developed at Metabolon, as described previously. Quantification and statistical analyses were performed by Metabolon as described previously (36,63).

Data collection from previous metabolomics analyses
Considerable targeted and untargeted metabolomics data from the Oat1KO and Oat3KO mice have been previously published by us (28, 34, 36, 39, 54, 63, 70). Although these previous studies utilized different platforms, different reference libraries, and slightly different protocols (e.g. the Oat3KO metabolomics analyses included data from both male and female knockout mice), for the machine-learning analyses performed here, the available data from the previous and current metabolomics studies were combined to enable relative quantification of over 500 metabolites with known identities (28, 36, 63, 71).
In this case, each metabolite was compared across the various platforms employed, and only those metabolites present on all of the various platforms were considered. From this list of common metabolites, those unique to a particular Oat knockout were initially defined as those displaying significant (p < 0.05) increases in plasma concentration in one or the other of the knockout mice (e.g. significantly increased in the plasma of the Oat1KO, but not in the Oat3KO). In this way, a much larger number of unique metabolites (~90) were present in the Oat1KO at a significance of p < 0.05, whereas a lower number were uniquely present in the Oat3KO at this level of significance. Therefore, to augment the number of unique Oat3KO-interacting metabolites, we also included those compounds displaying accumulation in the serum of the Oat3KO (compared with the WT), but not the Oat1KO, with a significance of 0.05 < p < 0.1 (trending toward significance), as well as those with a particularly high fold-change that failed to meet the significance criteria apparently because of an anomalous value; this resulted in a set of 48 unique OAT3-interacting metabolites.
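The selection rule described above can be sketched as simple filtering logic. This is only an illustration under stated assumptions: the function name, the `(fold-change, p-value)` data layout, and the toy values are invented here, and the manual inclusion of high fold-change metabolites with an anomalous value is not modeled.

```python
def unique_increases(ko, other, alpha=0.05, other_alpha=0.05):
    """Return metabolites increased in `ko` (p < alpha) but not in `other`.

    `ko` and `other` map a metabolite name to (fold-change vs WT, p-value).
    This layout is an assumption made for illustration.
    """
    def increased(stats, name, cutoff):
        fold, p = stats.get(name, (1.0, 1.0))
        return fold > 1.0 and p < cutoff

    return sorted(
        name for name in ko
        if increased(ko, name, alpha) and not increased(other, name, other_alpha)
    )

# Invented toy values, not data from the study.
oat1ko = {"metabolite_a": (3.1, 0.01), "metabolite_b": (2.0, 0.03), "metabolite_c": (1.1, 0.60)}
oat3ko = {"metabolite_a": (1.0, 0.90), "metabolite_b": (2.4, 0.02), "metabolite_c": (1.9, 0.08)}

oat1_unique = unique_increases(oat1ko, oat3ko)             # strict p < 0.05
oat3_unique = unique_increases(oat3ko, oat1ko, alpha=0.1)  # relaxed cutoff for Oat3KO
print(oat1_unique, oat3_unique)
```

In this toy case `metabolite_b` is increased in both knockouts and is therefore excluded from both lists, while `metabolite_c` enters the Oat3KO list only under the relaxed cutoff.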
Although there were unequal numbers of these unique metabolites in the two knockouts, the numbers of unique Oat1KO metabolites and Oat3KO metabolites were kept balanced in training sets for machine-learning application. The Data Sampler widget in Orange (44) was used for randomly sampling 48 metabolites from the 90 total Oat1KO metabolites; thus, the total number of metabolites used in the machine-learning classification analyses was 96 (48 unique Oat1KO and 48 unique Oat3KO).
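The balancing step was performed with the Data Sampler widget in Orange; an equivalent random undersampling can be sketched in plain Python. The function name, the fixed seed, and the placeholder metabolite identifiers below are assumptions for illustration and do not reproduce the widget's internals.

```python
import random

def balanced_sample(majority, minority, seed=0):
    """Randomly undersample the larger class to the size of the smaller one."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    sampled = rng.sample(majority, k=len(minority))  # sampling without replacement
    return sampled, list(minority)

# Placeholder identifiers; the real metabolite names are not reproduced here.
oat1_all = [f"oat1_metabolite_{i}" for i in range(90)]
oat3_all = [f"oat3_metabolite_{i}" for i in range(48)]

oat1_train, oat3_train = balanced_sample(oat1_all, oat3_all)
print(len(oat1_train), len(oat3_train), len(oat1_train) + len(oat3_train))
```

Sampling without replacement guarantees that no Oat1KO metabolite appears twice in the balanced set of 96.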

Generation of molecular properties
A set of 67 structural and physicochemical properties of identified OAT1 and OAT3 metabolites was calculated with tools found in the computational environment of the commercially available computational chemistry software, ICM (version 3.8-6) (Molsoft LLC, San Diego, CA). Complexity values were obtained from PubChem, which describes "complexity" as follows. "The complexity rating of a compound is a rough estimate of how complicated a structure is, seen from both the point of view of the elements contained and the displayed structural features, including symmetry. This complexity rating is computed using the Bertz/Hendrickson/Ihlenfeldt formula (58)."

Elimination of highly-correlated molecular properties to reduce the number of features for machine-learning analyses
An initial evaluation of the molecular properties revealed highly-correlated descriptors. For example, a number of molecular properties, such as molWeight, molVolume, molArea, and number of atoms, are highly correlated with each other, essentially providing different but related measures of correlated properties. Pairwise correlation coefficients for the molecular features were determined, and a number of highly cross-correlated features were eliminated. In this way, the list of more than 60 molecular properties was reduced to a smaller number (~25) that was better suited to the number of molecules in the training set and more manageable for visualization and data analysis.
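One way to carry out this kind of correlation-based pruning is a greedy filter that keeps a property only if it is not highly correlated with any property already kept. The sketch below is an assumption about how such a filter could look, not the procedure actually used; the 0.9 threshold follows the example given in the Discussion, and the toy values are invented.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)  # undefined for constant columns

def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |r| <= threshold with every feature kept so far."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson_r(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Invented toy values for five molecules; the property names follow the text.
toy = {
    "molWeight": [180.0, 250.0, 320.0, 410.0, 500.0],
    "molVolume": [170.0, 240.0, 300.0, 400.0, 480.0],
    "nRings": [0, 3, 1, 4, 2],
}
kept = drop_correlated(toy)
print(kept)
```

Because the filter is greedy, the representative retained from a correlated group depends on the order of the properties; in practice the choice of representative would be guided by interpretability.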

Machine learning
Orange, versions 3.13 and 3.16, was used. The Orange software package uses Python libraries, including Scikit-Learn, numpy, and scipy (45). A series of analyses, including PCA, ranking based on information gain, FreeViz (44), outlier analysis, and machine-learning analyses, i.e. k-means classification, decision trees, random forests, neural networks, Naive Bayes, and logistic regression, were performed. Other, mostly confirmatory and complementary, machine-learning analyses were performed using the Python-based package SciKit-Learn (47). Default parameters were generally kept.

Data visualization and SQL queries
Data visualizations were mostly done in Orange (distributions, scatterplots, and FreeViz). Additional data visualizations were done in Python using a Jupyter notebook with the pandas, matplotlib, and seaborn libraries imported. For SQL queries, Excel or csv files were imported into SQL Studio for data analysis, or the queries were coded in Python using SQLAlchemy.

Identification of optimal seven properties with highest random forest classification accuracies by a systematic enumeration of property combinations
First, the set of 138 metabolites was divided into a training set of 48 each (OAT1, OAT3) and resampled test set of 42 each. All possible combinations of features were run, and the 2500 combinations with the highest AUC were recorded. Then, we took each of these 2500 sets and used k-fold cross-validation from the SciKit-Learn library as our metric for accuracy. We could not simply run k-fold on the entire set of combinations due to computational limitations. To find the ten best sets of seven features, the classification accuracy along with its standard deviation was found by re-sampling k-fold cross-validation 200 times to address variance from the random sampling. (When the number of iterations was varied from 50 to 2000, the standard deviation remained at about 1.5%, thus providing confidence that these were consistent scores.) The feature sets with the highest average Random Forest scores were then chosen as the best feature sets.
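The repeated k-fold procedure can be illustrated with a dependency-free sketch. Note the substitutions: a nearest-centroid classifier stands in for the Random Forest used in the study so that the example runs without SciKit-Learn, the value k = 5 and the seeds are assumptions (the text does not state k), and the two-feature data are synthetic.

```python
import random
from statistics import mean, stdev

def kfold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def centroid_accuracy(X, y, train, test):
    """Stand-in classifier (nearest class centroid), NOT a Random Forest."""
    centroids = {}
    for label in sorted(set(y[i] for i in train)):
        rows = [X[i] for i in train if y[i] == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    def predict(row):
        return min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])))
    return mean(1.0 if predict(X[i]) == y[i] else 0.0 for i in test)

def repeated_kfold(X, y, k=5, repeats=200, seed=0):
    """Mean and standard deviation of k-fold accuracy over repeated reshuffles."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        folds = kfold_indices(len(X), k, rng)
        fold_scores = []
        for f in range(k):
            train = [i for g in range(k) if g != f for i in folds[g]]
            fold_scores.append(centroid_accuracy(X, y, train, folds[f]))
        scores.append(mean(fold_scores))
    return mean(scores), stdev(scores)

# Synthetic, well-separated two-class data (48 instances per class).
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(48)]
X += [[data_rng.gauss(4, 1), data_rng.gauss(4, 1)] for _ in range(48)]
y = ["OAT1"] * 48 + ["OAT3"] * 48

acc, sd = repeated_kfold(X, y, k=5, repeats=20)
print(f"mean accuracy {acc:.3f}, SD across repeats {sd:.3f}")
```

Reporting the standard deviation across repeats, as in the text, shows how much of the score is attributable to the random fold assignment rather than to the feature set itself.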

Brute-force approach with SciKit-learn Random Forest classifier
Of the many numeric molecular properties previously described (Table S2), 25 properties were selected by choosing single representatives from several highly correlated properties. These 25 were then used as the "pool" of features from which sets of seven features were drawn for a brute-force analysis. In this way, all possible combinations of seven features from the list of 25 attributes (480,700 combinations) were tested using the Random Forest classification algorithm in SciKit-Learn, and the classification accuracies were determined.
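The size of this search space, and the general shape of an exhaustive search over it, can be sketched as follows. The scoring function here is a toy placeholder standing in for a Random Forest accuracy estimate, and the helper name `top_feature_sets` is hypothetical.

```python
import heapq
from itertools import combinations
from math import comb

def top_feature_sets(pool, k, score, top_n=10):
    """Score every k-subset of `pool` and return the top_n (score, subset) pairs."""
    scored = ((score(subset), subset) for subset in combinations(pool, k))
    return heapq.nlargest(top_n, scored)

# Number of ways to choose 7 features from a pool of 25.
n_sets = comb(25, 7)
print(n_sets)  # 480700

# Toy demonstration with an invented additive score (not a classifier).
toy_pool = ["a", "b", "c", "d", "e"]
toy_weight = {"a": 1, "b": 5, "c": 2, "d": 4, "e": 3}
best = top_feature_sets(toy_pool, 2, lambda s: sum(toy_weight[f] for f in s), top_n=1)
print(best)  # [(9, ('b', 'd'))]
```

Because `combinations` is a lazy generator and `heapq.nlargest` keeps only the best `top_n` entries, the full list of 480,700 subsets never needs to be held in memory at once.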

Predictions of unique OAT1 and unique OAT3 drugs by the unique OAT1 versus unique OAT3 in vivo metabolite-based classification model
The metabolite-based Random Forest classifier model, built using the seven features described in the text and able to correctly classify unique OAT1 versus unique OAT3 metabolites >80% of the time, was used to test whether unique OAT1 and unique OAT3 drugs could be predicted. The model was applied using the Predictions widget in Orange. As described under "Results," the final drug dataset used consisted of drugs known to have ≥5-fold differences in affinity for OAT1 versus OAT3 when analyzed in vitro by cell-based assays. As a control, we used a drug dataset where there was no difference or minimal (<2-fold) difference in affinity between OAT1 and OAT3.