Sequence-based Network Completion Reveals the Integrality of Missing Reactions in Metabolic Networks

Background: Genome-scale draft metabolic networks are incomplete, even for well studied organisms. Results: Reactions selected by minimizing flux through unlikely reactions resulted in networks of superior quality. Conclusion: Genome-scale models have many network completion solutions but require the addition of unsupported reactions to be functional. Significance: Metabolicnetworksguidesyntheticbiologyefforts,andthequalityofnetworksdeterminestheirpredictivepower.

Metabolic network reconstructions are instrumental in aggregating metabolic knowledge about organisms (1)(2)(3). See "Glossary" for explanations of the technical terms pertinent in our field that we use in this article. Network reconstructions have steadily grown in size, reflecting increasingly comprehensive genome annotations (4 -7). In addition, reconstructions have grown in complexity. Current reconstructions contain detailed gene-to-protein-to-reaction mappings, thermodynamic constraints, and, in some cases, signal transduction layers (8 -10). The most sophisticated reconstructions have been extensively curated (6,11,12), but draft reconstructions are now mostly machine-generated or machine-assisted (7,(13)(14)(15)(16). The Model SEED uses annotations from the "Rapid Annotation Using Subsystems Technology" (RAST) 2 web service (17,18) as part of a network reconstruction pipeline for prokaryotes (15). In addition to a starting point for curated reconstructions, draft metabolic networks facilitate interpretation of the metabolic capabilities of newly sequenced organisms or communities of organisms (7,19,20).
Metabolic networks are reconstructed in a bottom-up fashion from identified genes following genome annotation (21). Knowledge of metabolic pathways can guide gene annotation, as implemented by the Pathway Tools software (22). Similarly, RAST simultaneously annotates genes that are part of a metabolic subsystem (18), utilizing mutually corroborating information on genes involved in closely related metabolic processes. As a logical extension of the subsystem approach, networkwide mutually corroborating information may be used to guide reconstructions. An application of this concept is to require draft networks to be able to carry out the production of all essential cellular building blocks, collectively referred to as biomass, from a well defined medium source (23).
Metabolic networks resulting from assembling all reactions inferred from gene annotations (draft networks) are currently unable to describe the synthesis of all biomass components. Draft networks contain gaps, isolated reactions, and reactions that cannot carry flux under any circumstances (1). Although isolated or blocked reactions are easily identified (24), it is not obvious whether they result from under-annotation or overannotation. Hence, an isolated reaction may need to be connected through additional reactions that were under-annotated in the draft network, or the isolated reaction resulted from a spurious annotation. Gaps in the network pose the opposite problem; although a network can be readily completed to enable production of all biomass components (25), the location of the actual missing reaction may be illusive. The appearance of gaps in metabolic networks is not exclusively the result of under-annotation. Incorrect reaction reversibility assignments (thermodynamic constraints) (26) or stoichiometric constraints resulting from dead-end metabolites may also prevent production of biomass components (25,27). Last, part of the biochemistry of an organism may not have been associated with genes, or the biochemistry may be yet to be discovered. Adding reactions to fill these gaps is known as "gap filling" and has been the subject of considerable inquiry and has been reviewed in detail elsewhere (24).
Commonly, mixed integer linear programming (MILP) optimization is used to perform bottom-up gap-filling (27). In this approach, reactions are iteratively added until production of biomass becomes feasible, often while minimizing the number of reactions required (15,25,27). Several other optimization strategies have been reviewed here (28). In the case of the Model SEED, reactions are prioritized based on their nature. For instance, adding an internal reaction incurs a lower cost than adding a transporter. Bottom-up gap filling works well for well annotated genomes, but for networks that require extensive gap filling, a top-down approach is more robust (29). In the more recently developed top-down methods, all gap-filling reactions are added, followed by the successive preferential removal of unneeded gap-filling reactions with little or no sequence similarity in the genome of the organism for which the network is reconstructed (14,29,30). Prioritization of the removal of reactions without sequence similarity minimizes the inclusion of locally (enzymes with an associated sequence that is not present in the target genome) and globally (reactions without sequence association) orphaned reactions. Very recently, a bottom-up MILP approach also used sequence similarity as a likelihood metric for the existence of a gap-filling reaction in the target genome (31). Gap analysis itself has been used to identify knowledge gaps in human metabolism (32) and to leverage contextual information of networks to hypothesize gene function (33).
This work investigated the need for adding gap-filling reactions to draft networks, the extent for which sequence similarity to enzymes can be found for these reactions, and how orphaned enzymes influence gene essentiality predictions by metabolic networks. To assess the presence of sequence support for gap-filling solutions, new linear programming (LP)and quadratic programming (QP)-based gap-filling problems were formulated that minimize the utilization of unsupported reactions. All gene sequences associated with the Model SEED gap-filling reaction database (11,858 reactions, received on April 20, 2012) (15) were queried against four prokaryotic genomes, and unique gap-filling solutions were retrieved that minimized the utilization of unsupported reactions. Unlike recently reported BLAST-weighted MILP-based work (31), the networks resulting from our approach outperformed networks gap-filled by the Model SEED (15).

Experimental Procedures
Metabolic Networks, Biochemistry Database, and Gene Annotations-Metabolic networks for Streptococcus pneumoniae, Bacillus subtilis, Escherichia coli MG1655, and Acinetobacter baylyi ADP1 were downloaded from the Model SEED web site on May 3, 2013 (15) along with medium conditions and biomass formulations. The Model SEED gap-filling biochemistry database, experimental gene essentiality results, and associated medium formulations were kindly provided by Chris Henry (Argonne National Laboratory, Lemont, IL). Gene annotations for 891 prokaryotic species were downloaded from the RAST sapling server (34), totaling 690,445 genes encoding 7,218 functional roles. The biochemistry database maps genes to reactions through the use of functional roles and enzyme complexes made up of functional roles (15). Table 1 includes a summary of the size of the downloaded Model SEED database and draft metabolic networks.
Identification of Functional Roles-For each functional role in the biochemistry database, a BLAST amino acid database was generated using all protein sequences associated with that particular role. The complete genomes of target organisms were queried against each functional role BLAST database using BLASTX (35) with the BLOSUM62 scoring matrix (36). The E-values for the best BLAST high-scoring segment pairs from each functional role database query were used to weight biochemical reactions. E-values were chosen because they are comparable between different calls against distinct functional role databases and correct for multiple comparisons by penalizing the score by both the length of the enzyme database and the length of the target genome (37). Only the lowest E-value was recorded. To adjust the weights for each enzyme complex independently, duplicate reactions were created so that each complex had an independent mapping with a reaction. Reactions were weighted with the geometric mean of the E-values for the constituent roles of an enzyme complex. This treats the E-values as probabilities in determining the support for the existence of an enzyme, which is here defined as enzyme sequence support (ESS). Reactions with an ESS of less than 1.0EϪ240 were set to the value of 1.0EϪ240. Reactions were weighted by the logarithm of the ESS values of the associated enzymes, where W R is the weight for a reaction, E R is the ESS for a reaction, and E min is the minimum E-value. This formulation results in small weights for well supported reactions relative to unsupported reactions while constraining the weights to a smaller numerical range, which improved the numerical stability of the LP and QP solver software.
Gene Essentiality and Metabolite Production-Flux balance analysis (FBA) (38) was used to check for the existence of a synthesis route for individual biomass components. A gene was classified as computationally essential if removing the reaction(s) uniquely associated with an enzyme complex resulted in a network that could not carry flux greater than 1.0EϪ6 to biomass. Similarly, an individual metabolite was classified as producible if a flux solution could be found that carried a flux Ͼ1.0EϪ6 of the tested metabolite through an export reaction that was temporarily added for testing purposes (25).
Gap-filling Algorithm-The BLAST-weighted LP gap-filling algorithm was formulated as follows, where w is a column vector of weights (Equation 1), and v is a column vector of reaction fluxes including separate terms for forward and reverse reactions. The stoichiometric matrix (S) relates reactions to metabolites through stoichiometric coefficients. A negative value in the stoichiometric matrix specifies a metabolite that is consumed by a reaction, and a positive value describes the production of a metabolite by a reaction. S has dimensions m (metabolites) by 2n (n reactions, 2n for both directions). The constraint Sv ‫؍‬ 0 enforces that all metabolites have a net balance of production and consumption, known as a mass balance constraint. v max is a vector of upper bounds on reaction fluxes, v bio is the required flux through the biomass reaction. Similarly, a weighted QP gap-filling algorithm was formulated as follows, where W is a diagonal matrix of the weights. Other terms were identical to the LP formulation, except that reactions did not need to be divided into forward and reverse directions and were constrained directly using two vectors: v min and v max . The QP formulation results in fluxes that minimize the sum of weighted squared fluxes, which effectively distributes fluxes across available biomass routes inversely proportional to the weights and number of reactions of a given route.
Software-LP and QP problems were solved with CPLEX TM (IBM, Armonk, NY). LP problems were solved using the dual simplex solver to minimize constraint violations. Custom Matlab TM (Mathworks, Natick, MA) and Python scripts were used for the preparation of matrices and databases. Producibility of metabolites from medium components was tested by FBA using the COBRA Toolbox (39).
BLASTX comparisons for the four target genomes were run on a commodity quad core Intel i3 desktop computer, taking roughly 8 total h to complete. Further processing of the BLASTX output using custom Python scripts required approximately 4 h of computation time.

Results
Metabolic Networks Require Gap Filling-Draft metabolic networks for S. pneumoniae, B. subtilis, E. coli MG1655, and A. baylyi ADP1 were downloaded from the Model SEED. Corresponding experimental gene essentiality results of genomescale single gene knock-out libraries (40 -43) along with mappings to the model genes were provided by the Model SEED upon request in October 2011. The four organisms were selected because the associated gene knock-out libraries were generated through full-length gene deletion methods rather than transposon insertion methods. Transposon knockouts can display complex gene knockdown behavior that complicates the interpretation of gene essentiality predictions (44). The gap-filling reactions added by the Model SEED were stripped from the downloaded models. In addition, all genes not directly associated with metabolic reactions were not evaluated to limit gene essentiality evaluations to precursor metabolism only.
The draft networks now contained gaps resulting from under-annotation of the genome, incorrect reaction reversibility constraints resulting from inaccurate Gibbs free energy estimates, and stoichiometric constraints caused by dead-end metabolites. Consequently, there were three approaches to gap-fill metabolic networks by addressing each of the three causes. To test whether removal of thermodynamic constraints alone could enable biomass production, all reactions were made reversible. The existence of a route to biomass was tested using FBA by maximizing flux through the biomass reaction. No such route existed for any of the four networks, demonstrating that removal of thermodynamic constraints alone was insufficient to gap-fill the tested networks. Removal of stoichiometric constraints caused by dead-end metabolites by allowing all metabolites to leave the network was also insufficient. Furthermore, a combination of relaxing thermodynamic constraints and allowing metabolites to leave the network (all reactions in the metabolic network were made reversible, and all metabolites could leave the network) also did not result in feasible biomass production in the networks. Hence, the addition of reactions to all tested metabolic networks was necessary.
Network Completion Requires Reactions with no ESS-For all further gap-filling approaches, the reactions from the Model SEED gap-filling database (11,858 reactions; Table 1) were used as candidates for gap filling (Fig. 1). The database included a subset of curated transport reactions and had been thermodynamically constrained using the group contribution method (45). The sequences of the RAST-annotated genes of all organisms in the Model SEED database were extracted, and a sequence database was generated for each gene. Using the RAST mapping between genes, functional roles, enzyme complexes, and reactions, an organism-specific weight (Equation 1) was calculated for each reaction of the gap-filling database ( Fig.  1; see "Experimental Procedures" for details).
To test whether networks could be completed by restricting incorporation of reactions with a predefined level of support, reactions were divided into three tiers: highly supported reactions (ESS of 1.0EϪ240), significantly supported reactions (30) (ESS Յ 1.0EϪ10), and unsupported reactions (ESS Ͼ 1.0EϪ10). For each tier, all reactions were added to the base models, and FBA was used to evaluate whether biomass could be produced. This revealed that no networks could be completed with only highly supported reactions or even significantly supported reactions ( Table 2). This suggested that the tested networks required locally orphaned enzymes (no similarity to known enzymes in the organism) or globally orphaned enzymes (no known sequence) to produce all biomass components. Hence, orphaned metabolic functionality was integral to the core of metabolic networks and included reactions essential to biomass production. However, after releasing all thermodynamic con-straints, including those in the gap-filling reactions and stoichiometric constraints caused by dead-end metabolites, networks containing only significantly supported reactions were able to produce all biomass components. The role of thermodynamic constraints in network completion was investigated in detail and will be reported in a specialist journal.
Non-producible Biomass Metabolites Are Distributed across Metabolism-The four tested organisms were unable to produce a significant portion of their biomass components, in each case spanning multiple classes of metabolites (Figs. 2 and 3). Some of the biomass components in S. pneumoniae, B. subtilis, and A. baylyi that were not producible in the base models were producible with the models that were augmented with the strongly supported reactions only. This suggested that the original networks were under-annotated. This was especially true for S. pneumoniae, which could not produce half of its biomass metabolites (39 of 79), yet 33 metabolites could be produced solely using significantly supported reactions ( Fig. 3 and Table  2). Only 6 -8 biomass metabolites could not be produced with only supported reactions in each organism. The ability of the augmented networks to produce often many more biomass components than the base models, even if only the highly supported reactions were used, suggests that there was sufficient potential for ESS values to guide gap-filling solutions.
Two biomass components, acyl carrier protein and peptidoglycan polymers, could not be produced by any of the organisms. Spermidine and thiamine pyrophosphate (TPP) could also not be produced by any organism but were imported from the media by S. pneumoniae (Fig. 2). The Gram-positive bacteria S. pneumoniae and B. subtilis required the cell wall precursor glycerol teichoic acid. Acyl carrier protein, peptidoglycan polymers, calomide, and glycerol teichoic acid could be produced in isolation with significantly supported reactions if the biomass reaction was replaced by independent export reactions for all biomass components. However, acyl carrier protein, peptidoglycan polymers, and calomide biosynthesis were not required for total biomass production in the models because their precursors were regenerated by the biomass equation itself. All other non-producible metabolites are discussed in detail below.
Note that not all biomass components were necessarily essential; for instance, spermidine is part of the canonical E. coli biomass equation but may not be essential (5,46). In some cases, genetic evidence may support the classification of a metabolite as essential if a sole pathway synthesizes the metabolite; riboflavin is one such example. In the biosynthesis of riboflavin in E. coli, all genes associated with riboflavin biosynthesis FIGURE 1. Gap-filling algorithm. Weighted biochemistry databases were generated for target organisms by comparing the target genomes with functional role-specific BLAST databases for each known enzyme functional role in the RAST database. The best HSP returned from each database search was translated into a weight value for the reactions associated with the enzyme function. LP was used to select an optimally supported gap-filling solution from the weighted database, and QP was used to identify a space of possible gap-filling solutions.

TABLE 2 Biomass components that require gap filling
To investigate which biomass metabolites required gap-filling, FBA was used to maximize the export of each individual biomass component, given the stoichiometric and thermodynamic constraints imposed on the network. An exchange reaction was added for each biomass component that was tested, and FBA was used to maximize flux through each component exchange reaction in turn. Metabolites that could not be exported at a flux greater than a numerical cut-off of 1.0EϪ6 were considered non-producible. Production of the individual biomass components was attempted using gap-filling reaction sets with three different levels of support as well as the base models with no gap-filling reactions. were experimentally essential (40), suggesting that riboflavin is indeed an essential biomass metabolite. It was surprising that the complete synthesis of riboflavin required unsupported reactions, despite the final steps in riboflavin synthesis being present in the metabolic network (also see below). BLAST-weighted Gap Filling-A weighted LP problem was formulated to incorporate reactions into gap-filling solutions depending on their ESS. Each reaction in the gap-filling database was weighted inversely proportionally to the associated ESS scores (see "Experimental Procedures"). The improbability (approximated by 1 E-value) of a sequence similarity score occurring by chance was treated as the level of support for an enzyme activity existing in the network. This simplification was vulnerable to detecting false positives caused by strong similarity to a short sequence or domain only, but assessment at network level immunizes this approach to most effects of false positives. Hence, for an incorrect pathway to be included, all reactions in a pathway would have to be false positives. Treating support for a reaction as a probability, the support for a pathway was expected to scale with the product of the support of the underlying ESS values. Sequence support for the existence of a pathway of n reactions may then be described as the product of the ESS values for the individual reactions: ⌸ i ϭ 1 n . To avoid penalties against existing reaction annotations, reactions already included in the draft network received a weight of zero. The linear and quadratic programming objectives were minimized while requiring a set flux through the biomass reaction (see "Experimental Procedures"). Utilization of LP and QP made gap filling very fast. On a typical desktop computer, solutions were retrieved in seconds, compared with minutes for MILP.

Biomass
Quadratic Programming Reveals the Gap-filling Solution Space-The QP formulation of the weighted gap-filling algorithm minimized the squared sum of weighted reaction fluxes. Squaring the weighted reaction fluxes limits large fluxes because penalties increase quadratically with flux. Conversely, small fluxes are penalized lightly, even if the associated weights are high. This results in the distribution of flux through alternative gap-filling solutions inversely proportional to the combined weights of reactions in a given pathway (Fig. 4). The number of reactions used in a QP gap-filling solution can thus provide a lower bound estimate on the number of reactions that can participate in gap-filling solutions. QP revealed that several thousand reactions could participate in gap-filling reactions for each organism (Fig. 5). Importantly, QP is not guaranteed to identify all potential gap-filling routes. Combinations of irreversible reactions and reaction weights can lead to hidden gapfilling reactions (Fig. 4). Removing reversibility constraints and adding random reaction weights allowed for an extreme estimate of gap-filling solutions in the gap-filling database. It was revealed that even more reactions could potentially participate in gap filling. For E. coli, this extreme QP solution included 7,337 reactions, almost double the number of reactions in the constrained QP solution.
The LP solutions were necessarily always contained in the QP solution for a given set of reversibility constraints and reaction weights. LP solutions that were based on uniform weights were mostly, but not always, contained in the QP solution (Figs. 4 and 5). The solutions of the Model SEED were more frequently outside the QP solution space (Fig. 5). The uniformly weighted LP solutions contained the lowest number of gapfilling reactions for the four tested networks and were probably the minimal reaction solutions in most cases. Strictly, the uniformly weighted LP solution is a minimal flux solution, making it imaginable that an alternative LP solution with fewer reac-FIGURE 2. Comparison of metabolites that required unsupported reactions to become producible. All four organisms shared a small subset of metabolites that required unsupported reactions. Further shared metabolite groups were Gram-specific, with Gram-negative species requiring fewer unsupported metabolites. Not all organisms had identical biomass equations; metabolites colored black were shared in all biomass equations, but metabolites colored green were specific to E. coli and metabolites colored blue were specific to B. subtilis and S. pneumoniae. RIBF, riboflavin; ACP, acyl carrier protein; GTA, glycerol teichoic acid.

FIGURE 3. Role of reactions in E. coli gap-filling solutions.
Removal of a single reaction from the gap-filling solutions revealed metabolites for which that reaction was essential for metabolite production. This relationship is shown as lines connecting the gap-filling reaction axis to the biomass metabolites axis. Reactions are grouped by ESS, and metabolites are grouped by class. A third axis illustrates the amount of flux through gap-filling reactions required for the production of a set biomass flux. In all cases, the gap-filling solutions included reactions with maximum ESS values and for which no alternatives existed, despite the large space of potential gap-filling solutions. The BLAST LP gap-filling solutions minimized flux through unsupported reactions, yet a small flux through unsupported reactions was always required.
tions, but that carry more flux, may exist. In contrast, the weighted LP solutions often contained high flux reactions, but only if such reactions were associated with very low weights (Fig. 3). The weighted LP solution always contained substantially more reactions than either the uniformly weighted LP or the Model SEED gap-filling solutions, suggesting that a strong enough ESS signal existed to significantly influence gap filling. LP and Model SEED gap-filling solutions often shared several reactions, indicating that some biomass components may only be made producible in a limited number of ways, but no solutions were close to identical (Figs. 5 and 6).
The sheer size of the QP solution made it clear that many different gap-filling solutions existed (Fig. 6). 78% of the metabolites of the original network were used in the quadratic gapfilling solution of S. pneumoniae (Fig. 6), suggesting that there were no obvious metabolites that should serve as connecting metabolites to gap-filling reactions. With the size and level of connections in mind, it was surprising that gap-filling solutions shared reactions at all. Further investigation revealed that the shared reactions were often associated with high ESS values and sometimes represented a sole gap-filling solution to a subset of biomass components (Table 2 and supplemental Tables S1 and  S2). The high ESS values of the shared reactions indicated that, in reality, alternative pathways to these biomass components may exist and highlighted the possible existence of missing biochemistry in the gap-filling database.
Comparison of Computational and Experimental Gene Essentiality-To investigate the quality of the gap-filling solutions, gene essentiality predictions from the gap-filled networks were compared with experimental data of full-length single   gene knock-out libraries. Gene deletions were simulated by removing all reactions that required a given gene. Gene essentiality was then predicted from the feasibility of biomass production from the specified media using FBA. Gene deletions that resulted in networks that could no longer produce biomass were considered computationally essential.
Networks that were gap-filled with weighted gap filling (referred to as BLAST LP) predicted gene knock-out outcomes better than networks filled by the Model SEED or uniformly weighted gap filling (Table 3). Weighted gap filling outperformed the alternatives methods both at the essential and nonessential gene predictions. The improved performance for both essential gene and nonessential gene predictions was striking because the weighted gap filling added significantly more reactions to networks, yet this did not result in fewer true essential gene predictions (except for in B. subtilis). More importantly, it suggested that the ESS signal was strong enough to enhance gap filling of draft metabolic networks, although all of the solutions included reactions associated with maximum ESS values. All genes associated with supported reactions in the BLAST LP gap-filling solutions for the E. coli network were consistent with RAST annotations. However, the additional functionalities associated with AceE and SucB (supplemental Table S1) were not supported by the literature (47) and were probably incorrect.
A Subset of Knock-out Predictions Are Sensitive to Weight Changes-Two sensitivity analyses were performed to investigate the robustness of the network support (NS) for the weighted gap-filling solutions and the influence of the gap-filling solutions on gene essentiality predictions. NS is here defined as how well a gap-filling reaction selection is determined by the entire network. NS for a reaction is calculated from the gap-filling penalty function increase after exclusion of that reaction from the gap-filling database. To test which gene essentiality predictions were sensitive to any gap-filling solution, 100 Monte Carlo gap-filling simulations of the E. coli network with randomly shuffled weights were calculated. This resulted in 97 genes with alternating essentiality prediction (supplemental Table S2), suggesting that a significant portion (9.1%) of gene predictions were sensitive to gap filling.
In a second sensitivity analysis, weights were shifted by a small random amount from the sequence-derived weights to test sensitivity to variations in the ESS calculation. Only 18 genes changed predicted essentiality status over 100 runs (sup-plemental Table S2). This suggested that the sequence-based gap-filling approach was fairly robust to variations in the BLAST sequence comparisons. The number of genes changing essentiality prediction was fairly insensitive to the magnitude of the added noise, ranging from a normal distribution centered at zero, with an S.D. from 0.1 to 10 units (reaction weights scaled between 0 and 553), indicating that most essentiality predictions were well determined by the gap-filling approach.
The BLAST LP optimal gap-filling solution utilized 15 reactions, eight of which were present within all shifted weight Monte Carlo gap-filling solutions. The additional reactions varied substantially over the Monte Carlo runs, but the number of reactions that were featured at least once in the gap-filling solutions was insensitive to the magnitude of the noise (supplemental Table S2). Of the eight reactions that were always retrieved, four had minimal ESS values, and four had maximum ESS values, including one mandatory reaction for which no alternative existed. Removal of any of the eight reactions that were always included resulted in solutions with substantially higher objectives, indicating strong NS and explaining their consistent inclusion. 17 of the 18 genes with variable essentiality predictions were essential in the noiseless solution. Remarkably, 15 of these 17 computationally essential predictions were wrong, which was on par with random predictions, considering that only 10% of genes were experimentally essential. Note that overall, 70% of the experimentally essential genes were correctly predicted as essential. However, of the computationally essential genes, only 44.6% were experimentally essential. Disregarding the 18 genes with alternating essentiality calls improved the latter statistic to 48.3%.
Combined, these results indicated that the ESS signal was strong enough to determine Ͼ80% of otherwise variable essentiality predictions. The seven gap-filling reactions for which the inclusion was sensitive to reaction weights determined the 18 gene essentiality calls that were of very poor quality (supplemental Table S2). Therefore, the implied presence of orphaned enzymes in all networks did not nullify the ability to find meaningful gap-filling solutions, but the poorly determined reactions significantly deteriorated a subset of essentiality calls.
Analysis of Gap-filling Reactions with High ESS Values-A subset of the metabolites that could not be produced by significantly supported reactions still required unsupported reactions for production after breaking the biomass equation into independent export reactions. These unsupported metabolites

predictions by gap-filled networks
Compared with the Model SEED and uniformly weighted gap fillings, BLAST LP resulted in metabolic networks that had equal or improved predictions for both essential and nonessential genes in three of four organisms. Surprisingly, the uniformly weighted solutions, which always contained the fewest reactions, did not result in networks with more computationally essential genes. Essentiality predictions are compared with experimentally essential and nonessential observations. EE, experimentally essential observations; ENE, experimentally nonessential observations. were investigated in more detail by using the BLAST LP algorithm on the gap-filled networks to calculate flux utilization of the gap-filling reactions for unsupported metabolites (Fig. 7). The two reactions required by E. coli for riboflavin, FAD, and TPP synthesis had strong NS values and were therefore always included in shifted weight sensitivity analysis. The remaining reaction associated with spermidine synthesis was included in 71 of 100 solutions. This suggested that these reactions were strongly determined by the gap-filling approach. In contrast, only the reaction associated with riboflavin and FAD was always included in the shuffled weight sensitivity analysis because no alternative reaction was present in the gap-filling database. The reactions for TPP and spermidine synthesis were only included in 4 and 24 cases of 100 in the shuffled weight solutions, respectively. This suggests that despite the lack of ESS support for these reactions, NS strongly determined gapfilling reactions among many other poor alternatives.
Two reactions implicated by NS were supported by circumstantial evidence. The gap-filling export reaction for TPP may be through spontaneous diffusion due to the chemical properties of 4-hydroxy-benzyl alcohol, a byproduct of TPP synthesis. Riboflavin and FAD required a reaction for which no alternative existed in the Model SEED database. This reaction has been hypothesized in the literature, and only recently, a gene in E. coli has been associated with the activity (48).

Discussion
Draft metabolic networks of four species were investigated for the ability to produce a complete set of biomass metabolites. The observed inability of networks to produce biomass, even after the removal of thermodynamic and stoichiometric constraints caused by dead-end metabolites, necessitated the addition of gap-filling reactions. Although each network could be readily filled using the Model SEED biochemistry database, no networks could be filled solely with reactions that were supported by sequence similarity to known enzymes. The need for orphaned enzymes implied that all metabolic networks were missing essential biochemistry annotations. Possibly, these reactions are of unknown biochemistry, suggesting fundamental gaps in our biochemistry knowledge for even the best-studied organisms. This realization suggests that our biochemistry knowledge or inclusion of this knowledge in the database, rather than the quality of machine annotations, is limiting our ability to further improve automated network reconstructions. Note that given the very small flux requirement through unsupported reactions (Fig. 3), it is conceivable that some of the orphaned activities may be attributed to secondary catalytic activity of promiscuous enzymes.
The presence of orphaned enzymes in gap-filling solutions and the very large size of the solution spaces, made evident by the quadratic programming, prompted the question of how robust the gap-filling solutions were in response to noise and to what extent gene essentiality predictions were influenced by gap-filling solutions. One hundred repeated gap-filling runs using randomly shuffled weights for the E. coli network showed that a substantial number of reactions could be part of the gapfilling solution, which resulted in many alternate gene essentiality assignments (supplemental Table S2). However, in response to noise added to the correct weights, a much smaller subset of genes showed alternating gene essentiality. This suggested that many gene essentiality predictions sensitive to the gap-filling solutions were strongly determined by the sequencederived weights. Additionally, eight gap-filling reactions were always present in gap-filling solutions, suggesting that they were strongly determined by NS. Interestingly, the essentiality of the group of genes sensitive to gap filling was predicted very poorly, which suggested that the fallout of the partially arbitrary gap-filling process due to a simplified relationship between E-values and ESS as well as the addition of orphaned enzymes may be limited to a small subset of gene essentiality predictions.
LP-and QP-based gap-filling algorithms generated fast and meaningful gap-filling solutions. LP optimization resulted in gap-filled networks that better predicted gene essentiality compared to networks that were filled with existing gap-filling technology. The large majority of gap-filling reactions was supported by sequence similarity and had often been identified by RAST, yet these reactions had not been included in the Model SEED draft models. The fairly insignificant computational time to establish ESS values (2 h/organism on a quad core Intel i3 desktop computer) should be well worth the effort, although the network quality improvement may be modest. This is par- ticularly true for the inclusion of BLAST LP in network reconstruction pipelines.
This work demonstrated that orphaned enzymes were integral to essential metabolic functions and that a fully supported and functionally complete metabolic network could not be assembled even with the extensive compilation of enzymes and biochemistry from RAST and the Model SEED. Nonetheless, sequence similarity-driven gap filling improved the quality of the networks and identified deficiencies in our biochemistry knowledge. The large set of significantly supported gap-filling reactions in all gap-filling solutions showed the potential for network-based identification of candidate gene annotations. Nonetheless, truly realistic models will probably require further expansion of the Model SEED biochemistry database or the discovery of not yet observed metabolic reactions and their gene associations.

Glossary
Metabolic Network-Relational structure of interconnected chemical reactions that constitute metabolism.
Functional Role-Activity that an individual protein performs as part of an enzyme complex.
Model SEED-Online automated metabolic reconstruction service for prokaryotes.
Locally Orphaned Enzyme-Enzyme activity that is observed or predicted to occur in an organism but for which no gene has been discovered in that organism.
Globally Orphaned Enzyme-Enzyme activity that is observed or predicted to occur in biology but with which no gene has been associated.
Stoichiometric Matrix-Matrix representing the stoichiometry of the metabolites in the reactions of a metabolic network.
Metabolic Network Reconstruction-Systematic representation of enzyme-catalyzed reactions that occur in an organism.
Genome-scale Metabolic Model-Comprehensive metabolic model that includes reactions for all enzymatic genes.
Draft Metabolic Network-A network assembled from gene annotations, often created algorithmically.
Gene Essentiality-Binary property of a gene that is specific to a given environmental condition. A gene is essential if the model requires that gene to produce biomass.
Dead-end Metabolite-Metabolite that is either not consumed or produced by any reaction in a metabolic network.
Isolated Reaction-Reaction with at least one product AND substrate that do not participate in another reaction.
Blocked Reaction-Reaction with at least one product OR substrate that does not participate in another reaction.
Gap-Reaction(s) missing from a network reconstruction that result(s) in isolated or blocked reactions.
Gap-filling Solution-Set of reactions that, upon addition to a network, allow synthesis of a set of target metabolites from a defined nutrient source.
Top-down Gap Filling-Addition of the full reaction database to a draft network followed by pruning of reactions until a minimal gap-filling solution is obtained.
Bottom-up Gap Filling-Iterative addition of reactions to a metabolic network until a gap-filling solution is obtained.
Optimization-based Gap Filling-Gap-filling approach that aims to minimize or maximize a gap-filling objective through numerical optimization.
Linear Programming (LP)-Maximizing or minimizing of a (multivariate) linear equation given specified linear constraints and parameter bounds.
Quadratic Programming (QP)-Maximizing or minimizing of a (multivariate) quadratic equation given specified linear constraints and parameter bounds.
Mixed Integer Linear Programming (MILP)-Linear programming method where some parameters must be integers.
Weighted Linear Programming-Linear programming with weight coefficients for each linear equation.
Weighted Quadratic Programming-Quadratic programming with weight coefficients for each quadratic equation.
Flux Balance Analysis (FBA)-Linear programming method that calculates a flux solution optimized for an inferred metabolic function, such as maximum biomass yield.
Enzyme Sequence Support (ESS)-Measure of similarity between known enzymes that catalyze a reaction and the best matched sequence region in the target genome.
Network Support (NS)-Network-wide measure of support for the inclusion of a gap-filling reaction. NS for a reaction is determined from the gap-filling penalty function increase in the absence of that reaction.
Author Contributions-E. W. K. and I. G. L. L. designed and performed the experiments, analyzed the data, and wrote the manuscript.