Model-enabled gene search (MEGS) allows fast and direct discovery of enzymatic and transport gene functions in the marine bacterium Vibrio fischeri

Whereas genomes can be rapidly sequenced, the functions of many genes are incompletely or erroneously annotated because of a lack of experimental evidence or prior functional knowledge in sequence databases. To address this weakness, we describe here a model-enabled gene search (MEGS) approach that (i) identifies metabolic functions either missing from an organism's genome annotation or incorrectly assigned to an ORF by using discrepancies between metabolic model predictions and experimental culturing data; (ii) designs functional selection experiments for these specific metabolic functions; and (iii) selects a candidate gene(s) responsible for these functions from a genomic library and directly interrogates this gene's function experimentally. To discover gene functions, MEGS uses genomic functional selections instead of relying on correlations across large experimental datasets or sequence similarity as do other approaches. When applied to the bioluminescent marine bacterium Vibrio fischeri, MEGS successfully identified five genes that are responsible for four metabolic and transport reactions whose absence from a draft metabolic model of V. fischeri caused inaccurate modeling of high-throughput experimental data. This work demonstrates that MEGS provides a rapid and efficient integrated computational and experimental approach for annotating metabolic genes, including those that have previously been uncharacterized or misannotated.

Whereas genomes can be rapidly sequenced, the functions of many genes are incompletely or erroneously annotated because of a lack of experimental evidence or prior functional knowledge in sequence databases. To address this weakness, we describe here a model-enabled gene search (MEGS) approach that (i) identifies metabolic functions either missing from an organism's genome annotation or incorrectly assigned to an ORF by using discrepancies between metabolic model predictions and experimental culturing data; (ii) designs functional selection experiments for these specific metabolic functions; and (iii) selects a candidate gene(s) responsible for these functions from a genomic library and directly interrogates this gene's function experimentally. To discover gene functions, MEGS uses genomic functional selections instead of relying on correlations across large experimental datasets or sequence similarity as do other approaches. When applied to the bioluminescent marine bacterium Vibrio fischeri, MEGS successfully identified five genes that are responsible for four metabolic and transport reactions whose absence from a draft metabolic model of V. fischeri caused inaccurate modeling of high-throughput experimental data. This work demonstrates that MEGS provides a rapid and efficient integrated computational and experimental approach for annotating metabolic genes, including those that have previously been uncharacterized or misannotated.
The development of next-generation sequencing technologies has generated thousands of genome sequences. These are primarily annotated by a combination of bioinformatics methods that are both fast and can be applied genome-wide. Homol-ogy-based bioinformatics methods (e.g. BLAST) assume similar sequences share similar functions. Structure-based methods (1,2) and genomic context-based methods (e.g. conserved operon, gene fusions, and gene co-occurrence across genomes (3)(4)(5)) can be utilized to infer functions that are difficult to annotate using BLAST alone. These bioinformatics methods are often used in conjunction with high-throughput experimental data (including gene expression, protein-protein interactions, mass spectrometry, RNAi, and mutant fitness) to suggest gene functions based on connections between genes with known and unknown functions. For example, recent studies have used correlations in transposon mutant-fitness scores across multiple experimental conditions to improve genome annotations (6). Despite the power of these bioinformatics methods, and the increasing availability of high-throughput data, 40 -60% of newly sequenced genes still lack assigned functions (7)(8)(9). In addition, although bioinformatics methods can quickly predict specific gene functions, biochemical characterization must be still performed separately to validate those predictions. In fact, a majority of the gene functions assigned have no experimental evidence. For example, as of August 2016, only ϳ27% of the entries in the UniProtKB Swiss-Prot knowledgebase contain experimental evidence at the protein or transcript level (10). Direct experimental testing of gene functions that are proposed bioinformatically or based on high-throughput experiments is needed to reduce the high rate of incomplete and incorrect annotations (8,11,12). Such assessment is important because functions incorrectly assigned to gene sequences enter databases that are subsequently used to assign functions to new sequences. As a result, errors that are hard both to identify and to correct will propagate.
Consequently, it is crucial to develop approaches that quickly identify missing and/or erroneously assigned gene functions and to provide fast and direct experimental validation for the correct function. Such goals can be achieved by combining genome-scale metabolic modeling with experimental techniques. Genome-scale metabolic models are developed primarily based on genome annotations obtained from bioinformatics tools. Current model-based algorithms, including GapFill, SMILEY, and GrowMatch, can use cell culture data to pinpoint knowledge gaps caused by missing or incorrectly called gene functions; however, these algorithms cannot identify candidate genes for these functions (13)(14)(15). More recent algorithms, including PHiller-GC, Model SEED, ADOMETA, and MIRAGE, identify missing metabolic reactions and candidate genes that might be responsible for catalyzing them (16 -19), but these algorithms require additional data such as annotated sequences from other organisms and/or expensive gene-expression datasets that might not be available. Importantly, all of these current model-based approaches still do not provide direct experimental validation of the function of the candidate gene. Here, we propose a high-throughput model-enabled gene search (MEGS) 2 method that rapidly identifies functions for unannotated or misannotated genes. The metabolic modeling procedures in MEGS quickly generate a list of missing or erroneous functions in genome annotations derived from bioinformatics tools and design functional selection experiments (experiments where only strains that gain an essential function from a genomic library are able to grow) to select for genes with these functions. Subsequent functional selection experiments identify the responsible gene(s) from a genomic library and provide fast and direct experimental evidence for the gene's function. In contrast to metagenomic functional selections, which have been used to identify ribulose-bisphosphate carboxylase/ oxygenase, DNA polymerase, and antibiotic-resistance genes (20 -22), MEGS' functional selections are based on knowledge gaps identified by metabolic modeling. By using genomic functional selections, MEGS does not rely on sequence similarity or genomic context to find gene functions and, as such, can be used to discover functions for previously uncharacterized groups of genes. As such, MEGS complements existing bioinformatics tools to improve genome annotations. MEGS was successfully used to identify the enzymatic and transport functions of five genes in Vibrio fischeri, which were subsequently confirmed by a combination of experiments involving complementation, mutant growth phenotyping, qPCR analysis, and enzyme assays.

Overview of MEGS
MEGS involves three steps that combine metabolic modeling and experimentation (Fig. 1). First, a genome-scale metabolic model of an organism of interest is developed, and physiological experiments are performed to validate the model. Computational tools (14,15) are then used to pinpoint metabolic function(s) that are missing from the model but are needed to resolve model data discrepancies. The physiological experiments suggest these missing functions occur, but the discrepancies between model predictions and experiments indicate that the functions are absent from current genome annotations. For example, the V. fischeri model originally lacked genes involved in catabolism of both D-xylose and mannitol; however, 2  Step 1 Step 2 Step 3 rxn 1 rxn 2 rxn 3 rxn 2 ?

Figure 1. Overview of MEGS.
In the first step, a metabolic model for an organism of interest is constructed. Target reactions, which are either missing from the model or assigned to the wrong genes, are inferred from discrepancies between model predictions and experimental measurements. A recipient strain (derived from a well-characterized organism, e.g. E. coli) and selective medium are then designed for each target reaction. Such recipient strains can only grow in the selective medium if they acquire heterologous enzymes that catalyze the target reaction. Finally, a genomic library is created, and the recipient strain is used to select for genes capable of catalyzing the target reactions, because such genes will enable growth of the recipient strain in the selective medium. The discovered genes can then be further characterized and added to the model to improve predictions.
only genes involved in mannitol catabolism were identified by the model as missing because V. fischeri grew on mannitol (but not D-xylose) as a sole carbon source. Second, a recipient strain (derived from a well-characterized organism, e.g. Escherichia coli) and selective medium (e.g. a minimal medium supplemented with a single carbon source) are designed such that the recipient strain can only grow in the selective medium if the gene(s) presumptively encoding the missing metabolic function (from the organism of interest) is transferred to the recipient strain. Such pairs of recipient strains and selection media can be computationally designed for a reaction of interest using a forced coupling algorithm (23). Third, a genomic functional selection experiment is performed to locate the gene(s) in the genome that are responsible for the missing metabolic function and provide direct evidence for the gene's function. For this last step, a genomic library of the organism of interest is created by inserting random genomic DNA fragments into plasmids. This plasmid library is then transformed into the recipient strain. Gene(s) responsible for the missing metabolic function can then be identified by sequencing plasmids that complement growth of recipient strains in the selective medium. The discovered genes can then be further characterized and added to the model to improve predictions. As more experimental data are generated, any new model data discrepancies that arise can be used to drive additional MEGS cycles.

MEGS applied to discover V. fischeri gene functions
In this work, we applied MEGS to discover and characterize several enzyme-and transporter-encoding genes of V. fischeri ES114, which is a bioluminescent marine bacterium that forms a symbiotic relationship in the light-emitting organ of the Hawaiian bobtail squid, Euprymna scolopes (24). Its metabolic capabilities are representative of many other marine bacteria, including both beneficial and pathogenic members of the Vibrio genus, and are thus of particular interest (25). From the KEGG (Kyoto Encyclopedia of Genes and Genomes)-annotated genome, we reconstructed a genome-scale metabolic model for V. fischeri ES114, named iVF846 (supplemental Excel file S1). Reactions and metabolites from an E. coli model, iJO1366 (26), were transferred into the draft model of iVF846 when orthologs to E. coli metabolic genes were found in V. fischeri. The draft iVF846 model was then curated based on the following: (i) data and information reported in the literature, and (ii) new growth-phenotyping experiments using Biolog plates, a method for individually testing the ability to metabolize 96 different carbon sources using a microtiter dish format. To facilitate model curation, a modified version of the SMILEY (14) algorithm (see under "Experimental procedures") was used to identify missing enzymatic or transport reactions that, if added to the model, would resolve discrepancies between model predictions and experimental growth phenotypes of V. fischeri ES114 wild type and mutants. This analysis identified that V. fischeri was missing an annotated aspartate 1-decarboxylase (encoded by panD in E. coli), which caused the draft model to predict no growth either in LBS or in a V. fischeri defined minimal medium (DMM) (Fig. 2a). The growth-phenotyping experiments were performed to identify sole carbon sources that support growth of V. fischeri (see under "Experimental procedures"). These results (supplemental Excel file S2) were compared with model-predicted sole carbon sources using flux-balance analysis (FBA) (27), and discrepancies were found for mannitol and N-acetylneuraminate. The modified SMILEYalgorithmpredictedthatmannitol1-phosphate5-dehydrogenase (encoded by mtlD in E. coli) and an N-acetylneuraminate transporter (encoded by nanT in E. coli) were missing from the draft model (Fig. 2, b and c). Finally, FBA was used to predict essential V. fischeri genes in LBS medium, and gene essentiality predictions were compared with a recent transposon insertion study (supplemental Table  S1) (28). One false-positive prediction was for glutamine synthase (VF_0098), where the model predicted the gene was essential but experimentally it was found to be non-essential. Based on this discrepancy, the modified SMILEY algorithm predicted the draft model was missing glutamine transporter(s) (Fig. 2d). To experimentally identify the V. fischeri genes encoding the missing aspartate 1-decarboxylase, and mannitol 1-phosphate 5-dehydrogenase, as well as transporters for N-acetylneuraminate and glutamine, a specific E. coli recipient strain and selective medium were designed for each missing metabolic function (supplemental Table S2). The iJO1366 metabolic model was used to demonstrate in silico that the growth of each E. coli recipient strain requires the specified missing metabolic function in selective medium (Fig. 3, a-d). Here, recipient E. coli strains were used because their metabolism is well characterized, and knock-out mutant collections are available (29,30). For clarity, we will refer to the well-characterized E. coli genes using their gene symbols and V. fischeri genes using their locus tags. The locus tags of the E. coli genes are listed in supplemental Table S2. A 53-gigabase pair V. fischeri genomic library with ϳ12,000-fold genome coverage was created and transformed into the recipient strains: ⌬panD, ⌬mtlD, ⌬nanT, and ⌬glnP. After growth selection in each recipient strain's selection media, we found that plasmids expressing VF_0892, VF_A0062, and VF_0668 rescued ⌬panD, ⌬mtlD, and ⌬glnP, respectively, and plasmids independently expressing either VF_0924 or VF_1172 rescued ⌬glnP.

Functional complementation in recipient strains
We individually cloned VF_0892, VF_A0062, VF_0668, VF_0924, and VF_1172 into an empty vector because plasmids in the genomic library can contain fragments that encode more than one gene. All transformed single-gene plasmids enabled growth of recipient strains in the corresponding selective medium (Fig. 4, a-d). This result demonstrated that these genes complemented recipient-strain growth and thus are functionally equivalent to the analogous E. coli genes. In the experiments with the glutamine transporter, glutamine was supplied as the sole carbon source. Wild-type E. coli grows poorly on glutamine as a sole carbon source due to low uptake rates (31). Overexpression of either VF_0924 or VF_1172 in the E. coli ⌬glnP resulted in a faster growth rate compared with the wild-type E. coli strain expressing an empty-vector control (Fig. 4d). Growth dependence for each recipient strain in selective medium was calculated using iJO1366 (26). Feasible combinations of growth rate and a missing metabolic enzyme are shown in blue. In all cases, cell growth is not zero only when there is flux through the reaction on the y axis (because no non-trivial solutions exist on the x axis). The flux limits of oxygen and carbon uptake rates were set at 10 mmol/g dry weight (gDW) per h. a, aspartate 1-decarboxylase activity (associated with panD in iJO1366) is coupled to growth of a ⌬panD mutant in glucose minimal medium. b, mannitol 1-phosphate 5-dehydrogenase activity (associated with mtlD in iJO1366) is coupled to growth of a ⌬mtlD mutant in mannitol minimal medium. c, N-acetylneuraminate transport (ACNAMt2pp associated with nanT in iJO1366) is coupled to growth of a ⌬nanT mutant in N-acetylneuraminate minimal medium. d, glutamine transport (GLNabcpp associated with glnHPQ in iJO1366) is coupled to growth in a double knock-out ⌬glnP⌬ansB in glutamine minimal medium. The ansB was not deleted experimentally because it has a low activity with glutamine (66), and ⌬glnP mutants (67,68) were previously shown to be unable to grow on glutamine.

Functional complementation in the organism of interest, V. fischeri
The ⌬VF_0892, ⌬VF_A0062, or ⌬VF_0668 V. fischeri knockout mutants did not grow in minimal medium (similarly to the selective medium, where DMM instead of MOPS minimal medium was used); however, the V. fischeri knock-out mutants were complemented by plasmids expressing wild-type copies of VF_0892, VF_A0062, or VF_0668, respectively (Fig. 5, a-c). In addition, the ⌬VF_0892 mutant could grow in glucose minimal media if supplemented with pantothenate and/or ␤-alanine (supplemental Fig. S1). Better growth of this mutant was observed with addition of 10 mM pantothenate compared with 10 mM ␤-alanine, possibly due to a slower ␤-alanine uptake. However, the transporter(s) in V. fischeri for ␤-alanine and pantothenate are, as yet, unknown. The ⌬VF_0924⌬VF_1172 double mutant still grew in DMM supplemented with glutamine as the sole carbon source, indicating the existence of other V. fischeri glutamine transporters in the genome. We tested whether VF_1172 (annotated as a tyrosine-specific transporter) can also transport leucine, by evaluating growth of E. coli BW25113 with a plasmid overexpressing VF_1172 in minimal medium supplemented with glucose and leucine (supplemental Fig. S2). Overexpression of VF_1172 slowed growth in media supplemented with leucine, a phenotype that can be attributed to leucine toxicity that was previously observed in an E. coli K-12 strain overexpressing branched-chain amino acid transporters (32). Like VF_1172, other amino acid transporters of V. fischeri might have broad substrate specificity.

Detection of ␤-alanine using in vitro enzyme assays
To further confirm the enzymatic function of VF_0892, we purified a His 6 -tagged VF_0892 protein and tested its functionality as an aspartate 1-decarboxylase in vitro. Only the reaction condition that contained both aspartate and the VF_0892 enzyme was able to produce ␤-alanine from aspartate (12.7 Ϯ 0.8 nmol, where reported error is the standard deviation across three biological replicates). In contrast, conditions containing aspartate alone, VF_0892 enzyme alone, aspartate and heatinactivated VF_0892 enzyme, or aspartate and proteins purified from cells containing the empty vector did not produce any detectable ␤-alanine (less than 0.125 nmol). Proteins purified from cells containing the empty vector were used as a control to provide information of contaminant proteins (supplemental Fig. S3a). These in vitro enzyme assays further confirmed that VF_0892 (1644 bp) can catalyze the conversion of aspartate to ␤-alanine and thus is functionally equivalent to panD (381 bp) despite their sequence dissimilarity (i.e. BLAST found no significant similarity between the two proteins).

Kinetic characterization of PanD, VF_0892, and VF_1064 enzymes
VF_0892 is currently annotated in NCBI as a glutamate decarboxylase (EC 4.1.1.15). To compare the activity and substrate specificity of VF_0892 and E. coli PanD toward the two potential substrates (aspartate and glutamate), decarboxylase activities (supplemental Fig. S3b) were evaluated using the DC-PEPC-MDH-linked assays at pH 8.05. Here, the kinetic parameters were determined from three technical replicates and are reported as an average Ϯ S.E. The k cat and K m values of the VF_0892 enzyme were 0.075 Ϯ 0.04 M CO 2 /M enzyme-sec and 1.44 Ϯ 0.35 mM, respectively, at 28°C, and 0.008 Ϯ 0.001 M CO 2 /M enzyme-sec and 1.70 Ϯ 0.56 mM, respectively, at 37°C (Fig. 6a). The VF_0892 enzyme was around 10-fold more active at 28°C compared with 37°C. Precipitation of purified VF_0892 enzyme was observed over time at 37°C. Greater activity at 28°C is consistent with the optimal growth of V. fischeri at 28°C and its intolerance to higher temperatures. E. coli PanD showed a k cat of 0.008 Ϯ 0.001 M CO 2 /M enzyme-sec and K m of 1.44 Ϯ 0.33 mM at 37°C (Fig. 6b). However, it is possible that not all of the purified PanD was post-translationally processed into its active form, resulting in a low k cat . Both PanD and the VF_0892 enzyme showed a much higher reaction rate when aspartate, rather than glutamate, was used as the substrate (PanD, 10-fold; VF_0892, 5-fold at 37°C, and 26-fold at 28°C, Fig. 6c).
In addition to VF_0892, another V. fischeri gene (VF_1064) is currently annotated as a glutamate decarboxylase. Based on BLASTP, VF_1064 has 21% identity to VF_0892 with an E-value of 0.38 but 58% identity and an E-value of 0.0 to the E. coli glutamate decarboxylase gadB. We evaluated experimentally whether VF_1064 decarboxylates glutamate and/or aspartate. In the DC-PEPC-MDH-linked assays, the purified VF_1064 enzyme (supplemental Fig. S3b) showed about a 20-fold higher reaction rate with 10 mM glutamate than with 10 mM aspartate (supplemental Fig. S4a). With glutamate at pH 8.05 and 37°C, VF_1064 exhibited a k cat of 0.055 Ϯ 0.037 M CO 2 /M enzymesec and K m of 61.7 Ϯ 46.4 mM (supplemental Fig. S4b). Because of the solubility of glutamate, we were not able to test the enzyme kinetics of VF_1064 at glutamate concentrations at or greater than its K m value, resulting in large standard errors for k cat and K m . A pH of 8.05 was used to keep CO 2 produced by the decarboxylase primarily as HCO 3 Ϫ , but this pH might not be optimal for VF_1064 because E. coli GadB is most active at pH 3.8 (33). Additional experimental evidence suggests that VF_1064 cannot function as an aspartate 1-decarboxylase; specifically, only VF_0892, and not VF_1064, is essential in LBS medium (28). Also, the single gene plasmid expressing VF_1064 did not complement the ⌬panD E. coli mutant in the selective medium, whereas the plasmid expressing VF_0892 did (supplemental Fig. S5 and Fig. 4a).
Based on the measured substrate preferences for the VF_0892 and VF_1064 enzymes, we conclude that VF_0892 is associated with the aspartate 1-decarboxylase reaction, whereas VF_1064 is only associated with the glutamate decarboxylase reaction in the V. fischeri model.

Growth and VF_A0062 expression in wild-type V. fischeri using sole carbon sources
Using MEGS, we identified VF_A0062, a gene currently annotated as dehydrogenase, as a mannitol 1-phosphate 5-dehydrogenase. Here, we provide more evidence that VF_A0062 is involved in mannitol and not L-sorbose utilization. VF_A0062 has significant similarity to genes annotated as L-sorbose 1-phosphate reductases in other Vibrio species, including Vibrio rotiferianus and Vibrio owensii (BLASTP E-value ϭ 0, amino acid identity Ͼ88%). In addition to the high-throughput growth-phenotyping assays, we confirmed in shake flasks that L-sorbose does not support growth of wildtype V. fischeri as a sole carbon source, although mannitol and glucose do (supplemental Fig. S6). qPCR analysis showed a 6.5fold increase in expression of VF_A0062 in the mannitol minimal medium compared with the glucose minimal medium. (A relative expression ratio of 6.5 Ϯ 2.3 was calculated from three biological replicates. The reported error is the standard deviation). In some other Vibrio species, e.g. Vibrio cholerae and Vibrio metoecus, which do not contain a gene similar to VF_A0062, an ortholog of mtlD is present in their genomes instead. Therefore, Vibrio species apparently contain a mannitol 1-phosphate 5-dehydrogenase that is similar to either mtlD in E. coli or VF_A0062 in V. fischeri.

Squid light-organ colonization phenotypes
We further tested whether V. fischeri knock-out mutants (⌬VF_0892, ⌬VF_A0062, ⌬VF_0668, and ⌬VF_0924⌬VF_ 1172) could successfully colonize the squid light organ in competition with a wild-type strain (we excluded VF_0892 due to its inability to grow in LBS medium without supplementation of ␤-alanine or pantothenate). We observed no significant colonization phenotypes in our knock-out mutants during the initial  stage of colonization (24 h after inoculation) (supplemental Fig.  S7). However, VF_A0062 and VF_0668 might play a role in persistence after colonization because transposon insertions in either VF_A0062 or VF_0668 failed to persist in the squid when competing with a pool containing other transposon library mutants and the wild type (after 48 h of colonization) (28).

Discussion
In this work, we developed MEGS to improve gene annotations by combining metabolic modeling with genomic functional selection. Computational models, built from an existing genome annotation, were used to identify missing or incorrect annotations in the current genome annotation and to design selections (using other organisms) for genes responsible for these missing functions. We successfully identified five genes responsible for four metabolic functions that were missing from our draft V. fischeri metabolic model. Using MEGS, we provided the first experimental evidence that (i) VF_0892 functions as an aspartate 1-decarboxylase; (ii) VF_A0062 functions as a mannitol 1-phosphate 5-dehydrogenase; (iii) VF_0668 functions as an N-acetylneuraminate transporter, and (iv) VF_0924 and VF_1172 function as glutamine transporters. Importantly, none of these genes are orthologous to the E. coli genes with the same functions. These discoveries improved the quality of the V. fischeri genome-scale metabolic model, iVF846, which has been used in studying V. fischeri metabolism and its symbiotic relationship in the squid light organ, especially during the habitat transition between seawater and the symbiotic niche (34).
MEGS leverages both computational and experimental techniques to provide functional annotations for genes encoding enzymes and transporters. Metabolic genes are responsible for the physiological and biochemical states of a cell, and knowledge of their functions is critical for understanding and controlling cell behavior. By taking advantage of metabolic modeling, MEGS identifies errors and omissions in existing genome annotations due to either a lack of experimental evidence or prior knowledge in databases and designs experiments to correct these errors. Because MEGS uses genomic functional selections to find genes instead of sequence similarity or genomic context, it can discover genes with unique sequences and/or genes that have not been studied in the laboratory. MEGS is experimentally and computationally inexpensive and efficient. A draft genome-scale metabolic model can be prepared automatically using available software platforms in only a few hours (17,(35)(36)(37). Metabolic reactions missing from such a draft model or associated with the wrong genes, as well as genomic functional-selection strategies, can also be identified within a few hours (23). The genomic library used in our method (which takes 2 days to construct) contains information from across the entire genome, so that all the genes in the library go through the selection simultaneously. Once a library is created, MEGS cycles can be repeated to search for additional genes associated with different missing reactions. Similarly, once a recipient strain is constructed, it can be used to select for genes responsible for a particular metabolic reaction from multiple genomes (using separate genomic libraries). The time required to build the recipient strains depends on the number of gene additions and deletions needed; however, this time can be significantly reduced by using existing mutant strain collections (29,38,39). Functional selection from the genomic library via growth complementation of the recipient strain in a selective medium is fast (1-3 days) and also provides direct experimental evidence of gene functions. One cycle of the entire MEGS process can be completed within a week using existing recipient strain collections.
MEGS offers an alternative and complementary approach to bioinformatics-based methods for predicting gene functions. In retrospect, some of the genes we identified using MEGS are also top-scoring candidates that have been or could be derived bioinformatically using various genomic context-based methods. VF_0892 was already tentatively suggested to be a member of the pyridoxal-dependent aspartate 1-decarboxylases (TIGR03799) by partial phylogenetic profiling (29). This protein family was given a suggested name of PanP, to distinguish it from a non-orthologous family of aspartate 1-decarboxylase (PanD TIGR00223), which is pyruvoyl-dependent and more widely distributed than PanP. PanP is present in a number of marine bacteria (a list of all proteins that belong to TIGR03799 are listed in supplemental Table S3); however, no direct experimental evidence was available previously to support its annotation as an aspartate 1-decarboxylase. We found that PanP was not properly incorporated in six manually curated genomescale metabolic models. Five of the models included an aspartate 1-decarboxylase reaction without any associated genes (40,41) and one associated PanP with L-cysteate, 3-sulfino-L-alanine, glutamate, and aspartate decarboxylase reactions (42). Similarly, VF_A0062 could have been predicted as a mannitol 1-phosphate 5-dehydrogenase candidate using bioinformatics methods because it is located in the same operon as yggD (VF_A0063), and YggD has been characterized as a mannitol operon repressor in Shigella flexneri (34). VF_0668 is a predicted member of SiaR regulon (controlling sialic acid degradation) according to the RegPrecise database (43), and it was tentatively annotated as a possible sialic acid transporter (permease), NanT. The SiaR regulon in Haemophilus influenzae includes a different tripartite ATP-independent periplasmic transporter (44). However, the function of the V. fischeri NanT has not yet been experimentally characterized.
Similar to operon-and regulon-based bioinformatics methods, MEGS does not depend on sequence similarity to wellcharacterized genes. Annotations based solely on sequence similarity may lack detailed functions when the unknown sequence is not similar to a characterized gene (e.g. conserved hypothetical proteins) or may be incorrect if the gene is similar to a gene that is incorrectly annotated in sequence databases. In the case of VF_A0062, we demonstrated that this gene actually encodes a mannitol 1-phosphate 5-dehydrogenase even though it shares high sequence similarity with other annotated L-sorbose 1-phosphate reductase genes.
MEGS and bioinformatics-based approaches have different strengths and limitations and, as a result, can complement each other. For example, MEGS can complement bioinformaticsbased approaches when top-scoring candidates do not exist. Sometimes correlations between genes with unknown function and genes with known functions may not exist, and such corre-lations can still lead to erroneous and/or nonspecific function assignments using bioinformatics methods alone (8). Another strength of MEGS compared with the bioinformatics-based approach is that genes identified from MEGS already have direct experimental evidence for their functions from the genomic-library selection experiments, while bioinformaticsderived gene functions must be tested in subsequent experiments. MEGS can be applied to organisms for which there are currently no genetic tools, opening up ways to evaluate their gene functions experimentally in another host. Some limitations of MEGS include the following: (i) it requires heterologous expression in the recipient strain, which might not be optimal; (ii) the recipient strains might be difficult to construct (e.g. essential genes cannot be deleted unless growth can be complemented by nutritional supplementation); and (iii) if multiple genes are responsible for a missing metabolic function, they must be co-localized on the chromosome. Additionally, MEGS might identify genes with low promiscuous activities that, when overexpressed, can complement growth defects associated with essential metabolic functions. Recent approaches for improving heterologous gene expression (45) and conditional mutation systems (46) are likely to help overcome some of these limitations. Ultimately, more detailed biochemical characterization of enzymes identified using bioinformatics or MEGS might be needed to confirm the physiological functions of gene products. However, these computational and experimental approaches are useful for identifying which genes to evaluate biochemically.
As more genome sequences become available, we speculate that MEGS will be successfully applied to further discover the roles of uncharacterized genes and improve our understanding of metabolism in a variety of both familiar and uncharacterized microorganisms. Newly discovered gene functions will propagate through genome databases as they are used by other existing approaches to improve annotations of genes in additional organisms.

In silico modeling
A newly constructed genome-scale metabolic model of V. fischeri strain ES114, iVF846, was used in this work (supplemental Excel file S1). Reactions and metabolites from an E. coli model, iJO1366 (26), were transferred into the draft model of iVF846 when orthologs to E. coli metabolic genes were found in V. fischeri. By our definition, the orthologous genes shared the KEGG ortholog identifier and were best reciprocal hits in the KEGG Sequence Similarity Database. The model contains 846 genes and 1583 reactions. FBA (27) was used to calculate the growth rate of V. fischeri by maximizing flux through a defined biomass objective function. In FBA, the minimal or rich medium was simulated by giving negative values to the lower limits for the exchange fluxes of the medium components in the model (supplemental Table S4 lists negative exchanges used for different simulations). A carbon source was predicted to be a sole carbon source if the FBA-predicted growth rate was greater than zero in minimal medium supplemented with this carbon source. To simulate a knock-out strain, the fluxes of metabolic reactions associated with the gene were fixed at zero, unless isozymes were present. A gene was predicted as essential if the FBA-predicted growth rate of the strain containing a single gene deletion was zero. A modified version of the mixed integer linear programming algorithm, SMILEY (14), was used to predict missing metabolic genes (and their associated reactions) when the model incorrectly predicted wild-type or mutant V. fischeri strains cannot grow in a particular medium. The algorithm was modified from its original implementation by minimizing the total number of metabolic genes (instead of reactions) that need to be added from iJO1366 (26) (instead of KEGG) to the V. fischeri model to enable growth and reconcile false predictions.

Strain construction
Wild-type E. coli BW25113 and wild-type V. fischeri ES114 were used in this work. E. coli knock-out strains (derived from BW25113) containing kanamycin resistance genes were obtained from the Keio collection (Open Biosystems) (29,30). The temperature-sensitive plasmid pCP20 was used to remove the kan gene from the mutants as described previously (47). The resulting kanamycin-sensitive E. coli knockout strains were used as recipient strains for the V. fischeri genomic library. Knock-out mutants of V. fischeri ES114 were constructed using conjugation and homologous recombination as described previously (48 -51). To construct ⌬VF_0892, 10 mM pantothenate and 10 mM ␤-alanine were supplemented in the LBS growth medium. All strains used in this study are reported in supplemental Table S5.

Plasmid construction for single-gene complementation
For E. coli complementation experiments, a single V. fischeri gene was cloned into the multiple cloning site of pZE21MCS (EXPRESSYS) using Gibson cloning. The construct was transformed into the corresponding E. coli knock-out strains, and colonies were selected on LB agar containing 50 g of kanamycin per ml. For V. fischeri complementation experiments, a single V. fischeri gene was cloned into the multiple cloning site of pVSV105 (52) using Gibson cloning. These plasmids containing a V. fischeri gene were introduced into V. fischeri ES114 knock-out strains by conjugation. All plasmids used in this work are listed in supplemental Table S5.

Growth conditions and complementation experiments
Unless otherwise noted, E. coli strains were grown at 37°C, and V. fischeri strains were grown at 28°C, both with shaking at 225 rpm. E. coli strains were grown in Luria-Bertani (LB) or a MOPS-buffered minimal medium (53). Because E. coli grows poorly on glutamine as a sole carbon source, vitamin supplements (0.05 mM thiamine, 0.05 mM niacinamide, and 20 nM biotin) were added to the minimal medium to shorten the experiments in which the glutamine transporter is complemented. V. fischeri strains were grown in Luria-Bertani salt (LBS) (24) or V. fischeri DMM (supplemental Table S6). Overnight cultures of ⌬VF_0892 were grown in LBS supplemented with 10 mM pantothenate and 10 mM ␤-alanine. When appropriate, 50 g of kanamycin or 5 g of chloramphenicol per ml was added to the media. For complementation experiments, an overnight LB (or LBS) culture of each strain was subcultured by ϳ1:100 dilution into fresh minimal medium with a starting absorbance of 0.02 at 600 nm (A 600 ). The A 600 of the culture in a 96-well plate was measured by an Infinite M200 plate reader (Tecan) every 15 min with 3-mm orbital shaking. In complementation experiments with a VF_0892 plasmid in ⌬panD or ⌬VF_0892, an overnight LB (or LBS) culture of each strain was washed twice in minimal medium and subcultured by 1:100 dilution into fresh minimal medium. Once the subculture grew to about mid-exponential phase, it was washed twice and subcultured again into fresh minimal medium for the growth curve measurements. All experiments testing a VF_A0062 plasmid were performed at 28°C, because pZEVFA0062 did not complement ⌬mtlD well when grown at 37°C.

Growth-phenotyping experiments
To test sole carbon sources of V. fischeri, V. fischeri strain ES114 was grown in inoculating fluid supplemented with 255 mM sodium chloride, on PM1 or PM2A plates containing single carbon sources (Biolog) according to the manufacturer's protocol. The plates were incubated at 28°C in an OmniLog incubator reader (Biolog), and turbidity was measured every 15 min for 48 h. The increase in turbidity was compared with a negative control with no added carbon source to determine whether cells grew. Additional experiments were performed in 17 ϫ 100-mm test tubes to confirm growth of strains on a carbon source of interest or to detect strains with an intermediate increase in turbidity.

Genomic library construction
The genomic DNA of V. fischeri ES114 strain was extracted from LBS culture using the DNeasy Blood and Tissue kit (Qiagen). The extracted DNA was then fragmented at 10% amplitude for 5 s using the Sonic Dismembrator Model 500 (Thermo Fisher Scientific). DNA fragments between ϳ2 and ϳ5 kbp were size-selected from a 1% agarose gel in Tris acetate/EDTA buffer and purified with the Zymoclean Gel DNA Recovery kit (Zymo). The DNA fragments were ligated into the HinCII site of the multiple cloning site of the pZE21MCS1 vector and transformed into 50 l of E. coli MegaX suspension (Invitrogen) following the protocol of Forsberg et al. (22). The average insert size of the library was ϳ2.3 kilobase pairs, and the titer of the library was 22 million colony forming units (CFUs). Transformed cells were transferred to 10 ml of LB containing 50 g of kanamycin per ml and grown overnight. The overnight culture was used to extract the library of plasmids using the QIAprep Spin Miniprep kit (Qiagen).

Gene selection from a V. fischeri genomic library
The V. fischeri genomic library was transformed into competent cells of an E. coli recipient strain and recovered in 1 ml of Super Optimal broth with Catabolite repression (SOC) medium for an hour. The recovered cells were then pelleted at 6000 rpm for 3 min and transferred to 50 ml of selective medium (listed in supplemental Table S2) with 50 g of kanamycin per ml to grow at 37°C. After the A 600 reached 1, the cells were subcultured into 50 ml of fresh selection medium. After the A 600 of the cells reached 1 again, they were plated on LB plus kanamycin plates. Single colonies were picked to confirm growth in selective medium, and the first and last 700 bp of the plasmid inserts were subsequently sequenced to identify the V. fischeri gene(s) included in the plasmid.

Expression and purification of decarboxylases
E. coli panD (b0131), V. fischeri panP (VF_0892), and V. fischeri gadA (VF_1064) were amplified from the genomic DNA using Phusion High-fidelity DNA Polymerase (New England Biolabs). The PCR fragments containing VF_0892 were digested with NcoI and XhoI and cloned into pET28 (Novagen). The resulting plasmid pETVF0892 was transformed into E. coli X90(DE3) competent cells (54). The PCR fragments containing b0131 or VF_1064 were inserted into the pET28 vector with an N-terminal His-tag sequence followed by a tobacco etch virus site. The resulting plasmids pETb0131 and pETVF1064 were transformed into E. coli BL21 (DE3) competent cells. For expression, an inoculum was started in LB medium plus 50 g of kanamycin per ml from an overnight culture. The cells were grown at 28°C until reaching an A 600 of 0.6. Then cells were induced by 1 mM isopropyl ␤-D-1-thiogalactopyranoside overnight at 18°C. The cell lysate was extracted from the collected cells by BugBuster Master Mix (EMD Millipore) following the manufacturer's protocol. The lysate was incubated with HisPur nickel-nitrilotriacetic acid resin (Thermo Fisher Scientific) for 1 h at 4°C and passed through a Pierce Centrifuge Column (Thermo Fisher Scientific). The column was washed in 50 mM NaH 2 PO 4 , 300 mM NaCl, and 20 mM imidazole (pH 8), and the VF_0892 protein was eluted in 50 mM NaH 2 PO 4 , 300 mM NaCl, and 250 mM imidazole (pH 8). The eluted protein was dialyzed against 100 mM Tris-HCl (pH 7.5) and concentrated with an Amicon Ultra Centrifugal Filter Unit (EMD Millipore). The final products were analyzed with SDS-polyacrylamide gels, and their concentrations were determined with a bicinchoninic acid kit (Sigma). The eluted protein was dialyzed in 20 mM Tris-HCl to reduce the amount of solutes that went through the GC-MS column. E. coli X90(DE3) containing the empty pET28b(ϩ) vector was subjected to the same expression and purification procedures, and the product was used as a control in detecting V. fischeri PanP activity using GC-MS.

Detection of ␤-alanine from in vitro enzyme assays
The V. fischeri PanP activity was detected by measuring the formation of ␤-alanine in five different conditions, each with a 200-l reaction volume. All reaction mixtures contained 20 mM Tris-HCl, 5 mM MgSO 4 , and 750 M pyridoxal 5Ј-phosphate (pH 7.5). In addition to the buffer components, the first condition also contained 36 g of purified V. fischeri PanP and 50 nmol of aspartate as the substrate. The second condition contained 36 g of purified V. fischeri PanP but no substrate. The third condition contained 50 nmol of aspartate but no protein.
The fourth condition contained 36 g of purified V. fischeri PanP that had been heat-inactivated at 90°C for 30 min and 50 nmol of aspartate. The last condition contained 20 g of protein obtained from the empty vector lysate and 50 nmol of aspartate. The reaction mixtures were incubated at 37°C for 15 h and then stopped by heat inactivation at 90°C for 30 min. The second and the fourth conditions were tested in biological duplicates, and the other conditions were tested in biological triplicates. The heat-inactivated reaction mixtures were spun down at 15,000 rpm for 3 min, and their supernatants were taken for quantitation. A uniformly labeled, ␤-[U-13 C]alanine internal standard (Cambridge Isotope Laboratories) was used for quantification via an isotope-ratio method using GC-MS (55). Prior to sample quantification, a suitable ␤-alanine fragment formula was identified. This process includes the following: (i) analyzing the mass spectrum of an unlabeled ␤-alanine sample to propose a feasible structure for the molecular ion of interest; (ii) comparing the theoretical and measured isotopic distributions of said structure; and (iii) verifying that the number of carbon backbone atoms present on the derivatized structure matches the base peak mass shift predicted to be seen in the ␤-[U-13 C]alanine mass spectrum relative to the unlabeled spectrum (56). Once a fragment for quantification was identified, the purity (i.e. extent of labeling) of labeled ␤-alanine was measured and subsequently used to correct for the unlabeled portion of the labeled standard in downstream calculations. Following this step, aliquots of labeled internal standard were quantified using a known amount of unlabeled standard and calculating the ratio of 12 C/ 13 C after correcting both for natural abundances of other isotopes present in the derivatized fragment, and for the unlabeled portion of the labeled standard, using the freely available software, IsoCor (57). Once characterized, known amounts of labeled standard were mixed with the supernatant of the enzymatic assay and dried at 90°C. This step was followed by derivatization in a volumetric 1:1 ratio of pyridine with N-tert-butyl-dimethylsilyl-N-methyltrifluoroacetamide plus 1% tert-butyl-dimethylchlorosilane at 90°C for 30 min to confer thermal stability and increased volatility amenable for analysis on the GC-MS instrument. Derivatized samples were centrifuged at 15,000 rpm for 3 min to sediment insoluble material, producing a cleaner supernatant for injection onto the GC-MS. Samples were run on a single quadrupole GC-MS QP2010S (Shimadzu) in electron ionization mode with an Rtx of 5 ms (Restek) low-bleed, fused-silica column for separation with helium as a carrier gas operating under a linear velocity control mode with a split ratio of 0.50, and a column flow of 1.50 ml/min. The temperature program for separation of ␤-alanine began with holding the column oven temperature at 35°C for 10 min, ramping up at 25°C/min to 300°C, and holding for 19.4 min. Operational parameters included an injection temperature of 240°C, ion source temperature of 260°C, interface temperature of 240°C, and a mass scan range of 100 -450 m/z. To test the detection limit of our ␤-alanine quantification method, the method was used to measure a known amount of unlabeled ␤-alanine. Each sample (n ϭ 3) contained 0.125, 0.25, 0.5, 1, or 2 nmol of unlabeled ␤-alanine. The method was able to detect the presence of unlabeled ␤-alanine in all these samples. Therefore, the detection limit of the method is below 0.125 nmol.

DC-PEPC-MDH-linked assays
Decarboxylase (DC), phosphoenolpyruvate carboxylase (PEPC), and malate dehydrogenase (MDH)-linked assays were performed by mixing 80 l of freshly prepared mixture A and 120 l of mixture B. Mixture A contained 100 mM Tris-HCl, 10 mM MgSO 4, 1 mM pyridoxal 5Ј-phosphate (PLP), and various concentrations of aspartate and glutamate (pH 8.05). Mixture B contained purified decarboxylase and Infinity Carbon Dioxide Liquid Stable Reagent (Infinity, Thermo Fisher Scientific). The final decarboxylase concentrations used were 54 M PanD, 35 M VF_1064 enzyme, 28 M V. fischeri PanP for assays at 37°C, and 14 M V. fischeri PanP for assays at 28°C. Like similar DC-PEPC-MDH-linked assays (58 -62), our assays were carried out in air. The interference of exogenous CO 2 from air and buffer solution was accounted for using a negative control, in which no substrate was added. The signal produced by the negative control was linear during the measurement period and was subtracted from all signals produced by other samples containing substrates. Values of K m and k cat were determined from nonlinear fitting into the Michaelis-Menten equation using Kaleida-Graph (Synergy Software). Averages across the three technical replicates were used as data points, whereas standard deviations for each data point were used as weights during the curve fitting. The K m and k cat values are reported with standard errors.

qPCR analysis
Wild-type V. fischeri cells were harvested at an A 600 of ϳ0.6 in DMM supplemented with glucose or mannitol as sole carbon source. Total RNA from the cell pellets was harvested using the Quick-RNA MicroPrep Kit (Zymo Research) with a 15-min oncolumn DNase treatment. RNA concentration and quality were assessed on a NanoDrop (Thermo Fisher Scientific) spectrophotometer. Reverse transcription for synthesizing cDNA and real-time qPCR were performed using the GoTaq Two-step RT-qPCR System (Promega) and an AriaMx real-time PCR machine (Agilent). The differential expression of VF_A0062 (forward primer, TGGATATTCCGGGTGGTAAA, and reverse primer, ACGGGTCTTGTTCTGCAAGT) in the mannitol minimal medium and the glucose minimal medium was normalized to the control gene (V. fischeri polA, VF_0074, forward primer, CGACAGCAGCAGAAGTGAAG, and reverse primer, AGCAAGACCAAACGCACTC) using the GED formula (63).

Growth inhibition test with leucine
An overnight LB culture of each E. coli strain was washed twice in minimal medium and subcultured by 1:100 dilution into fresh MOPS minimal medium with 20 mM glucose. Once the subculture grew to about mid-exponential phase, it was washed twice again and subcultured into fresh minimal medium with or without 4 mM leucine, and its A 600 was measured on a 96-well plate by an Infinite M200 plate reader (Tecan).

Squid colonization competitions
Freshly hatched juvenile squid were collected from the rearing facility at the University of Hawaii and placed in filter-sterilized seawater. Squid were exposed to an ϳ1:1 mixed population of two strains consisting of V. fischeri ES114 carrying a chromosomally inserted erythromycin-GFP marker (64) and the indicated mutant strain, at a total of 3000 -5000 CFU/ml. Squid were incubated with bacteria for 3 h, then transferred to individual vials of V. fischeri-free seawater for an additional 18 -21 h. Colonized squid were subsequently anesthetized on ice and placed at Ϫ80°C for surface sterilization. Individual squid were then homogenized and plated on LBS and LBS ϩ erythromycin agar plates as described previously (65), and the ratio of strains present in the light organ was determined by counting the unmarked and erythromycin marked colonies. The relative competitive index (RCI) of the co-colonizing strains was calculated as follows: RCI ϭ log(CFU mutant/CFU wild type)/(inoculum CFU mutant/inoculum CFU wild type).