Global Gene Expression Profiling in Escherichia coliK12

Leucine-responsive regulatory protein (Lrp) is a global regulatory protein that affects the expression of multiple genes and operons in bacteria. Although the physiological purpose of Lrp-mediated gene regulation remains unclear, it has been suggested that it functions to coordinate cellular metabolism with the nutritional state of the environment. The results of gene expression profiles between otherwise isogenic lrp + andlrp − strains of Escherichia colisupport this suggestion. The newly discovered Lrp-regulated genes reported here are involved either in small molecule or macromolecule synthesis or degradation, or in small molecule transport and environmental stress responses. Although many of these regulatory effects are direct, others are indirect consequences of Lrp-mediated changes in the expression levels of other global regulatory proteins. Because computational methods to analyze and interpret high dimensional DNA microarray data are still an early stage, much of the emphasis of this work is directed toward the development of methods to identify differentially expressed genes with a high level of confidence. In particular, we describe a Bayesian statistical framework for a posterior estimate of the standard deviation of gene measurements based on a limited number of replications. We also describe an algorithm to compute a posterior estimate of differential expression for each gene based on the experiment-wide global false positive and false negative level for a DNA microarray data set. This allows the experimenter to compute posterior probabilities of differential expression for each individual differential gene expression measurement.

During the last 50 years, a great deal of knowledge about the regulation of gene expression in Escherichia coli has been obtained. We now know that the expression of genetic information is regulated at three hierarchical levels: global control of basal level gene expression by chromosome structure, control of regulons and stimulons by global regulatory proteins, and operon-specific controls (1,2). At the most general level, the expression of all genes is regulated by DNA supercoiling-dependent mechanisms that affect the topology of the entire chromosome (3). At the next level, large groups of genes are regulated by abundant regulatory proteins with rather degenerate binding site specificity that, in cooperation with operon-specific controls, regulate often-overlapping groups of metabolically related operons, called regulons and stimulons, in response to environmental or metabolic signals. At the most basic level, individual genes or operons are regulated by less abundant proteins that bind in a site-specific manner to one or a few sites to regulate single genes or operons. Isolated examples of each level of control have been described. However, the definition of these hierarchical control levels in a depth sufficient to understand genetic regulatory networks on a global scale, all the way from specific circuits up to the complete regulatory network of the cell, remains to be elucidated (4). Before we can infer and model these regulatory networks, individual components at each hierarchical level must be identified. In other words, a more complete definition of the genes of specific regulons and stimulons must be obtained. It is now possible to obtain much of this information using high-throughput technologies such as DNA microarrays.
The purpose of the work presented here is to identify the network of genes that are differentially regulated by the global E. coli regulatory protein, leucine-responsive regulatory protein (Lrp), 1 during steady state growth in a glucose supplemented minimal salts medium. Lrp is a DNA-binding protein that has been reported to affect the expression of approximately 55 genes. 2 In most cases, Lrp has been reported to activate operons that encode genes for biosynthetic enzymes and repress operons that encode genes for catabolic enzymes (5,6). The intermediary metabolite, L-leucine, is required for the binding of Lrp at some of its DNA target sites; however, at other sites L-leucine inhibits DNA binding, and at still other sites it exerts no effect at all. Although the physiological purpose of Lrp-mediated gene regulation remains unclear, it has 1 The abbreviations used are: Lrp, leucine-responsive regulatory protein; ATP␥S, adenosine 5Ј-O-(thiotriphosphate); MMLV, Moloney murine leukemia virus; BSA, bovine serum albumin; PPDE, posterior probability for differential expression; MOPS, 4-morpholinepropanesulfonic acid; MES, 4-morpholineethanesulfonic acid; dH 2 O, distilled water; ORF, open reading frame; AD, average difference; PM, perfect match; MM, mismatch. 2 Although the expression levels of approximately 55 genes have been reported to be affected by Lrp, these observations were obtained under a wide variety of environmental and nutritional growth conditions; thus, the expression of some of these genes might not be affected by the presence or absence of Lrp under the growth conditions employed in this study.
Experiments-For random hexamer-primed cDNA synthesis, 20 g of total RNA and 37.5 ng of random hexamer primers were heated at 70°C for 3 min and quickly cooled on ice. cDNA synthesis was performed at 42°C for 3 h in a 60-l reaction mixture containing: RNA and primer mixture; reverse transcriptase buffer (Roche); 1 mM amounts each of dATP, dGTP, and dTTP; 50 Ci of [␣- 33 P]dCTP; 20 units of ribonuclease inhibitor III; and 4 l (88 units) of avian myeloblastosis virus reverse transcriptase. Labeled cDNA targets were separated from unincorporated nucleotides on Sephadex G-25 Quickspin columns.
mRNA Enrichment and Target Labeling Conditions for the Affymetrix GeneChip TM Experiments-To enrich the proportion of mRNA in the total RNA preparation, 300 g of total RNA from IH-G2490 (lrp ϩ ) or IH-G2491 (lrp Ϫ ) was prepared as described above. Each 300-g total RNA preparation was split into 12 aliquots to increase the efficiency of the enrichment procedure. All reactions were performed in PCR tubes in a thermocycler. For each reaction, 25 g of total RNA were mixed with 70 pmol of a rRNA-specific primer mix in a final volume of 30 l. Each specific primer mix included three specific primers for 16 S rRNA (5Ј-CCTACGGTTACCTTGTT-3Ј, 5Ј-TTAACCTTGCGGCCGTACTC-3Ј, and 5Ј-TCCGATTAACGCTTGCACCC-3Ј) and five specific primers for 23 S rRNA (5Ј-cctcacggttcattagt-3Ј, 5Ј-CTATAGTAAAGGTTCACGGG-3Ј, 5Ј-TCGTCATCACGCCTCAGCCT-3Ј, 5Ј-TCCCACATCGTTTCCCA-C-3Ј, and 5Ј-CATGGAAAACATATTACC-3Ј). This mixture was heated to 70°C for 5 min and quickly cooled to 4°C. 10 l of 10ϫ MMLV reverse transcriptase buffer (0.5 M Tris-HCl (pH 8.3), 0.1 M MgCl 2 , and 0.75 M KCl), 5 l of 10 mM dithiothreitol, 2 l of 25 mM dNTPs mix, 3.5 l of 20 units/l SuperRNasin, 6 l of 50 units/l MMLV reverse transcriptase, and water were added to each tube to a final volume of 100 l. The reactions were incubated at 42°C for 25 min, and incubation was continued at 45°C for 20 min for cDNA synthesis. To remove the rRNA moiety from the rRNA/cDNA hybrid, 5 l of 10 units/l RNase H was added and the mixture was incubated at 37°C for 45 min. RNase H was inactivated by heating at 65°C for 5 min. Newly synthesized cDNA was removed by incubation with 4 l of 2 units/l DNase I and 1.2 l of 20 units/l SuperRNasin at 37°C for 2 h. Four reactions were combined for RNA cleanup with a single Qiagen RNeasy mini column. The quantity of enriched mRNA was measured by absorbance at 260 nm. A typical yield is 10 -20 g of RNA from 300 g of total RNA constituting a 10 -20-fold enrichment of mRNA to rRNA.
For the RNA fragmentation, a maximum of 20 g of RNA was added to a PCR tube containing 10 l of 10ϫ NEB buffer for T4 polynucleotide kinase in a final volume of 88 l. The tube was incubated at 95°C for 30 min and cooled to 4°C.
For the RNA 5Ј-thiolation and biotin-labeling reaction, 2 l of 5 mM ATP␥S and 10 l of 10 units/l T4 polynucleotide kinase were incubated with the fragmented RNA at 37°C for 50 min. The reaction was inactivated by heating to 65°C for 10 min and cooled to 4°C. Excess ATP␥S was removed by ethanol precipitation. Fragmented thiolated RNA was collected by centrifugation in the presence of glycogen (0.25 g/l) and resuspended in 90 l of distilled water (dH 2 O). 6 l of 500 mM MOPS (pH 7.5) and 4.0 l of 50 mM polyethylene oxide-iodoacetylbiotin were added to the fragmented thiolated RNA and incubated at 37°C for 1 h. The biotin-labeled RNA was isolated by ethanol precipitation, washed twice with 70% ethanol, and dried and dissolved in 20 -30 l of molecular biology grade water. The quantity of the biotinlabeled RNA was measured by absorbance at 260 nm. The total yield for the entire procedure is typically 2-4 g of biotin-labeled RNA from 300 g of total RNA. The efficiency of RNA fragmentation and biotin labeling can be monitored with a gel shift assay where the biotin-labeled RNA is pre-incubated with avidin prior to electrophoresis. Biotin-labeled RNA is retarded during electrophoresis because of the avidinbiotin interaction. The position of the RNA in the gel addresses the fragmentation efficiency. The amount of shifted RNA indicates the efficiency of the biotin labeling. Inefficiencies in either of these parameters should be addressed before proceeding to the hybridization step.
Hybridization to Nylon Filters-The nylon filters were soaked in 2ϫ SSPE (20ϫ SSPE contains 3 M NaCl, 0.2 M NaH 2 PO 4 , and 25 mM EDTA) for 10 min and prehybridized in 10 ml of prehybridization solution (5ϫ SSPE, 2% SDS), 1ϫ Denhardt's solution (50ϫ Denhardt's solution contains 5 g of Ficoll, 5 g of polyvinylpyrrolidone, 5 g of bovine serum albumin, and H 2 O to 500 ml), and 0.1 mg/ml sheared herring sperm DNA) for at least 1 h at 65°C. 5-7 ϫ 10 7 cpm of cDNA targets in 500 l of prehybridization solution were heated at 95°C for 10 min, rapidly cooled on ice, and added to 5.5 ml of prehybridization solution. The prehybridization solution was removed and replaced with the hybridization solution. Hybridization was carried out for 15-18 h at 65°C. Following hybridization each filter was rinsed with 50 ml of 0.5ϫ SSPE containing 0.2% SDS at room temperature for 3 min, followed by three washes in the same wash solution at 65°C for 20 min each. The filters were partially air dried, wrapped in Saran Wrap, and exposed to a phosphor screen for 15-30 h. Filters were stripped by microwaving at 30% maximal power (1400 watts) in 500 ml of 10 mM Tris solution (pH 8.0) containing 1 mM EDTA and 1% SDS for 20 min. Stripped filters were wrapped in Saran Wrap and stored in the presence of damp paper towels in sealed plastic bags at 4°C.
Hybridization to Affymetrix GeneChips-For hybridization of biotinylated RNA targets to the Affymetrix GeneChips, 2-4 g of fragmented biotin-labeled RNA of IH-G2490 (lrp ϩ ) and IH-G2491 (lrp Ϫ ) were used for each GeneChip. The hybridization solution for each array was prepared with 100 l of 2ϫ MES hybridization buffer (200 mM MES, 2 M NaCl, 40 mM EDTA, and 0.02% Tween 20), 1 l of 100 nM Biotin-Oligo 948 (5Ј-biotin-GTCAAGATGCTACCGTTCAG-3Ј), 2 l of 10 mg/ml herring sperm DNA, 2 l of 50 mg/ml BSA, and 2-4 g of fragmented biotin-labeled RNA and brought final volume to 200 l with molecular biology grade water. The GeneChip arrays were equilibrated to room temperature immediately before use. The hybridization solution prepared above was added to each GeneChip and incubated in a GeneChip hybridization oven (Affymetrix) at 45°C for 16 h at a rotation rate of 60 rpm.
Following hybridization, the stain and wash procedures were carried out in an Affymetrix GeneChip Fluidics Station 400 using the ProKGE-WS2 fluidics script to run the machine. Streptavidin solution mix (300 l of 2ϫ MES stain buffer, 24 l of 50 mg/ml BSA, 6 l of 1 mg/ml streptavidin, and 270 l of dH 2 O), antibody solution (300 l 2ϫ MES stain buffer, 24 l of 50 mg/ml BSA, 6 l of 10 mg/ml normal goat IgG, 6 l of 0.5 mg/ml biotin anti-streptavidin, and 264 l of dH 2 O) and SAPE solution (300 l of 2ϫ MES stain buffer, 24 l of 50 mg/ml BSA, 6 l of 1 mg/ml streptavidin-phycoerythrin, and 270 l of dH 2 O) were prepared in amber tubes for the staining of each probe array. After hybridization, the hybridization solution was removed and kept at 4°C. Each GeneChip TM was filled with 300 l of nonstringent wash buffer (6ϫ SSPE, 0.01% Tween 20, 0.005% Antifoam). The GeneChips were inserted into the fluidics station, and the ProKGE-WS2 protocol was selected to control the staining and washing of the probe arrays. After the procedure was complete, the GeneChips were removed from the fluidics station and checked for large bubbles or air pockets before scanning. The buffer in the GeneChips were drained and refilled with nonstringent buffer if bubbles were present.
Experimental Design for Nylon Filter DNA Array Experiments-The experimental design for the nylon filter DNA array experiments reported here is diagrammed in Fig. 1. In experiment 1, filters 1 and 2 were hybridized with 33 P-labeled, random hexamer-generated cDNA targets complementary to each of three RNA preparations (RNA 1-3) obtained from the cells of three individual cultures of the lrp ϩ strain (IH-G2490). These three 33 P-labeled cDNA target preparations were pooled prior to hybridization to the full-length ORF probes on the filters (experiment 1). Following PhosphorImager analysis, these filters were stripped and again hybridized with pooled, 33 P-labeled cDNA targets complementary to each of another three independently prepared RNA preparations (RNA 1-3) from the lrp Ϫ (IH-G2491) (experiment 1). This procedure was repeated two more times with filters 1 and 2 using two more independently prepared pools of cDNA targets (experiment 2, RNA 4 -6). Another set of filters, filters 3 and 4, were used for experiments 3 and 4 as described for experiments 1 and 2. This protocol results in duplicate filter data for four experiments performed with cDNA targets complementary to four independent prepared sets of pooled RNA. Thus, because each filter contains duplicate spots for each ORF and duplicate filters were used for each experiment, four measurements for each ORF from each experiment were obtained. These four measurements for each experiment were averaged for further statistical analysis.
Experimental Design for Affymetrix GeneChip Experiments-The experimental design for the Affymetrix GeneChip experiments reported here is diagrammed in Fig. 2. The same 24 total RNA preparations used for the nylon filter experiments were pooled into sets of 3 and used for the preparation of biotin-labeled RNA targets for hybridization to Affymetrix GeneChips. For experiments 1-4, four GeneChips were hybridized with biotin-labeled RNA pools 1-3, 4 -6, 7-9, and 10 -12 prepared from lrp ϩ cells, and four GeneChips were hybridized with biotin-labeled RNA pools 1-3, 4 -6, 7-9, and 10 -12 prepared from lrp Ϫ cells, respectively. One average difference measurement for each gene probe set on each GeneChip was obtained for subsequent data processing and analysis.
Data Acquisition from the Nylon Filter DNA Array-A commercial software package obtained from Research Imaging Inc. (DNA ArrayVision) was used to grid the 16-bit image file obtained from the Phosphor-Imager, to record the pixel density of each of the 18,432 addresses on each filter, and to perform the background subtractions. 8,580 of the addresses on each filter are spotted with duplicate copies of each of the 4,290 E. coli ORFs. The remaining 9,852 empty addresses were used for background measurements. Because the backgrounds were quite constant, a global average background measurement was subtracted from each experimental measurement, although local background calculations are possible. Greater than 4 logs of linearity for the Phosphor-Imager-derived data were observed.
Data Acquisition of Affymetrix GeneChips-Each GeneChip array was scanned twice with an HP GeneArray confocal laser at a 3 M resolution, and the intensities at each perfect match (PM) and mismatch (MM) probe cell from both scans were averaged and saved as a *.DAT file. The average intensity of each GeneChip was globally scaled to 2500 and saved as a *.CEL. These probe pair measurements for each probe set were used for subsequent data processing and statistical analysis.
Model-based Oligonucleotide GeneChip Analysis-Gene expression values from Affymetrix GeneChips are based on the average difference (AD) between hybridization signals of PM and MM oligonucleotide probe sets for each gene as described in the expression analysis tech- nical manual from Affymetrix. The AD value of each probe set is calculated as AD ϭ ⌺(PM-MM)/number of probe pairs. Algorithms incorporated into Affymetrix software remove probe pairs that are out of a given range when calculating AD values for each probe set. In this process, the mean and standard deviation are calculated for intensity differences (PM Ϫ MM) across the entire probe set (excluding the highest and lowest values), and values within a set number of standard deviations (3 as default) are not included in the calculation. The advantage is that this process minimizes the variance introduced by experimental or biological error by removing the outliers present in each probe set. The disadvantage is this that this process does not always remove the same probe pairs for the calculation of the AD values among GeneChips. This can lead to the misinterpretation of the gene expression profiles obtained from GeneChip experiments. To alleviate this problem, a model-based method incorporated into a program called dChip has been described by Li and Wang (11). This method maintains constant probe pair set identities across all GeneChips while excluding outliers because of cross-hybridization, contamination during hybridization, or manufacturing defects that affect probe set measurements. For all of the GeneChip experiments reported here, each probe pair set from the *.CEL files was modeled by the dChip software prior to statistical analysis. Statistical Methods-As described above, the experimental design employed in this study consists of 33 P-labeled cDNA target preparations for each of two genotypes hybridized to nylon filters, or three biotinylated mRNA target preparations hybridized to Affymetrix GeneChips. The designs for these experiments are depicted in Figs. 1 and 2. For each measurement, a background subtracted estimate of expression level for each gene was obtained and scaled to total counts by dividing each individual gene expression value by the total of all values on the filter or GeneChip. Thus, each normalized gene level is expressed as a fraction of the total mRNA hybridized to each DNA array.
For any given measurement, a value greater than zero (indicating an expression level) or a zero (indicating an expression level lower than background) is obtained. Only those genes exhibiting an expression level greater than zero in all experiments were used for statistical analysis. Gene measurements containing zero expression values were set aside. Among this set of genes, those with zero expression values for all measurements in one genotype, and all values greater than zero for all measurements of another genotype for each experiment were identified. The significance of these results was analyzed by ranking these genes in ascending order according to their coefficients of variance of the four greater than zero measurements. The remaining genes were analyzed both by a simple t test and a regularized t test based on a Bayesian statistical framework described under "Results and Discussion." Data Accession-All of the raw and processed data for the experimental results reported here are available in tabular format as supplemental data in the on-line version of this article.

An ad Hoc Method for the Estimation of Global False Positive
Levels-To interpret the results of a high dimensional DNA array experiment, it is necessary to determine the global false positive level inherent in the data set being analyzed. The global false positive level reflects all sources of experimental and biological variation inherent in a DNA array experiment. The basic idea is to infer the false positive level in the control versus treatment situation from the false positive level observed with the control versus control (and/or treatment versus, treatment) comparison. With this information, a global level of confidence can be calculated for differentially expressed genes measured at any given statistical significance level. For example, consider an experiment comparing the gene expression profiles of two genotypes, where an average of 10 genes are observed to be differentially expressed with a p value less than 0.0001 when gene expression profiles from one genotype are compared with data of the same genotype (e.g. lrp ϩ versus lrp ϩ or lrp Ϫ versus lrp Ϫ ). Because no differential expression is expected in these comparisons, these 10 genes are clearly false positives generated by chance occurrences driven by experimental errors and biological variance. Now, if 100 genes are differentially expressed with a p value less than 0.0001 when the data from one genotype (lrp ϩ ) are compared with the data from the other genotype (lrp Ϫ ), it is reasonable to infer that we can be only 90% confident that the differential expression of any one of these 100 genes is biologically meaningful because 10 false positives are expected from this data set. This example demonstrates that, although the confidence level based on the measurement for an individual gene may exceed 99.99% for two treatment conditions (local confidence of 0.0001), the confidence that this gene is differentially expressed might be only 90% (global confidence of 0.9). This example defines an ad hoc method of comparing control to control data to derive an estimate of an experiment-wide false positive level.
We applied this ad hoc method for the estimation of false positive levels of the experiments described here by averaging the four measurements for each gene from the duplicate control filters of each experiment hybridized with labeled targets from the control strain IH-G2490 (lrp ϩ ) and comparing these averaged values of control data from experiments 1 and 3 to the averaged values of control data from experiments 2 and 4 ( Fig. 1). In another analysis, we compared control data from experiments 1 and 4 to the averaged values of control data from experiments 2 and 3. Equivalent comparisons were performed with filters hybridized with labeled targets from the experimental strain (IH-G2491 (lrp Ϫ )). These particular two-by-two (control versus control or experimental versus experimental) comparisons were chosen because they average across experimental errors and biological differences both among filters and RNA preparations. The results of a simple t test analysis of these data were ranked in ascending order of the p values for each gene measurement based on the t test distribution. The results of these statistical analyses are shown in Table I.
The data in Table I show that, among the control versus control or experimental versus experimental comparisons, no genes exhibited a p value less than 0.0001. However, an examination of the p values observed when the control data were compared with the experimental data shows that 12 genes were differentially expressed with a p value less than 0.0001. Thus, we can be fairly certain that these 12 genes are differentially expressed because of biological reasons and not by chance occurrences driven by experimental error and biological variance. On the other hand, we know from the literature that more than 12 genes are regulated by Lrp (5, 6). This demon- strates that, given the experimental errors inherent in this experiment, the differentially expressed levels of most genes cannot pass this stringent statistical test. Therefore, to identify other differentially expressed genes, we must lower the stringency of our statistical criterion. The data in Table I show that, as the p value is raised to 0.005, we observe an additional 122 genes that are differentially expressed at this threshold level. At the same time, raising the statistical threshold to 0.005 reveals an average of 3.75 genes that appear differentially expressed with a p value equal to or less than 0.005 when the control or experimental data sets are compared with themselves. This means that, given this complete data set from four replicate experiments, we expect at least 3.75 false positives among the 134 genes differentially expressed with a p value equal to or less than 0.005. Therefore, our global confidence in the identification of any one of these 134 genes as differentially expressed genes is estimated to be 97%.
It should be emphasized that relaxing the p value threshold rapidly increases the average number of false positives in the control (lrp ϩ versus lrp ϩ or lrp Ϫ versus lrp Ϫ ) data sets relative to the number of genes differentially expressed at the same p value in the experimental (lrp ϩ versus lrp Ϫ ) data set and, therefore, decreases the confidence with which differentially expressed genes can be identified.
Improved Statistical Inference from DNA Array Data Using a Bayesian Statistical Framework-A simple t test evaluates the distance between the means of two groups normalized in terms of the within-group standard deviations. The result is that large differences between genotypes for any given ORF will be declared nonsignificant if the expression level of that ORF is unreplicable within experimental treatments. Conversely, small differences in expression will be determined to be statistically significant for a given ORF if expression levels for that ORF are replicable within treatments. In short, the t test statistic is constructed by scaling the difference in gene expression levels between genotypes relative to the observed variances within genotypes. p values based on the t test statistic range from 1.0 for gene expression levels with identical values associated with the null hypothesis to very small p values for differential gene expression levels that are highly significant.
In a perfect world, all DNA microarray experiments would be highly replicated. Such replication would allow accurate estimates of the variance within experimental treatments to be obtained, and the t test would perform well, i.e. the variance for each gene measurement would be based on many measurements for that gene. However, DNA microarray experiments are expensive and time-consuming to carry out. As a result, the level of replication within experimental treatments is often low. This results in poor estimates of variance and a correspondingly poor performance of the t test itself. On the other hand, we have shown that the confidence in the interpretation of DNA microarray data with a low number of replicates can be improved by using a Bayesian statistical approach that incorporates information of within treatment measurements (12,13). This results in a more consistent set of differentially expressed genes identified with fewer replicates. The Bayesian approach is based on the observation that genes of similar expression levels exhibit similar variance. Thus, more robust estimates of the variance of a gene can be derived by pooling neighboring genes with comparable expression levels. For the analysis of the data reported here, we ranked the mean gene expression levels of the replicate experiments in ascending order, used a sliding window of 101 genes, and assigned the average standard deviation of the 50 genes ranked below and above each gene as the background standard deviation for that gene. The variance of any gene within any given treatment then can be estimated by the weighted average of the treatment-specific background variance and the treatment-specific empirical variance across experimental replicates. In the Bayesian approach employed in this study, the weight given to the within experiment gene variance estimate is a function of the number of experimental replicates. This leads to the desirable property that the Bayesian approach employing such a regularized t test converges on the same set of differentially expressed genes as the simple t test but with fewer replicates (12).
A comparison of the results of statistical analyses employing a simple t test and a regularized t test is shown in Table II. Here, the simple ad hoc method of comparing controls to controls was used to demonstrate that the number of false positives expected at a given p value is lower when the Bayesian statistical framework is employed. For example, only 2 false positives are expected at a p value threshold less than 0.005 with the Bayesian regularization, whereas 3.75 false positives are expected at this same p value threshold with the t test alone. At the same time, 188 differentially expressed genes with a p value less than 0.005 are observed with the regularized t test, whereas only 134 genes are identified at this same threshold with the simple t test (Table II). Thus, more genes are identified with a lower false positive level and a higher global confidence level. In other words, complementing the empirical variance of the four experimental measurements for each gene with the corresponding background variance within an experiment improves our confidence in the identification of differentially expressed genes and the number of genes that can be identified at a given p value threshold based on a t test distribution.
Although the data in Table II show that the Bayesian statistical approach using a regularized t test identifies more genes with a higher level of global confidence than the simple t test, the natural question that arises is whether these genes are true positives, i.e. whether these are Lrp-regulated genes. This question is addressed by the data shown in Fig. 3. For example, of the 44 genes differentially expressed between lrp ϩ and lrp Ϫ strains with a p value less than 0.001 identified by a simple t test, 10 are known to be Lrp-regulated (Table III). However, among the 39 genes differentially expressed between lrp ϩ and lrp Ϫ strains with a p value less than 0.0001 identified by the Bayesian approach, 17 are known to be Lrp-regulated (Table IV). Why does the regularized t test identify more Lrp-regulated genes? The answer to this question lies in the fact that all of the genes identified to be differentially expressed with a p value less than 0.005 with the regularized t test exhibit -fold changes greater than ϳ1.7-fold (Fig. 3B). However, many genes identified to be differentially expressed with a p value less than 0.005 with the simple t test exhibit -fold changes as small as ϳ1.2-fold (Fig.  3A). Furthermore, the 100 genes with the lowest p values identified as differentially expressed by both methods contain only 43 genes in common. Thus, many of the genes identified by the simple t test that are excluded by the Bayesian approach are genes that show small -fold changes. In general, these genes with small -fold changes identified by the simple t test are associated with "too good to be true" within treatment variance estimates, reflecting underestimates of the within treatment variance when the number of observations is small. The elimination of this source of false positives by the Bayesian approach improves the identification of true positives. However, although this is desired, genes that are truly differentially expressed with small -fold changes in the range of ϳ1.2-1.7-fold will also be eliminated by the Bayesian approach. For example, of the 16 genes of the top 100 with the lowest p values identified by the simple t test that are known to be regulated by Lrp, one was not identified by the Bayesian method. This Lrp-regulated gene that did not pass the regularized t test was the sdaC gene, previously reported to be regulated by Lrp 3-fold (14,15) and measured to be regulated 1.9-fold in the experiment performed with the DNA arrays. Nevertheless, although this gene is lost, the overall performance of the regularized t test surpasses that of the simple t test. At first glance it might appear that the Bayesian approach validates the often-used 2-fold rule for the identification of differentially expressed genes (16), i.e. the identification of genes differentially expressed between two experimental treatments with a -fold change greater than 2 in, for example, three of four experiments. This type of reasoning is based on the intuition that larger observed -fold changes can be more confidently interpreted as a stronger response to the experimental treatment than smaller observed -fold changes, which of course is not necessarily the case. An implicit assumption of this reasoning is that the variance among replicates within treatments is the same for every gene. In reality, the variance varies among genes and it is critical to incorporate this information into a statistical test (12). Clearly, with a background standard deviation of, for example, 50, differential expression measurements of 200/100 and 20,000/10,000 have different significance. This point is further emphasized by simply examining the scatter plots in Fig. 3. Here, many genes that appear differentially expressed greater than 2-fold do not exhibit p values less than 0.005 and a global confidence level of at least 97%. This does not mean that these might not be Lrp-regulated genes; it simply means that they are false negatives that cannot be identified at this level of confidence.
Commonly used software packages do not possess algorithms for implementing Bayesian statistical methods. Therefore, we  developed a statistical program, CyberT, which does accommodate this approach. We use the statistical tools incorporated into CyberT to compare and analyze the gene expression data from the experiments described here. CyberT is available for on-line use on the genomics web site of the University of California, Irvine.
A Computational Method to Determine False Positive Levels-A computational version of our ad hoc method for estimating false-positive levels has been recently described (17). The basic idea is to consider the p values as a new data set and to build a probabilistic model for this new data. When control data sets are compared with one another (i.e., no differential gene expression), it is easy to see that the p values ought to have a uniform distribution between 0 and 1. In contrast, when data sets from different genotypes or treatment conditions are compared with one another, the distribution of p values will tend to cluster more closely to 0 than 1, i.e., there will be a subset of differentially expressed genes with "significant" p values. One can use a mixture of ␤ distributions to model this distribution of p values in the form shown in Equation 1.
For i ϭ 0, we use r 0 ϭ s 0 ϭ 1 to implement the uniform distribution as a special case of a ␤ distribution. Thus, K ϩ 1 is the number of components in the mixture and the mixture coefficients i represent the prior probability of each component. In many cases, two components (K ϭ 1) are sufficient but sometimes additional components are needed. In general, the mixture model can be fit to the p values using the Expectation Maximization algorithm or other iterative optimization methods to determine the values of the , r, and s parameters (17).  If we set a threshold T below which p values are considered significant and representative of change, we can estimate the rates of false positives and false negatives. The posterior probability for differential expression (PPDE) then can be calculated for each gene in the experiment with p value p as PPDE(p) according to Equation 6.
Alternatively, one can calculate a posterior probability of differential expression PPDE(Ͻ p) for values below a certain threshold p according to Equation 7.
The distribution of p values from our lrp ϩ versus lrp Ϫ data is shown in Fig. 4, and a plot of PPDE(p) and PPDE (Ͻ p) values versus p values is shown in Fig. 5. A comparison of the ad hoc method for determining the global significance for the differ- ential expression of a given gene and the computational method is presented in Table V. It is satisfying to see that these data compare well. It is clear from the data of Fig. 5 that for each p value threshold T, there is a tradeoff between the rates of true and false positives. A low conservative p value threshold leads to few FP but may also reduce the TP rate. A large p value threshold ultimately allows one to recover all the TP but at the cost of increasing the FP rate. This fundamental tradeoff is usually captured in statistics by using a receiver operating characteristic curve obtained by plotting the true hit rate (or sensitivity) defined by TP/(TP ϩ FN) versus the false hit rate, FP/(FP ϩ TN) (86). In the mixture model above, a simple calculation shows that for a given p value threshold T, With two components in the mixture (K ϭ 1), the last expression reduces to the following.
Thus, for our Lrp data the receiver operating characteristic curve in Fig. 6 is simply the distribution function of the second ␤ component in the mixture. For instance, this curve demonstrates that with a 76% true hit rate we can expect a 20% false hit rate. The Functional Classes of Genes Differentially Expressed in lrp ϩ and lrp Ϫ E. coli Strains-To facilitate the following discussions, we limit our considerations to the 100 genes differentially expressed with the lowest p value based on a regularized t test. The 100 genes differentially expressed between lrp ϩ and lrp Ϫ E. coli strains with a p value less than 0.0014 and a PPDE greater than 0.98 are listed in Table VI. In the text we simply refer to the -fold change for each gene. However, as emphasized above, it should be kept in mind that reporting -fold changes is incomplete and can be misleading. For this reason, mean expression levels, standard deviations, p values, and PPDE values for the 39 genes with p values less than 0.0001 are reported in Table IV. Additional statistical data for the remaining 61 genes with a p value less than 0.0014 as well as all genes expressed at a level of above background in all four experiments can be found in the supplemental data (available in the on-line version of this article).
Because the physiological purpose of Lrp is presumed to be the coordination of gene expression levels with the nutritional and environmental conditions of the cell (5,6), it was pleasing to discover that most of the genes affected by Lrp are ones that encode products involved in small molecular and macromolecule synthesis or degradation, as well as gene systems involved in small molecule transport and environmental stress responses. These genes can be sorted into the functional groups shown in Fig. 7; they also are listed in Table VI and discussed below.
Small Molecule Biosynthesis-Among the genes differentially expressed between lrp ϩ and lrp Ϫ strains, 11 are genes required for amino acid biosynthesis. Of these, the ilvG, ilvM, leuB, and serA genes are members of operons previously reported to be regulated by Lrp.
The ilvG and ilvM genes, the first two genes of the ilvG-MEDA operon, encode the two subunits of acetohydroxy acid synthase II, one of three isoenzymes catalyzing the first step of the parallel pathway for L-valine and L-isoleucine biosynthesis. We have previously used an ilvP G ::lacZ construct to measure ␤-galactosidase activities in the same isogenic lrp ϩ and lrp Ϫ strains employed in this study (7). These results showed that ␤-galactosidase was increased 2.5-fold in the lrp Ϫ mutant strain. The data reported here are consistent with this earlier report. We have also described the presence of a constitutive internal promoter, ilvP E , located between the ilvM and ilvE genes of this operon (18). This affect of the internal promoter is apparent in our DNA microarray data; the expression of the operon distal ilvEDA genes is decreased only 1.2-fold.
It is interesting that two other genes of the aspartate family of amino acids previously unknown to be regulated by Lrp appear in this list (19). These are thrL, the leader polypeptide of the threonine operon (20 -23), and asd, the structural gene for aspartyl-semialdehyde dehydrogenase (24,25). This enzyme is involved in the conversion of oxaloacetate to homoserine, a precursor of threonine and isoleucine. These findings suggest the possibility that all of the genes of the aspartate family that convert the TCA cycle intermediate, oxaloacetate, to amino acids might be sensitive to Lrp-mediated regulatory effects.
The serA gene encodes phosphoglycerate dehydrogenase, the first enzyme specific for serine biosynthesis. Newman and colleagues (6, 15) have reported that the transcriptional level of serA is decreased 6-fold in a lrp Ϫ strain. Our results exhibit a similar transcriptional regulation. Newman and colleagues  (15,26) also have reported that the expression level of the leu operon is decreased 11-fold in a lrp Ϫ strain and showed that the growth rate of a lrp Ϫ strain is increased by adding leucine to the growth medium. On the other hand, Landgraf et al. (27) have suggested that Lrp-mediated regulation of the leu operon is indirect and reported a much smaller effect (1.4-fold). Our studies agree with those of Landgraf et al.
Because these known Lrp-regulated genes identified by our experiments are detected with a high level of measurement accuracy and confidence, we can be similarly confident that the expression of other genes in this group are also members of the Lrp regulon. However, the differential expression of these newly identified genes could be the consequences of either primary or secondary Lrp effects. An obvious way to discern whether or not the operons containing these genes are directly regulated by Lrp would be to search for Lrp binding sites in their promoter-regulatory regions (10). Unfortunately, because of the degeneracy of the consensus Lrp binding sequence, this is not possible. Even when a 3 of 15 mismatch is allowed, 60% of all regions 500 base pairs upstream of all E. coli ORFs contain at least one putative Lrp binding site. Thus, it is difficult to determine at this time whether the differential expression of these genes is directly or only indirectly affected by Lrp.
Small Molecule Transport-22 of the 100 genes listed in Table VI are involved in small molecule transport. Of these, 11 have been documented to be regulated by Lrp. Products of the livJ and livKHMGF genes are components of two transport systems with high affinity for leucine. The livJ gene product binds leucine, isoleucine, and valine, whereas the livK gene product is specific for leucine alone. These two systems share a set of membrane components, products of the livHMGF genes. Haney et al. (28) have reported that both of these operons are repressed by Lrp in the presence of high concentration of leucine. Bhagwat et al. (29) have reported that in the absence of leucine the expression of the livJ gene is unaffected by Lrp and that the expression of the livKHMGH operons is activated approximately 15-fold. Because the experiments reported here were also performed in the absence of leucine, we would expect similar results and, in fact, we observe a 2.5-10-fold activation of the genes for the livKHMGF operon. However, our results suggest that Lrp is also responsible for a 2-fold repression of livJ under these growth conditions.
The oppABCDF operon contains genes encoding a periplasmic binding protein and transport permease proteins for a wide range of tripeptide transport systems. Austin et al. (30) have reported that this operon exhibits high constitutive expression in a lrp Ϫ strain. Accordingly, our results show that the expression of the oppA and oppB genes is increased 15-and 20-fold, respectively, but that the expression of the oppC, oppD, and oppF genes are increased to a lesser extent. These data suggest the possibility of an unidentified internal promoter between oppB and oppC.
Four proteins, the malEFG and -K gene products, are required for maltose uptake in E. coli. These four genes are arranged in two operons, malEFG and malK-lamB-malM. Tchetina et al. (31) have reported that transcription of both of these operons are decreased 50 -70% in a lrp Ϫ strain grown in glycerol. Our results demonstrate that the transcription level of both operons also is decreased approximately 80% in a lrp Ϫ strain grown in a glucosesupplemented minimal MOPS medium. However, because malE is the only gene of either operon that passed our statistical cut-off, we cannot be as confident that Lrp also affects the expression of the other genes of these operons (see supplemental data, available in on-line version of this article).
Of the remaining newly identified genes of this class, we find two examples of two genes in the same operon, artP and artI of the artPIQMJ operon and potH and potG of the potFGHI operon. artP and artI are involved in arginine transport (32). The expression of these genes is increased in the lrp Ϫ strain 6.3-and 4.6-fold, respectively. The potH and potG genes are members of the potFGHI operon involved in the transport of putrescine (33). The expression of these genes is decreased ϳ3-fold in the lrp Ϫ strain. In addition to these previously documented systems, our results suggest that transport systems involved in the transport of various dipeptides, carbohydrates, organic acids, alcohols, and inorganic compounds are also influenced by Lrp (34 -42).
Carbon and Energy Metabolism-Besides its influence on the transport of organic acids, Lrp also influences the expression of genes involved in the metabolism of these compounds. For example, the levels of expression of the structural genes for malP and manA are both altered in a lrp Ϫ strain (43)(44)(45). The remaining genes of this group listed in Table VI are involved either in the catabolism of glucose under aerobic conditions, or alternative carbon sources under anaerobic conditions (46 -50).
Macromolecular Biosynthesis-8 of the 100 genes listed in Table VI are involved in macromolecule synthesis. Three of these genes, hns, hupB, and himD, encode proteins that influence the structure and DNA topology of the E. coli chromosome. The remaining five are involved in protein synthesis or degradation. Of these, only one has been previously identified as a Lrp-regulated gene. This gene, lysU, encodes one of the two lysyl-tRNA synthetases in E. coli. Gazeau et al. (51,52) have reported that this gene is repressed 9-fold by Lrp. Under the conditions of our experiments, Lrp represses the expression of the lysU gene 7-fold. Of the four newly identified genes of this group, two, clpA and pepD, are proteases involved in protein degradation (53,54). The remaining two genes, rpmI and rmf, encode ribosome-associated proteins (55)(56)(57)(58).
The hns, hupB, and himD genes are the structural genes for the H-NS, HU, and IHF proteins that are important for the condensation of the chromosome into a nucleoid structure, for restraining negative supercoils, and in several cases for the regulation of gene expression (3, 59 -61). In each case, these genes are repressed by Lrp. It is likely that the effect of Lrpmediated effects on the expression levels of these global regulatory gene products might be responsible for many secondary changes in gene expression levels observed in lrp Ϫ strains.
Regulatory Proteins of Stress Responses-The expression levels of several proteins involved in cellular adaptation to nutritional and environmental assaults are increased 2-3-fold in the lrp Ϫ strain. These include the alternative sigma factors rpoS and rpoE. The rpoS sigma factor, 38 , is a central regulator of many stationary phase-responsive genes. Although it is induced to high levels in early stationary phase cells, it also is expressed, albeit at a lower level, during the exponential growth phase, where it functions as a general stress response element essential for prolonged cell survival (62,63). It is involved in the induction of several genes important for osmotic, oxidative, heat, and DNA damage stress responses (64). The rpoE gene encodes another sigma element, 24 , that also is expressed at a higher level in the lrp Ϫ strain. Although the major functions of coping with thermal stress are encoded by genes transcribed by 32 , genes transcribed by 24 are necessary for survival under extreme temperature stress conditions (65). Interestingly, although the expression level of the rpoH gene for 32 is unaffected in the lrp Ϫ strain, several genes regulated by this sigma factor, such as dnaK, dnaJ, clpA, clpB, clpP. htpG, htpX, gapA, and grpE, 3 all exhibit 2-3-fold increased expression levels in the lrp Ϫ strain.
The rpoE and rseA genes are both members of rpoE-rseABC operon, and the resA gene product is a negative regulator of this operon (66). The remaining genes of this group, phoP, cpxP, aspC, and uspA, also up-regulated 2-3-fold, are similarly involved in stress responses. phoP is a regulatory protein involved in a variety of environmental stress signals including magnesium starvation and nutritional deprivation (67). cpxP encodes a periplasmic protein important for PH tolerance (68,69). ahpC and uspA encode proteins involved in the oxidative stress response (70 -73). sodA encodes a superoxide dismutase also required for survival during oxidative stress conditions (37,74). However, the expression of this last gene of this group is decreased 2.4-fold in a lrp Ϫ strain.
Cell Structure-Of the four genes of this group listed in Table  VI, only the fimA gene has been reported to be regulated by Lrp. The gene product of the fimA gene is the major fimbrial subunit of type I pili. The expression of this gene is controlled by a cis-acting DNA element (switch). Several reports have shown that switching frequency is reduced in IHF and lrp Ϫ strains (75,76). In agreement with these reports, our data show that fimA expression is reduced 4.3-fold in a lrp Ϫ strain.
The remaining genes of this group, ompT, ompX, and slp, are outer membrane proteins involved in nutritional or environmental stress responses. The ompT gene encodes an outer membrane endopeptidase associated with pathogenicity in certain Gram-negative bacteria (77). Its activity is increased during conditions of temperature stress (78). ompX encodes an outer membrane protein required for E activity during temperature stress in some E. coli strains (79). Finally, slp encodes the starvation lipoprotein induced during nutritional deprivation (80). IH-G2490 or Strain IH-G2491-Only those genes exhibiting an expression level greater than zero in all experiments were used for statistical analysis as described above. Gene measurements containing zero expression values were set aside and discussed here. Among this set of genes, 23 genes with zero expression values for all measurements of one genotype, and all values greater than zero for all measurements of another genotype for each experiment, were identified. The significance of these results (Table VII) was analyzed by ranking these genes in ascending order according to their coefficients of variance of the four greater than zero measurements.

Examples of Genes Only Expressed in Either Strain
Four of the 23 genes in Table VII are known Lrp-regulated genes contained in the gltBDF, ilvIH, gcvTHP, and stpA operons. The genes of the of the gltBDF operon encode a regulatory protein (gltF) and the two subunits of glutamate synthase (gltB and D), an enzyme involved in ammonia assimilation. Ernsting et al. (81,82) have reported that this enzyme activity is very low or missing in a lrp Ϫ strain. In agreement with this report, our results show that the mRNA level of gltB is below detection (Table VII) and gltD mRNA (Table VI) is reduced 19-fold in the lrp Ϫ strain. On the other hand, the mRNA level of the regulatory gltF gene is reduced only 1.9-fold (see supplemental data, available in the on-line version of this article). This result suggests the presence of an internal, lrp Ϫ independent promoter between the gltD structure gene and gltF regulatory gene of this operon.
Wang et al. (83) have shown that the ilvIH operon that encodes acetohydroxy acid synthase III of the branched chain amino acid pathway is repressed 30-fold in a lrp Ϫ strain. The repressed level of transcripts of this operon are undetectable above background in the experiments reported here (Table  VII). Lin et al. (15,84,85) reported that the gcvTHP operon that encodes proteins that cleave glycine to produce one-carbon units and ammonia is repressed 20-fold in a lrp Ϫ strain. We find the transcript for one of the genes of this operon (gcvH) is undetectable (Table VII), and the remaining two genes measured with p values higher than our statistical cut-off level are Nylon Filter Data Versus Affymetrix GeneChip Data-When different array formats are used that require different target preparation methods, the magnitudes and sources of experimental errors are surely different. This raises the question of whether or not results obtained from experiments performed with different DNA array formats can be compared with one another, or indeed even whether comparable results can be obtained. To address this question, we have assessed the results obtained from DNA array experiments performed with pre-synthesized nylon filters hybridized with 33 P-labeled cDNA targets prepared from total RNA with random hexamer primers and in situ synthesized Affymetrix GeneChips hybridized with enriched, biotin-labeled, mRNA targets obtained from the same total RNA preparations.
For the GeneChip experiments, the exact same four control and experimental pairs of pooled RNA preparations used in the lrp ϩ versus lrp Ϫ nylon filter experiments described above (Fig.  1) were used for hybridization to four pairs of E. coli Affymetrix GeneChips. However, because of economic considerations, each experiment was not performed in duplicate; hence, only one measurement for each gene was obtained on each chip. Thus, instead of having four measurements for each gene expression level for each experiment (Fig. 1), only one measurement was obtained from each GeneChip (Fig. 2). On the other hand, this single measurement is the average of the difference between hybridization signals from ϳ15 perfect match and mismatch probe pairs. 4 Although these are not equivalent to duplicate measurements because different probes are used, these data do increase the reliability of each gene expression level measurement.
Because only one measurement from one GeneChip was obtained for each genotype for each experiment, it was not possible to distinguish sources of error contributed by differences among GeneChips and from differences among target preparations as we have previously reported for the filter data (10). Nevertheless, it was possible to use the ad hoc control versus control and PPDE computational methods to compare data among the four control GeneChips hybridized with independent biotin-labeled mRNA targets from E. coli strain IH-G2490 (lrp ϩ ). These methods were used to estimate the number of false positives expected at given p value thresholds. These results for the GeneChip data, as well as the nylon filter data, are presented in Tables VIII and V, respectively.
It is clear from these results that the filter data identifies more differentially expressed genes with lower p values and higher confidence levels than the GeneChip data. This is not surprising because, as explained above, each gene measurement level in the filter data set is the average of four duplicate measurements from two separate filters, whereas each gene measurement in the GeneChip data set is based on a single measurement from each experiment. In fact, when the top 100 genes with the lowest p values from the nylon filter and Gene-Chip experiments are compared, only 17 genes are common to both lists. This lack of correspondence is likely the result of the greater variance among the GeneChip measurements and the fact that fundamentally different DNA array formats are compared. However, when a Bayesian statistical framework is applied to the analysis of each data set, the correspondence is nearly doubled and 27 genes are found to be common to both lists. These results further strengthen the conclusions of Long et al. (12) that statistical analyses performed with a Bayesian prior identify genes that are up-or down-regulated more reliably than approaches based on a simple t test when only a few experimental replications are possible.
The GeneChip results described above were obtained from raw data that were background subtracted and normalized to the total signal on each DNA array, and analyzed with the CyberT statistical software. Affymetrix has developed its own empirical algorithms for the analysis of GeneChip data that are commercially available in a software package, Microarray Suite 4.0. Below we compare the identification of differentially expressed genes identified with the CyberT and Microarray Suite 4.0 software.
Because the Affymetrix software allows the comparison of only one GeneChip pair at a time, it was run on each of the four independent experiments comparing lrp genotypes. Each comparison identified between 500 and 700 genes that the Affymetrix software calls as marginally increased or decreased, or increased or decreased (Table IX). However, filtering the results from these four independent experiments identified only 55 genes that the Affymetrix software called differentially expressed in all four experiments. Remarkably, comparison of these 55 genes to the 55 genes exhibiting the lowest p values identified by the CyberT software employing a Bayesian statistical framework revealed 39 genes in common with both lists. Among these were 21 known Lrp-regulated genes.
These results illustrate several important points. First, they stress the importance of replication when only two conditions are compared. Little can be learned about those genes regulated by Lrp from the analysis of only one experiment with one GeneChip pair because an average of 600 genes were identified as differentially expressed, only 55 of which can be reproduced in four independent experiments. Furthermore, in the absence of statistical analysis, it is not possible to determine the confi-  dence level and rank the reliability of any differentially expressed gene measurement identified with the Affymetrix software. This is, of course, important for prioritizing genes to be examined by additional experimental approaches. Finally, and most importantly, these results demonstrate that remarkably similar answers can be obtained from fundamentally different DNA microarray formats when the raw data from each set of experiments are analyzed by the statistical methods employed here. Summary-It is indicative of the power of gene expression profiling experiments that two thirds of the genes measured here with a 97% global confidence were previously unknown to be members of the Lrp regulatory network. Furthermore, nearly one third of these genes are genes of unknown function. As more experiments of this type are performed, as more functions are assigned to the gene products of hypothetical ORFs, and as bioinformatics methods to identify degenerate protein binding sites typical of proteins that bind to many DNA sites are developed, an even clearer picture of the Lrp genetic regulatory network in E. coli will emerge. However, even at this early stage in the development and execution of DNA array technologies and data analysis methods, the results presented here support previous suggestions that the physiological role of Lrp is to monitor the nutritional state of the cell to adjust its metabolism to changing nutritional conditions and, in cooperation with other regulatory networks, to coordinate these changes with the physical environment of the cell.