Global Gene Expression Profiling in Escherichia coli K12

We have used nylon membranes spotted in duplicate with full-length polymerase chain reaction-generated products of each of the 4,290 predicted Escherichia coli K12 open reading frames (ORFs) to measure the gene expression profiles in otherwise isogenic integration host factor IHF+ and IHF− strains. Our results demonstrate that random hexamer rather than 3′ ORF-specific priming of cDNA probe synthesis is required for accurate measurement of gene expression levels in bacteria. This is explained by the fact that the currently available set of 4,290 unique 3′ ORF-specific primers does not hybridize to each ORF with equal efficiency and by the fact that widely differing degradation rates (steady-state levels) are observed for the 25-base pair region of each message complementary to each ORF-specific primer. To evaluate the DNA microarray data reported here, we used a linear analysis of variance (ANOVA) model appropriate for our experimental design. These statistical methods allowed us to identify and appropriately correct for experimental variables that affect the reproducibility and accuracy of DNA microarray measurements and allowed us to determine the statistical significance of gene expression differences between our IHF+ and IHF− strains. Our results demonstrate that small differences in gene expression levels can be accurately measured and that the significance of differential gene expression measurements cannot be assessed simply by the magnitude of the fold difference. Our statistical criteria, supported by excellent agreement between previously determined effects of IHF on gene expression and the results reported here, have allowed us to identify new genes regulated by IHF with a high degree of confidence.

It has been more than forty years since the pioneering studies of Jacob and Monod (1) on the regulation of the genes of the lac operon of Escherichia coli established the basic paradigm for protein-mediated regulation of gene expression. Since then, the molecular mechanisms responsible for the regulation of scores of operons in this organism have been elucidated. However, although a great deal has been learned about the regulation of individual operons, much less is known about the global regulatory mechanisms that coordinate the expression of these operons with one another and with the nutritional and environmental growth state of the cell.
Much of what is known about global gene regulation has been inferred from the analysis of O'Farrell two-dimensional electrophoresis gels for the resolution of individual proteins expressed in cells grown under two experimental conditions. The heat shock- and starvation-induced proteins, for example, were originally identified by this method (2). However, the identification of each of the cellular proteins on these gels is a laborious task, and the level of expression of each protein in the cell is not easily quantified. In fact, the expression of only about 250 proteins has so far been characterized by this method.
Classical genetic and biochemical studies have also identified global regulatory proteins that respond to small molecule co-regulators to affect the expression of large sets of genes. These proteins, such as catabolic repressor protein and leucine-responsive regulatory protein (Lrp), modulate the expression of stimulons that encompass several regulons, each of which may contain multiple operons of common function (3). Stimulons are generally regulated by nutritional (e.g. glucose starvation) or environmental (e.g. heat shock) signals. For example, it is well known that catabolic repressor protein and its co-activator, cyclic-AMP, are required for the induction of carbon utilization operons under glucose starvation conditions (4).
Another class of global regulatory proteins, which includes members such as H-NS and integration host factor (IHF), consists of DNA architectural proteins involved in the condensation of the bacterial nucleoid. They are abundant proteins that bind to many sequence-specific but degenerate DNA sites and affect processes that require DNA duplex destabilization such as DNA replication, recombination, and transcription (5-10). These proteins also affect the expression of many genes (operons), but unlike the global regulators of the previous class, they do not respond to small molecule metabolic co-regulators and there is no obvious metabolic coherence among the genes whose expression they affect.
IHF was initially identified as the product of a gene required for the site-specific integrative recombination of phage λ into the E. coli chromosome (11). Subsequently, it has been discovered that IHF affects many cell functions including a variety of site-specific recombination events and DNA replication (12). In addition, it was found that IHF influences the expression levels of many genes. For example, Freundlich et al. (13) used O'Farrell two-dimensional electrophoresis gels to demonstrate a difference in the levels of 15-20% of the proteins expressed in IHF mutant and isogenic parent cells grown under conditions comparable with those reported here. The mechanistic role for IHF in these processes has been largely ascribed to its ability to bend DNA to bring distant sites on the bacterial chromosome together for a biological function. In the case of phage λ, IHF facilitates the integration reaction by bringing distant integrase binding sites into the proximity of the bacterial and phage attachment sites (14). IHF has also been shown to function as a DNA looper protein to facilitate interactions between regulatory proteins bound at upstream sites and RNA polymerase at downstream promoter sites (15,16). Because these functions involve IHF binding to site-specific high affinity sites and because of the high intracellular concentration of this abundant chromosomal organizer protein, these IHF sites are likely saturated under all physiological conditions (17). Thus, unlike other regulatory proteins that bind small molecule effectors that affect their DNA binding properties, IHF functions as an architectural component of DNA structures that affect the constitutive or basal level expression of many promoters. This may explain the lack of any obvious metabolic coherence among the genes whose expression is affected by IHF (12).
It has recently been demonstrated that IHF can also inhibit the transition of supercoiling-induced DNA duplex destabilized (SIDD) sites from a B-form to a partially denatured duplex structure (18-20). This results in the translocation of the superhelical energy (negative twist) normally absorbed by the SIDD site to another site in a superhelically constrained DNA domain (19,20). In the case of the ilvGMEDA operon, required for the biosynthesis of the branched chain amino acids in E. coli, IHF-mediated translocation of superhelical energy from an upstream SIDD site results in a destabilization of the DNA duplex in the −10 region of the downstream ilvPG promoter. This supercoiling-dependent, IHF-mediated duplex destabilization in the promoter region facilitates open complex formation and an increase in transcription into the structural genes of this operon (21).
It is known that the global superhelical density of the chromosome varies over a wide range during different phases of the bacterial growth cycle and in response to various types of environmental assaults such as osmotic, temperature, and anaerobic shocks and nutritional upshifts and downshifts. Our previously published results suggest that the effect of IHF on the expression of the genes of the ilvGMEDA operon is to amplify basal level expression of this operon (independently of operon-specific controls) in response to small changes in the global superhelical density of the bacterial chromosome in order to coordinate the capacity for branched chain amino acid biosynthesis with the environmental and nutritional growth conditions of the cell. Determining whether this represents a general control mechanism for coordinating the expression of other genes (operons) will require knowledge of the location and thermodynamic stability of each of the SIDD sites on the E. coli chromosome at the different physiological superhelical densities encountered under different growth conditions. This information, together with the location of each of the high affinity IHF sites and the effects of IHF on global gene expression profiles under these same conditions, should facilitate an assessment of the generality of this mechanism. Indeed, calculations to identify the location and thermodynamic stability of all of the SIDD sites on the E. coli chromosome at the global chromosomal superhelical densities observed in stationary phase and aerobically and anaerobically cultured cells growing in glucose minimal MOPS medium are currently in progress (22). We are also currently developing genomic SELEX methods to isolate in vivo cross-linked IHF-chromosomal DNA fragments for hybridization to DNA arrays containing probes for all of the E. coli inter-ORF (upstream regulatory) regions to identify each of the IHF-binding sites.
In this report we describe the use of nylon membranes spotted in duplicate with full-length polymerase chain reaction-generated products of each of the 4,290 predicted E. coli K12 ORFs to measure the gene expression profiles in otherwise isogenic IHF+ and IHF− strains growing in glucose minimal MOPS medium. To evaluate the data generated by these gene expression profiling experiments we used a linear analysis of variance model appropriate for the experimental design employed in this study. These statistical methods allowed us to identify and minimize experimental variables that affect the reproducibility and accuracy of DNA microarray measurements and to determine the statistical significance of observed differences between expression levels of each ORF in these two genotypes. Together with future knowledge of the location and in vivo occupancy of IHF at its high affinity chromosomal binding sites and the location and stability of the SIDD sites around the E. coli chromosome in wild-type and IHF− strains grown under different environmental and nutritional conditions, these data will allow us to assess the generality of global gene regulation by IHF-mediated translocation of superhelical energy from one site on the chromosome to another.

MATERIALS AND METHODS
Chemicals and Reagents-Avian myeloblastosis virus reverse transcriptase, RNase-free DNase I, and Sephadex G-25 Quickspin columns were obtained from Roche Molecular Biochemicals. Ribonuclease inhibitor III was purchased from Panvera/Takara, ultra pure deoxynucleoside triphosphates were from Amersham Pharmacia Biotech, random hexamer oligonucleotides were from New England Biolabs, and [α-33P]dCTP (2,000-3,000 Ci/mmol) was from NEN Life Science Products. DNA filter arrays (Panorama E. coli gene arrays) and 3′ ORF-specific oligonucleotides were obtained from Sigma-Genosys Biotechnologies. All other chemicals were obtained from Sigma. All reagents and baked glassware used in RNA manipulations were treated with diethylpyrocarbonate.
Bacterial Strains and Growth Conditions-The construction of the isogenic E. coli strains IH100 [ilvPG::lacZYA] and IH105 [ilvPG::lacZYA, ΔhimA] used in these experiments has been described (21). In both strains the genes of the chromosomal lac operon are transcribed from the ilvPG promoter, which is activated by the upstream binding of IHF. Cells were grown in 25 ml of MOPS medium (23) containing 0.4% glucose in 125-ml Erlenmeyer flasks at 37°C with constant aeration.
Isolation of Total RNA-Total RNA was isolated from cells at an A600 of 0.5-0.6. Five-ml samples of cultures of growing cells were pipetted directly into 5 ml of boiling lysis buffer (1% SDS, 0.1 M NaCl, 8 mM EDTA) and mixed at 100°C for 2 min. These samples were transferred to 125-ml Erlenmeyer flasks, mixed with an equal volume of acid phenol (pH 4.3), and shaken vigorously for 6 min at 64°C. After centrifugation, the aqueous phase was transferred to a fresh Erlenmeyer flask, and the hot acid phenol extraction procedure was repeated. The second aqueous phase was extracted with phenol-chloroform-isoamyl alcohol (25:24:1, pH 8) at room temperature and, finally, with chloroform-isoamyl alcohol (24:1). Total RNA was precipitated with two volumes of ethanol in 0.3 M sodium acetate (pH 5.3), washed with 70% ethanol, and redissolved in a 10 mM Tris, 1 mM EDTA solution (pH 8.0). The redissolved RNA was treated with DNase I (20 units in a 200-μl reaction mixture containing 10 mM MgCl2, 1 mM dithiothreitol, and 5 units RNAsin) for 15 min at 37°C and re-extracted first with phenol-chloroform-isoamyl alcohol (25:24:1, pH 8) at room temperature and then with chloroform-isoamyl alcohol (24:1). To ensure that the total RNA preparation was free of genomic DNA contamination, the DNase I treatment was repeated. After ethanol precipitation, the purified RNA was again washed with 70% ethanol and redissolved in a 10 mM Tris, 1 mM EDTA solution (pH 8.0). The RNA concentration was determined by absorption at 260 nm. Normally, three 5-ml samples from each culture were processed in parallel.
cDNA Synthesis and Labeling Conditions-For random hexamer-primed cDNA synthesis, 20 μg of total RNA and 37.5 ng of random hexamer primers were heated at 70°C for 3 min and quick-cooled on ice. cDNA synthesis was performed at 42°C for 3 h in a 60-μl reaction mixture containing the RNA and primer mixture, reverse transcriptase buffer (Roche Molecular Biochemicals), 1 mM each dATP, dGTP, and dTTP, 50 μCi [α-33P]dCTP, 20 units of ribonuclease inhibitor III, and 4 μl (88 units) of avian myeloblastosis virus reverse transcriptase. Labeled cDNA was separated from unincorporated nucleotides on Sephadex G-25 spin columns.
ORF-specific oligonucleotide-primed cDNA synthesis was performed in the same way except that 2 μg of total RNA was mixed with 8 μl of ORF-specific primers and unlabeled nucleotides, heated at 90°C for 2 min, and slow-cooled to 42°C. After the addition of ribonuclease inhibitor III, avian myeloblastosis virus reverse transcriptase, and [α-33P]dCTP, the reaction mixture was incubated at 42°C for 3 h. Approximately 50% incorporation of labeled nucleotides was usually achieved with both protocols.
DNA Microarray Hybridization-The nylon filters were soaked in 2× saline/sodium phosphate/EDTA for 10 min and prehybridized in 10 ml of hybridization solution (5× saline/sodium phosphate/EDTA, 2% SDS, 1× Denhardt's solution containing 0.1 mg/ml sheared herring sperm DNA) for 1 h at 65°C. 2-3 × 10^7 cpm of cDNA probe in 500 μl of the same solution was heated at 90-95°C for 10 min, rapidly cooled on ice, and added to 5.5 ml of hybridization solution. The prehybridization solution was removed and replaced with the hybridization solution. Hybridization was carried out for 15 to 18 h at 65°C. Following hybridization, each filter was rinsed with 50 ml of 0.5× saline/sodium phosphate/EDTA containing 0.2% SDS at room temperature for 3 min, followed by three washes in the same solution at 65°C for 20 min each. The filters were partially air-dried, wrapped in Saran Wrap, and exposed to a phosphor screen for 48-60 h. Filters were stripped by microwaving at half-maximal power in 500 ml of 10 mM Tris solution (pH 8.0) containing 1 mM EDTA and 1% SDS for 20 min. Stripped filters were wrapped in Saran Wrap and stored in the presence of damp paper towels in sealed plastic bags at 4°C.
Experimental Design-The experimental regimen for the four duplicate experiments reported here is diagrammed in Fig. 1. In Experiment 1, Filters 1 and 2 were hybridized with 33P-labeled, random hexamer-generated cDNA fragments complementary to each of three RNA preparations (IH100 RNA 1-3) obtained from the cells of three individual cultures of strain IH100 (IHF+). These three 33P-labeled cDNA preparations were pooled before the hybridizations. Following PhosphorImager analysis, these filters were stripped and hybridized with pooled, 33P-labeled cDNA fragments complementary to each of three RNA preparations (IH105 RNA 1-3) obtained from strain IH105 (IHF−). In Experiment 2, these same filters were again stripped, and this protocol was repeated with 33P-labeled cDNA fragments complementary to another set of three pooled RNA preparations obtained from strains IH100 (IH100 RNA 4-6) and IH105 (IH105 RNA 4-6) as described above. Another set of filters (Filter 3 and Filter 4) was used for Experiments 3 and 4 as described for Experiments 1 and 2. This protocol results in duplicate filter data for four experiments performed with the cDNA probes complementary to four independently prepared sets of RNA. Thus, since each filter contains duplicate spots for each ORF and duplicate filters are hybridized for each experiment, four measurements for each ORF were obtained from each of four experiments.
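The bookkeeping implied by this design can be sketched in a few lines of Python. The sketch below only enumerates the readings described above (two filters per experiment, duplicate spots per filter, four experiments per genotype); the dictionary layout and the function name measurements_for are illustrative assumptions, not part of the published analysis.

```python
from itertools import product

# Filter pairs used in each experiment, as described in the Experimental
# Design paragraph. The data structure itself is an illustrative assumption.
EXPERIMENT_FILTERS = {1: (1, 2), 2: (1, 2), 3: (3, 4), 4: (3, 4)}
GENOTYPES = ("IH100", "IH105")
SPOTS_PER_ORF = 2  # every ORF is spotted in duplicate on each filter


def measurements_for(genotype):
    """List every (genotype, experiment, filter, spot) reading for one ORF."""
    readings = []
    for experiment, filters in EXPERIMENT_FILTERS.items():
        for filter_id, spot in product(filters, range(1, SPOTS_PER_ORF + 1)):
            readings.append((genotype, experiment, filter_id, spot))
    return readings


per_genotype = {g: measurements_for(g) for g in GENOTYPES}
for genotype, readings in per_genotype.items():
    per_experiment = len([r for r in readings if r[1] == 1])
    print(genotype, "readings per experiment:", per_experiment,
          "total:", len(readings))
```

Under these assumptions each ORF has four readings per experiment and 16 readings per genotype.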
Data Acquisition-A commercial software package obtained from Research Imaging Inc. (DNA ArrayVision) was used to grid the phosphorimaging image, to record the pixel density of each of the 18,432 addresses on each filter, and to perform the background subtraction. 8,580 of the addresses on each filter are spotted with duplicate copies of each of the 4,290 E. coli ORFs. The remaining 9,852 empty addresses were used for background measurements. Since the backgrounds were quite constant, a global average background measurement was subtracted from each experimental measurement, although local background calculations were possible. Greater than four logs of linearity for the phosphorimaging-derived data was observed.
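As a rough illustration of the global background correction described above, the following Python sketch subtracts the mean of the empty addresses from each spotted address. It is not the DNA ArrayVision procedure; the function name and the pixel densities are hypothetical, and flooring negative results at zero is an assumption based on the statement under Statistical Methods that values below background are recorded as zero.

```python
from statistics import mean


def subtract_global_background(spot_values, empty_values):
    """Subtract the mean of the empty addresses from every spotted address.

    Values at or below background are reported as 0.0 (assumed convention).
    """
    background = mean(empty_values)
    return [max(value - background, 0.0) for value in spot_values]


# Hypothetical pixel densities, for illustration only.
empty_addresses = [10.0, 12.0, 11.0, 11.0]      # global background = 11.0
spotted_addresses = [111.0, 91.0, 9.0, 13.0]

corrected = subtract_global_background(spotted_addresses, empty_addresses)
print(corrected)
```

A local background calculation would instead use only the empty addresses near each spot; the text notes that this was possible but unnecessary because the backgrounds were nearly constant.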
Statistical Methods-The experimental design employed in this study consisted of four independent 33P-labeled cDNA preparations for each of two genotypes separately hybridized to two filter pairs, with each filter containing every E. coli ORF spotted in duplicate. This design is depicted in Fig. 1. For each spot, a background-subtracted estimate of expression level was obtained and scaled to total counts on the membrane. For any given spot, a number greater than zero (indicating an expression level) or a zero (indicating an expression level lower than background) was obtained. A full statistical model describing this design would be both complex and over-parameterized with respect to the number of expression measures for any given ORF. For these reasons and because we are only interested in testing for differences between genotypes, we opted for a reduced model. This model consists of t tests, which assume that filters are not involved in two-way statistical interactions. The t test evaluates the difference between the means of two groups employing the variance within groups as an error term. The result is that large differences between groups for any given ORF would tend to be declared non-significant if the expression level of that ORF were unreplicable within experimental treatments. Conversely, small differences in expression could be determined to be statistically significant for a given ORF if expression levels for that ORF were replicable within treatments. In short, the test statistic employed here was constructed by scaling the difference in gene expression levels between genotypes relative to the observed variances within genotypes. p values based on this test statistic range from 1.0, for gene expression levels with identical values, to very small p values for expression level differences that are highly significant.
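The behavior of such a test statistic can be illustrated with a short Python sketch. The text does not specify whether a pooled-variance or unequal-variance form was used, so the Welch (unequal-variance) form below is an assumption, as are the function name welch_t and the expression values, which are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    The difference between group means is scaled by the within-group
    variances. The unequal-variance form is an assumption of this sketch.
    """
    n_a, n_b = len(group_a), len(group_b)
    term_a = variance(group_a) / n_a
    term_b = variance(group_b) / n_b
    if term_a + term_b == 0:
        raise ValueError("both groups have zero variance; t is undefined")
    t_stat = (mean(group_a) - mean(group_b)) / sqrt(term_a + term_b)
    df = (term_a + term_b) ** 2 / (
        term_a ** 2 / (n_a - 1) + term_b ** 2 / (n_b - 1)
    )
    return t_stat, df


# Hypothetical values: a small (1.25-fold) but highly replicable difference...
t_small, df_small = welch_t([10.0, 11.0, 9.0, 10.0], [8.0, 9.0, 7.0, 8.0])
# ...and a larger (about 1.7-fold) difference with poor replication.
t_noisy, df_noisy = welch_t([10.0, 40.0, 5.0, 45.0], [5.0, 30.0, 2.0, 23.0])
print(round(t_small, 3), round(df_small, 1))
print(round(t_noisy, 3), round(df_noisy, 1))
```

In this toy example the smaller fold difference yields the larger t statistic, which is the point made above: significance depends on replicability within genotypes, not on the fold difference alone.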
A comprehensive discussion of the use of the t test and the modifications applicable to the analysis of DNA microarray data of the type presented here is available at the Genomics at the University of California, Irvine web site. To identify possible sources of experimental error we also used the t test to determine statistical differences among different filters hybridized with the same RNA preparation of the same genotype as well as differences among different RNA preparations of the same genotype hybridized to the same filters. Since these comparisons only involve factors that we expect to be highly replicable, an excess of hybridization signals showing significant differences in gene expression levels would indicate an experimental artifact. Depending on the magnitude of these artifacts, the detection of significant differences between genotypes requires replication over the variables (filters and RNA preparations) that lead to these false positives. The data presented here demonstrate that no experimental artifacts are contributed by filter differences and that experimental artifacts due to differences in RNA preparations of the same genotype can be eliminated by averaging over as few as four independent RNA preparations.
Data Accession-All of the raw and processed data for the experimental results reported here may be downloaded in tabular format through the online version of this paper (see footnote 4).

Footnote 4. The supplemental figure shows complete gene expression data for E. coli K12 strains IH100 (IHF+) and IH105 (IHF−). The values in each column are: column 1, gene name; columns 2-5, average of the duplicate gene measurements for each filter hybridized with cDNA pools from strain IH100 for experiments 1-4, respectively; columns 6-9, average of the duplicate gene measurements for each filter hybridized with cDNA pools from strain IH105 for experiments 1-4, respectively; column 10, number of non-zero IH100 measurements for Experiments 1-4; column 11, number of non-zero IH105 measurements for Experiments 1-4; column 12, average of the values in columns 2-5; column 13, average of the values in columns 6-9; column 14, the S.D. of the mean for the IH100 values for Experiments 1-4 (columns 2-5); column 15, the S.D. of the mean for the IH105 values for Experiments 1-4 (columns 6-9); column 16, the value of the t test statistic; column 17, the degrees of freedom associated with the t test; column 18, the ratio of the variances of the IH100 and IH105 measurements; column 19, the p values associated with the differences between the IH100 and IH105 measurements based on the t test distribution; column 20, the ratios of the means of the IH100 and IH105 data, where a negative sign implies decreased expression in strain IH105. These data may be viewed and downloaded from the on-line journal (http://www.jbc.org).

Random Hexamer Priming of Total RNA for cDNA Probe Synthesis Is Required for Accurate Measurements of Differential Gene Expression Levels in Bacteria-Since less than 10% of the total RNA in an E. coli cell is mRNA, it was feared that cDNA preparation by random hexamer priming of total RNA for hybridization to DNA microarrays might produce unacceptable backgrounds. We, therefore, used a set of 4,290 unique 25-base pair oligonucleotide primers specific for the 3′ end of each E. coli ORF (available from Sigma-Genosys) for primer-directed synthesis of α-33P-labeled cDNA probes. However, filter hybridizations with these cDNA probes detected only 1,760 genes with at least 2 of 4 non-zero, background-subtracted measurements on the control (IH100) filters. Equally disturbing, it was often observed that although some genes of a given operon were detected, others were not. For example, hybridization signals above background were detected for only three of the five genes of the ilvGMEDA operon and two of the three genes of the ilvPG::lacZYA operon. We, therefore, turned to the use of random hexamers for primer-directed synthesis of α-33P-labeled cDNA probes. Filter hybridization with these cDNA probes detected the expression of 2,592 genes with at least 2 of 4 non-zero, background-subtracted measurements for each of the control (IH100) and experimental (IH105) data sets. Thus, the expression levels of 832 more genes, including all of the genes of the ilvGMEDA and the ilvPG::lacZYA operons, were detected with the random hexamer-labeled probes. Furthermore, the expression level of the genes in each operon varied less than 3-fold (see Fig. 4).
The observation that the ORF-specific probes do not detect as many mRNAs as the random hexamer-labeled probes suggested that these probes do not hybridize to about one-third of the mRNAs either because of the hybridization conditions or because they hybridize to themselves or to one another. Furthermore, the wide variation of signals obtained with the ORF-specific-labeled probes for genes of a common operon can be explained by the expectation that a variable amount of α-33P would be incorporated into each ORF because of unequal hybridization efficiencies and different lengths of labeled cDNA fragments. On the other hand, since each mRNA (or mRNA fragment) is randomly primed with the random hexamers, the amount of α-33P label incorporated into each probe should be largely proportional to the ORF length.
To test these interpretations of our results, random hexamers or ORF-specific primers were used for primer-directed synthesis of α-33P-labeled cDNA probes derived from genomic DNA. In the case of the ORF-specific-labeled probes, we would expect that variable amounts of α-33P should be incorporated into each probe because the length of the synthesized probe might significantly exceed the length of the ORF, especially for short ORFs (or might be smaller for long ORFs). Thus, a single probe might extend into adjacent ORFs or ORFs encoded on the other strand. In these cases, the hybridization signal obtained for any given ORF spot on the array should depend on a complex set of parameters including the size of the ORF, the size of the probe fragment, the position of surrounding ORFs, the placement of the labeling primers, and the hybridization conditions. In contrast, since each region of the chromosome is randomly primed with the random hexamers, probes for every ORF should be generated, and the amount of α-33P incorporated into each probe should be largely proportional to the ORF length. The data presented in Fig. 2 confirm these expectations. Hybridization signals for each of the 4,290 ORFs on the array were observed with the random hexamer-labeled probes, whereas hybridization signals for only two-thirds of the ORFs on the array were observed with the ORF-specific primer-labeled probes (the same ratio of ORF-specific versus random hexamer-labeled probe hybridization signals observed with the cDNA probes generated from RNA). Furthermore, as predicted, the data displayed in Fig. 2A show that the hybridization signal for the random hexamer-labeled probes generated from genomic DNA was reasonably proportional to ORF length (r² = 0.41), but no significant correlation between ORF length and hybridization signal was observed with the ORF-specific-labeled probes (r² = 0.004; Fig. 2B).
An additional complication that might contribute to the disparate results obtained with random hexamer and ORF-specific-labeled probes could be the result of the widely differing degradation rates of the 25-base pair region of each message measured by the ORF-specific primers. It is known that rapid mRNA decay in E. coli is initiated by endonucleolytic cleavages followed by 3′ to 5′ exonucleolytic degradation (29,30). Therefore, if the initial endonucleolytic site were adjacent to the 3′ ORF-specific primer binding site, this region might be rapidly degraded, and little or no steady-state message would be extracted for primer extension labeling of this gene-specific transcript. In fact, we have previously demonstrated that different probe hybridization sites on the same mRNA do exhibit different half-lives (31). On the other hand, the random hexamer-labeling procedure produces RNA-DNA duplexes for primer extension from all of the partial degradation products of each message. Since the exonucleolytic clearance of mRNA degradation products to free nucleotides follows endonucleolytic message inactivation (at a presumably more constant rate), the random hexamers should detect the steady-state level of all of these intermediate degradation products. This suggests that although the functional half-lives of E. coli mRNAs are short and message-specific, the clearance of message degradation intermediates must occur at a more constant rate. If this were the case, the relative expression levels of genes measured with the random hexamer-labeled probes would be more closely related to their rates of synthesis and, therefore, their relative abundance in the cell. This conclusion is supported by the fact that a positive correlation between mRNA and protein abundance is observed with the random hexamer but not with the ORF-specific primer data (see below).

FIG. 2. Scatter plot showing the relationship between hybridization signal intensities with 33P-labeled cDNA probes generated from genomic DNA using random hexamer oligonucleotides (A) or 3′ ORF-specific DNA primers (B).
Since the microarray hybridization data measured with random hexamer-labeled probes obtained from genomic DNA showed a correlation with ORF length, we considered the possibility of correcting the expression data obtained with random hexamer-labeled cDNA probes obtained from RNA for ORF length. However, because the less than 10-fold variance in ORF lengths contributes less than one percent of the four logs of variance in expression level measurements obtained from the RNA derived probes, the expression data presented here are not corrected for differences in ORF lengths.
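The r² values quoted for Fig. 2 (0.41 and 0.004) are squared correlation coefficients between ORF length and hybridization signal. A minimal Python sketch of that calculation follows; the function name r_squared and the toy lengths and signals are hypothetical and are not taken from the array data.

```python
def r_squared(x, y):
    """Squared Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy * s_xy / (s_xx * s_yy)


# Hypothetical ORF lengths (bp) and signals, for illustration only.
orf_lengths = [300.0, 600.0, 900.0, 1200.0]
length_proportional_signal = [3.0, 6.0, 9.0, 12.0]  # idealized random priming
length_independent_signal = [5.0, 1.0, 6.0, 2.0]    # idealized 3' priming

r2_proportional = r_squared(orf_lengths, length_proportional_signal)
r2_independent = r_squared(orf_lengths, length_independent_signal)
print(round(r2_proportional, 3), round(r2_independent, 3))
```

A signal strictly proportional to length gives an r² of 1, whereas a signal unrelated to length gives a value near zero; the reported 0.41 for random hexamer-labeled genomic probes lies between these idealized extremes.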
In conclusion, these data demonstrate that random hexamer priming for cDNA probe synthesis is required for accurate measurement of gene expression levels in bacteria. This is explained largely by the fact that it is difficult to obtain a set of 4,290 unique primers that hybridize to each ORF with equal efficiency and by the fact that widely differing degradation rates (steady-state levels) can be observed for the 25-base pair region of each message complementary to each ORF-specific primer.
Replication and Appropriate Statistical Analysis Are Required for Determining the Accuracy of DNA Microarray Measurements-A basic problem that is encountered by all types of DNA microarray experiments stems from the fact that thousands of measurements are obtained from a single experiment. This means that if there is any source of experimental error, a Gaussian distribution of these measurements will be observed. For example, if 5,000 measurements are obtained from two DNA microarray experiments performed under identical experimental conditions, then, based on a standard t test distribution, 250 (5%) of the individual measurements are expected to differ sufficiently by chance alone to produce a p value less than 0.05. Now, if 5,000 measurements are obtained from two DNA microarray experiments performed under different experimental conditions and 500 differences are observed at a 95% confidence level (p < 0.05), half of these differences (250) will be false positives due to chance alone. Therefore, to interpret data from DNA microarray experiments in which thousands of measurements are obtained from a single experiment, it is necessary to employ statistical methods capable of distinguishing chance occurrences from biologically meaningful data. In this respect, an advantage of nylon DNA microarray filters is that they are relatively inexpensive and can be reused several times. It is therefore economically feasible to repeat each experiment a sufficient number of times to obtain statistically reliable data. For example, in the experiments reported here four filters spotted in duplicate with each E. coli ORF were used four times in four separate experiments, resulting in 16 measurements for each ORF for each of two different genotypes (Fig. 1).
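The arithmetic behind this example can be written out explicitly. The Python sketch below uses hypothetical function names and, like the example in the text, treats every measurement as a potential chance difference, so the false-positive share it returns is best read as an approximation.

```python
def expected_false_positives(n_measurements, alpha):
    """Number of measurements expected to pass p < alpha by chance alone."""
    return n_measurements * alpha


def false_positive_share(n_measurements, alpha, n_observed):
    """Approximate fraction of observed differences attributable to chance."""
    return expected_false_positives(n_measurements, alpha) / n_observed


# Figures taken from the example in the text.
by_chance = expected_false_positives(5000, 0.05)
share = false_positive_share(5000, 0.05, 500)
print(round(by_chance), round(share, 2))
```

With 5,000 measurements and p < 0.05, about 250 differences are expected by chance, which is half of the 500 observed differences in the example.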
To design an appropriate analysis of variance model it is necessary to know the sources of experimental errors. The principal sources of experimental error in these experiments are expected to arise from differences among filters and differences among RNA preparations. Therefore, our experimental strategy was designed to allow us to determine the reproducibility of results obtained from different filters hybridized with the same cDNA preparations or from different cDNA preparations hybridized to the same filters. To assess the reproducibility between different filters, we employed a statistical t test to compare data from each pair of filters hybridized with cDNA probes prepared from the same RNA preparation. In this case, the data for each duplicate ORF measurement on each filter were averaged, the IH100 or IH105 data for Filter 1 were compared with the data on Filter 2, and the IH100 or IH105 data for Filter 3 were similarly compared with the data on Filter 4. Of the 2,592 genes expressed, 13 are expected to exhibit p values < 0.005 (0.5% of 2,592 genes). Our analyses identified an average of 13 false positives for the same filters hybridized with cDNA preparations from strain IH100 and 3 false positives for the same filters hybridized with cDNA preparations from IH105. Therefore, since no false positives beyond those expected by chance alone result from the use of replicate filters, we conclude that no significant experimental error is contributed by differences among filters.
To ascertain the experimental error contributed by differences in RNA preparations, we employed a statistical t test to compare the data from the same filters hybridized with cDNA probes prepared from different RNA preparations. For example, the data for each duplicate ORF measurement on Filters 1 and 2 of Experiment 1 were compared with the data from Filters 1 and 2 of Experiment 2. Likewise, the data for each duplicate ORF measurement on Filters 3 and 4 of Experiment 3 were compared with the data from Filters 3 and 4 of Experiment 4. At a p value = 0.005, these comparisons revealed an average of 39 false positives between filters hybridized with cDNA preparations from IH100 and 27 false positives between filters hybridized with cDNA preparations from IH105. Thus, differences among RNA preparations account for a slightly elevated false positive rate. The number of false positives due to this error is approximately 2.5 times that expected by chance alone. However, when the data are averaged across RNA preparations, the variance contributed by differences in RNA preparations is minimized, and the number of false positives is decreased. For example, inspection of the p values obtained from a comparison of the data from the IH100 filters of experiments 1 and 3 with the IH100 filters of experiments 2 and 4 identified only 4 false positives at a p value < 0.005 and no false positives at a p value less than 0.0001. Similar results were obtained when the data from the IH105 filters were compared in this way.
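As a check on the "approximately 2.5 times" figure, the following sketch compares the false positive counts quoted above with the number expected by chance among the 2,592 expressed genes. It uses only numbers stated in the text; the variable names are illustrative.

```python
from statistics import mean

N_EXPRESSED = 2592   # genes scored as expressed
ALPHA = 0.005        # p value cutoff used for the control comparisons

expected_by_chance = N_EXPRESSED * ALPHA      # about 13 genes
observed = {"IH100": 39, "IH105": 27}         # control false positives quoted above

# Ratio of the average observed count to the count expected by chance.
excess_ratio = mean(observed.values()) / expected_by_chance
print(round(expected_by_chance, 2), round(excess_ratio, 2))
```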
To determine the differential gene expression levels between strains IH100 (IHF+) and IH105 (IHF−), the background-subtracted and normalized IH100 or IH105 measurements for each ORF from each of the four experiments were averaged. These four averaged IH100 and IH105 sets of measurements were analyzed according to the statistical methods described under "Methods and Materials." This analysis identified 23 genes that differ in expression with a p value less than 0.0001 (Table I; Fig. 3A). Since a comparable analysis of the filters hybridized with identical genotypes revealed no genes that differ with a p value < 0.0001 (see above), we can be nearly certain that these 23 genes are expressed at different levels in strains IH100 and IH105. However, given the experimental error in these experiments, it is obvious that other genes differentially expressed between these strains will fail to pass this rigorous statistical test. Nevertheless, two genes of the ilvPG::lacZYA and ilvGMEDA operons (lacY and ilvA) previously shown to be regulated by IHF do appear in this limited list (18-21, 32, 33). Therefore, if our measurements are meaningful, it is expected that the remaining genes of these two operons should exhibit similar expression levels and be similarly regulated. The data in Fig. 4, panels A and B, show that this is the case. The p values for these differences range from 0.00253 for the lacA gene to 0.165 for the ilvM gene. This suggests that reliable data can be obtained for genes that differ with p values considerably higher than 0.0001. Therefore, to facilitate the interpretation of our results, the statistical criterion for significant differences in gene expression levels was relaxed to p < 0.005. This results in the identification of 124 genes that are differentially expressed between strains IH100 and IH105 (Fig. 3B) and includes genes for which we expect an IHF effect (Table II; see Fig. 6).
However, raising the p value threshold to p < 0.005 identifies an average of four false positives in the control data sets (IH100 versus IH100 or IH105 versus IH105). Thus, for this experiment we expect four false positives in our set of 124 differentially expressed genes. This means that at this level of statistical accuracy we can be at least 97% confident of the differences we observe.
It should be emphasized that this level of global confidence (97%) is less than the local confidence of each measurement (99.5%) based on the p value (0.005) because of the false positives expected by chance alone given the large number of genes tested for expression differences from high density arrays. It should also be emphasized that increasing the p value threshold to higher levels rapidly increases the number of false positives in the control data sets (IH100 versus IH100 or IH105 versus IH105) relative to the number of genes differentially expressed at the same p value in the experimental (IH105 versus IH100) set and, therefore, decreases the confidence with which differentially expressed genes can be identified.
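The distinction between local and global confidence can be made concrete with a short sketch. It assumes, as in the text, that the number of false positives in the experimental set is estimated from the control (same-genotype) comparisons; the function names are illustrative.

```python
def local_confidence(alpha: float) -> float:
    """Confidence attached to a single measurement that passes the p < alpha cutoff."""
    return 1.0 - alpha


def global_confidence(n_called: int, n_false_positive: float) -> float:
    """Fraction of the called genes expected to be true differences."""
    return 1.0 - n_false_positive / n_called


# Values quoted in the text: 124 genes at p < 0.005, about 4 control false positives.
print(round(local_confidence(0.005), 3))      # 0.995
print(round(global_confidence(124, 4), 3))    # 0.968, i.e. roughly 97%
```

Because the number of control false positives grows quickly as the cutoff is relaxed, the global figure falls well before the local one does.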
FIG. 3. Scatter plots showing the mean of the fractional mRNA levels obtained from eight filters hybridized with 33P-labeled cDNA probes prepared from total RNA preparations extracted from E. coli K12 strains IH100 (IHF+) and IH105 (IHF−). A, the larger red dots identify 23 genes differentially expressed between strains IH100 and IH105 with p values less than 0.0001 (Table I). B, the larger red dots identify 124 genes differentially expressed between strains IH100 and IH105 with p values less than 0.005 (Table II). The dashed red lines demarcate the limits of 2-fold differences in expression levels.

There Is Little Correlation between the Fold Difference and the Accuracy of Differential Gene Expression Levels Obtained from DNA Microarray Measurements-There is a popular tendency to equate the magnitude of the fold difference between the expression levels of a gene obtained under two experimental conditions and the accuracy of those measurements. This is exemplified by the fact that the manufacturer of the DNA microarrays used in these experiments instructs the users of these arrays, "When comparing differences in expression levels between two arrays, a 2-fold difference in expression levels is considered significant." The data reported in Table II demonstrate that this need not be the case. These data illustrate that p values for small differential expression ratios can be much lower than p values for large differential expression ratios. This lack of a positive correlation between fold difference and significance is even more dramatically illustrated in the scatter plot shown in Fig. 3B. Here the expression level for each of the 2,592 expressed genes in IH100 cells is plotted against the expression level of these genes in IH105 cells (small blue dots). The 124 genes differentially expressed with a probability value based on a t test distribution less than p = 0.005 (Table II) are identified as large red dots.
The parallel lines demarcate the 2-fold boundary on either side of the mean for all measurements. 23 of the 124 differentially expressed genes (16%) with a p value < 0.005 show less than a 2-fold difference. Conversely, only 29 of the 197 genes that are more than 2-fold decreased and only 95 of the 366 genes that are increased more than 2-fold in strain IH105 can be ascertained to be differentially expressed with a 97% or greater level of confidence. Finally, a further demonstration that there is little correlation between accuracy and fold difference is provided by the observation that there is only a 25% correspondence between a list of the 100 genes with the greatest fold difference in expression levels between strains IH100 and IH105 and a list of the 100 genes with the lowest p values. Thus, the significance of differential gene expression measurements cannot be assessed simply by the magnitude of the fold difference between two experimental conditions.
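The list comparison described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used here: the gene names and values in the toy data are invented, and ranking by the absolute log2 ratio (so that increases and decreases are treated symmetrically) is an assumption.

```python
import math


def top_n_overlap(fold: dict, pvals: dict, n: int) -> float:
    """Fraction of genes shared by the n largest fold differences and the n lowest p values."""
    by_fold = sorted(fold, key=lambda g: abs(math.log2(fold[g])), reverse=True)[:n]
    by_p = sorted(pvals, key=pvals.get)[:n]
    return len(set(by_fold) & set(by_p)) / n


# Hypothetical toy data: expression ratio (IH105/IH100) and p value per gene.
toy_fold = {"a": 8.0, "b": 0.2, "c": 1.3, "d": 1.1, "e": 3.0, "f": 0.9}
toy_p = {"a": 0.2, "b": 0.0001, "c": 0.0005, "d": 0.5, "e": 0.3, "f": 0.001}

print(top_n_overlap(toy_fold, toy_p, 3))
```

In the toy data only gene "b" appears in both top-three lists, which mirrors the situation described in the text: a large ratio measured with high variance can rank far below a small ratio measured reproducibly.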
A Positive Correlation between mRNA Level and Protein Abundance Is Observed in E. coli-VanBogelen et al. (2) used metabolic labeling and high resolution two-dimensional gel electrophoresis to measure the cellular levels of 80 proteins in cells growing in the same medium used for the experiments reported here. A comparison of these protein levels with our mRNA measurements shows a good correlation (r_p = 0.67; Fig. 5). Thus, at least with this comparison with a limited number of highly expressed proteins, we observe a reasonable correspondence between cellular mRNA levels and protein abundance in E. coli. A similar correspondence between cellular mRNA levels and protein abundance has been reported for Saccharomyces cerevisiae (34). These results support our suggestion that the enzymatic clearance of mRNA degradation intermediates to free nucleotides proceeds at a relatively constant rate and that the steady-state levels of these degradation intermediates are proportional to their rates of synthesis and, therefore, their relative abundance in the cell (see above).
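For readers who wish to repeat this kind of comparison, a correlation coefficient between paired mRNA and protein measurements can be computed as sketched below. The sketch assumes that the reported coefficient is a Pearson product-moment correlation; whether the values were log-transformed first is not stated here, so no transformation is applied. The function name and the sample inputs are illustrative, not data from this work.

```python
import math


def pearson_r(x: list, y: list) -> float:
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired values (mRNA fraction, protein level) for four genes.
print(pearson_r([1.0, 2.0, 3.0, 4.0], [1.5, 1.9, 3.2, 3.8]))
```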

Reliability of Gene Expression Profile Results Observed between (IHF+) IH100 and (IHF−) IH105 Strains-To assess the reliability of the mRNA expression levels inferred from the DNA microarray experiments reported here, we examined the effects of IHF on the expression of the genes of the ilvGMEDA operon of E. coli. We have previously demonstrated that the promoter regulatory region of the ilvGMEDA operon contains two IHF-binding sites. IHF binding to a site located 92 base pairs upstream of the transcriptional start site activates in vitro and in vivo transcription initiation from the downstream promoter 3-5-fold by a DNA supercoiling-dependent mechanism (19, 20). We have also shown that IHF binding to another site in the leader region reduces in vitro and in vivo transcription through the leader-attenuator region into the structural genes of this operon about 2-fold by enhancing transcription termination at the attenuator site (32). Thus, in an IHF− strain, transcription into the attenuator is decreased about 4-fold, but transcription through the attenuator into the structural genes is increased about 2-fold, resulting in an overall decrease in the expression of the downstream genes of about 2-fold. To monitor these effects of IHF at both of its binding sites in the ilvGMEDA operon, we replaced the lac promoter regulatory region of the lac operon with a portion of the ilvPG promoter regulatory region of the ilvGMEDA operon that contains the upstream IHF site but lacks the attenuator and the downstream IHF-binding site. Therefore, in an IHF− strain, transcription from the ilvPG promoter directly into the structural genes of the lac operon is expected to be decreased about 4-fold, but transcription through the attenuator into the structural genes of the wild-type ilvGMEDA operon is expected to be decreased only 2-fold. The data in Fig. 4 show that these expected results are observed.
Transcription from the ilvPG promoter directly into the structural genes of the lac operon is decreased 4.14 ± 0.16-fold (Fig. 4A), but transcription through the attenuator into the first two structural genes of the ilvGMEDA operon is decreased only 2.52 ± 0.06-fold (Fig. 4B). Also, as expected, the transcriptional levels of the promoter-attenuator distal genes (ilvE, -D, and -A) of this operon are decreased only 1.47 ± 0.13-fold due to the activity of a previously characterized internal, IHF-independent promoter, ilvPE, preceding the ilvE structural gene (35). These results agree very well with independent transcript level measurements (32) and reporter enzyme assays (21). Furthermore, our ability to detect the effect of the internal promoter on the expression levels of the promoter distal genes of this operon demonstrates that small differences can be accurately measured employing the methods described in this work.
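The expected values follow from multiplying the independent effects of losing IHF at each site. The sketch below assumes that the two effects combine multiplicatively and uses the approximate 4-fold and 2-fold figures given in the text; the function name is illustrative.

```python
from math import prod


def net_expression_ratio(*factors: float) -> float:
    """Combine independent regulatory effects (IHF- relative to IHF+) multiplicatively."""
    return prod(factors)


INITIATION = 1 / 4    # loss of IHF activation at the upstream site: about 4-fold down
READ_THROUGH = 2.0    # loss of IHF-enhanced termination at the attenuator: about 2-fold up

lac_fusion = net_expression_ratio(INITIATION)               # fusion lacks the attenuator
ilv_operon = net_expression_ratio(INITIATION, READ_THROUGH) # wild-type ilvGMEDA

print(1 / lac_fusion, 1 / ilv_operon)  # expected fold decreases: 4.0 and 2.0
```

These expectations of 4-fold and 2-fold compare with the measured decreases of 4.14-fold and 2.52-fold reported above.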
To confirm our ability to accurately measure small differences in gene expression levels and to further verify the correspondence between our measured mRNA levels and protein levels, we assayed the activities of several enzymes in strains IH100 and IH105 and compared the ratios of these activities to the ratios of their cognate mRNA levels in these two strains. These data are presented in Table III.
Since the observations reported here accurately describe the experimentally determined effects of IHF on the expression of genes of the ilvGMEDA and ilvPG::lacZYA operons and since the differential expression levels of three of the five genes of the ilvGMEDA operon and two of the genes of the ilvPG::lacZYA operon in strains IH100 and IH105 are measured with p values less than 0.005, it is reasonable to conclude that equally accurate measures of differential expression can be obtained for other genes at this level of statistical rigor. The mean, S.D., p values, and fold change for the 124 genes differentially expressed between strains IH100 and IH105 with a p value less than 0.005 are listed according to their metabolic functions in Table II. The functions of 64 of these genes have been determined, and putative functions have been assigned to 11 of the remaining 60 genes. IHF negatively regulates 95 and positively regulates 29 of these genes.

FIG. 4. Effects of IHF on the differential expression of the genes of the ilvPG::lacZYA and ilvGMEDA operons in E. coli K12 strains IH100 and IH105. The mean ± S.D. expression levels in E. coli K12 strain IH100 (black bars, IHF+) and IH105 (white bars, IHF−) expressed as a fraction of total mRNA. Asterisks identify genes differentially expressed with p values less than 0.005.

TABLE II
Genes differentially expressed between E. coli K12 strains IH100 (IHF+) and IH105 (IHF−) with a p value less than 0.005. The data are presented as the average (Avg) and S.D. of four independent gene expression measurements expressed as a fraction of the total hybridization signal (total mRNA) on each DNA microarray filter. The p values are calculated on the basis of the t test distribution. Positive fold differences indicate increased gene expression in strain IH105. Negative fold differences indicate decreased gene expression in strain IH105.

Identification of Putative IHF-binding Sites in the Promoter Regulatory Regions of Genes Differentially Expressed in Strains IH100 and IH105-To identify those genes in Table II differentially expressed in strains IH100 and IH105 most likely as a direct consequence of IHF-mediated effects on transcription initiation, 500 base pairs upstream of the ORF for each of these genes were examined for high affinity IHF-binding sites. Because the core consensus recognition sequence for IHF-binding sites is degenerate (5′-(A/T)ATCAANNNNTTR-3′; N = any nucleotide and R = purine) and because sequence variable structural properties of the DNA duplex are important for high affinity IHF binding, these sites are difficult to identify (8). For example, a search of the E. coli chromosome for an 11 out of 13 match for the degenerate IHF core consensus sequence identified nearly 40,000 potential binding sites. It was, therefore, necessary to define more restrictive criteria for the identification of putative IHF-binding sites. We, therefore, demanded a 12 out of 13 match with the core consensus sequence and required that at least 10 out of 15 of the base pairs immediately upstream of the core sequence were AT base pairs. These are very stringent criteria that certainly identify high affinity IHF-binding sites; in fact, they exceed the criteria for many documented IHF-binding sites such as the IHF site in the leader-attenuator region of the ilvGMEDA operon. Also, IHF-binding sites that perform a DNA looping function located farther than 500 base pairs upstream of an ORF would not be identified in our search. Nevertheless, these stringent criteria identify IHF-binding sites upstream of 46 genes (operons) differentially regulated in strains IH100 and IH105 with a p value less than 0.005 (Table II). The locations of these putative IHF-binding sites as well as previously documented sites are shown in Fig. 6.
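The search criteria described above can be expressed as a short scan. This is a minimal sketch, not the program used for this work: the function names are illustrative, only the strand supplied is scanned, "upstream" is taken to mean the 15 bases on the 5′ side of the core on that strand, and the four N positions are counted as matches (so a 12 out of 13 match allows one mismatch among the nine defined positions). Each of these choices is an assumption.

```python
# IUPAC codes needed for the core consensus 5'-(A/T)ATCAANNNNTTR-3'.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "AT", "R": "AG", "N": "ACGT"}
CORE = "WATCAANNNNTTR"


def core_matches(window: str) -> int:
    """Number of positions in a 13-base window that agree with the core consensus."""
    return sum(base in IUPAC[code] for base, code in zip(window, CORE))


def find_putative_ihf_sites(seq: str, min_core: int = 12,
                            flank: int = 15, min_at: int = 10) -> list:
    """Return 0-based start positions of core matches preceded by an AT-rich flank."""
    seq = seq.upper()
    hits = []
    for start in range(flank, len(seq) - len(CORE) + 1):
        window = seq[start:start + len(CORE)]
        upstream = seq[start - flank:start]
        at_count = sum(base in "AT" for base in upstream)
        if core_matches(window) >= min_core and at_count >= min_at:
            hits.append(start)
    return hits


# Hypothetical test sequence: an AT-rich 15-mer followed by a perfect core match.
example = "ATATATATATATATA" + "AATCAAGGGGTTA" + "GCGC"
print(find_putative_ihf_sites(example))
```

Lowering `min_core` to 11 and dropping the flank requirement reproduces the less restrictive search that, as noted above, returns far too many candidates to be useful.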
Examples of Genes Only Expressed in Either Strain IH100 or Strain IH105-The single genotypic difference between the strains used for the studies reported here is a deletion in strain IH105 of the himA gene, the structural gene for the α-subunit of IHF. As expected, the data in Table IV show that no himA mRNA is detected in strain IH105, but it is detected in the himA wild-type strain IH100. Indeed, himA mRNA was detected in all four IH100 mRNA pools, but no himA mRNA was detected in any of the IH105 mRNA pools. Although no p value can be calculated when no mRNA for a gene is detected, these data suggest that the expression of a gene in all of the mRNA samples of one strain and no expression in any of the mRNA samples of the other strain likely represents a gene that is either strongly repressed by IHF or strongly dependent upon IHF for its expression.
In addition to himA, the data in Table IV show that six other genes are consistently expressed only in strain IH100 (putative IHF-dependent genes), and eight are consistently expressed only in strain IH105 (putative IHF-repressed genes). Two genes in Table IV that appear to require IHF for expression are the fimI and fimC genes. These genes are members of a large operon (fimBEAICDFGH). Although none of the genes of this operon are differentially expressed with a p value less than 0.005, they are all expressed in strain IH100 and are either nondetectable or expressed at very much lower levels in strain IH105. This operon is expressed from two documented transcription start sites located approximately 150 and 250 base pairs upstream of the first ORF of this operon, and a putative IHF site is observed near the upstream promoter (Fig. 6). This arrangement of tandem promoters and an IHF-binding site is similar to that observed in the PL and ilvGMEDA tandem promoter regions. In these cases, transcription from the upstream promoter is repressed, and transcription from the downstream promoter is activated (36). These observations demonstrate that potentially important biological information might be missed if data analysis is limited to strict statistical measures.
Dps is an abundant, nonspecific E. coli DNA-binding protein important for protection from hydrogen peroxide-induced DNA damage. In log phase cells growing in rich medium, expression of the dps gene from a σ70 promoter requires the OxyR protein (37). However, it has been shown that as cells enter the stationary phase, the dps gene is not induced by oxidative stress (even though OxyR is present), and its expression requires IHF binding to a documented site upstream of a σS promoter (Fig. 6). Our results show that during log phase growth in glucose-supplemented minimal MOPS medium in a wild-type strain (IH100) no oxyR gene expression can be detected, and as previously reported (38), the rpoS (σS) gene is expressed at an intermediate level. Under this same growth condition, dps is expressed in an IHF+ strain but is not expressed in an IHF− strain (Table IV). Thus, our data suggest that during log phase, as in stationary phase growth in minimal medium, IHF is required for low level transcription of the dps gene from a σS promoter. The criteria used in this report did not identify any putative high affinity IHF-binding sites upstream of the remaining genes in Table IV that require IHF for expression (rfaJ, nac, yfiG).
Putative IHF-binding sites were identified upstream of three of the eight genes completely repressed by IHF in strain IH100 (b1112, b3592, and b2984; Table IV and Fig. 6). One of these, b1112, is a member of what appears to be a divergently transcribed operon (Fig. 6). The location of the transcription start sites and the putative IHF-binding site in the approximately 200-base pair promoter regulatory region between the oppositely oriented b1111 and b1112 ORFs is nearly the same as the promoter and IHF site arrangement between the similarly spaced divergently transcribed cpd and cysQ genes (Fig. 6). In both cases, both genes are repressed by IHF binding in the divergently transcribed region. This pattern of gene regulation suggests that the cpd-cysQ and b1111-b1112 genes are divergently transcribed operons. Typical of other genes whose expression is affected by IHF, no functional correlation is apparent for these two sets of genes.
Examples of Direct Effects of IHF on Gene Expression Profiles in Strains IH100 and IH105-Although remarkably few IHF-binding sites that affect gene expression during exponential growth in minimal medium have been documented, several of these genes appear in Tables II and IV. For example, it is known that the manganese superoxide dismutase sodA gene is repressed by IHF binding to four sites (Fig. 6) in the promoter region (39, 40). However, no evidence for the regulation of the iron superoxide dismutase sodB gene has been reported. Our data show that the expression of both sodA and sodB is increased in strain IH105 (Table II). The ndh gene, which encodes NADH dehydrogenase II, is an example of a gene that is known to be regulated by both ArcA and IHF. A direct role for ndh repression by IHF binding at three sites in the promoter region (Fig. 6), consistent with its increased expression in strain IH105 (Table II), has been reported by Green et al. (41).
Freundlich et al. (13) present evidence that IHF mutants exhibit increased expression and altered osmoregulation of OmpF, a major E. coli outer membrane protein. They have identified upstream IHF-binding sites centered at base pair positions −179 and −61 of the ompF promoter regulatory region. They have also reported that the addition of IHF to a purified in vitro transcription system inhibits ompF transcription by altering how OmpR, a positive activator required for ompF expression, interacts with the ompF promoter. In contrast, our results suggest that IHF acts as an activator of ompF expression (Table II).
Two IHF-binding sites have been documented upstream of the promoter for the himD gene. Aviv et al. (42) show that expression of the monocistronic himD operon, which encodes the structural gene for the β subunit of IHF, is negatively autoregulated by an intact, heterodimeric IHF, and our results show that the expression of the himD gene is increased in the IHF− strain IH105 (Table II). However, since this derepression is only 1.7-fold, it is possible that in the absence of the himA-encoded α-subunit for the formation of himA-himD encoded αβ-heterodimers, himD-encoded ββ-homodimers can function to autoregulate himD expression. Indeed, several examples of the in vivo formation of ββ-homodimers functionally competent to recognize and bind to natural IHF-binding sites have been reported (43-46).
FIG. 6. Genes differentially regulated in E. coli K12 strains IH100 and IH105 with a p value less than 0.005 that contain documented or putative high affinity IHF-binding sites. ORFs of genes (operons) are represented as bars. The 500 base pairs upstream of each ORF is represented by a straight line; tick marks are spaced at 100-base pair intervals. Open bars denote predicted genes. Gray bars denote documented genes. Open arrows identify the position of predicted transcriptional start sites. Black arrows identify the position of documented transcriptional start sites. Open boxes identify the position of predicted IHF-binding sites. Black boxes identify the position of documented IHF-binding sites. The level of expression of each gene in strain IH105 relative to its level of expression in strain IH100 is shown above each gene. The operon organizations, the positions of the transcriptional start sites, and the documented IHF-binding sites were obtained either from GenBank or RegulonDB.

In addition to the above examples with documented IHF-binding sites, reporter gene and other assays have been used to identify other IHF-regulated genes. Some of these genes appear in Table II, and we have identified putative IHF-binding sites for several of them (Fig. 6). For example, a well known regulatory function for IHF involves its role in conjugal transfer of F plasmid DNA by affecting transcription of the transfer (tra) operon (47). Indeed, our results show that all three of the chromosomally encoded tra8_1, tra8_2, and tra8_3 genes contained in the chromosome of our strains are repressed by IHF, and putative IHF-binding sites are observed in the promoter region of each gene. The data in Table II show that IHF affects the expression of one known global regulatory gene (arcA), several operon-specific regulatory proteins, and 11 genes encoding putative regulatory proteins.
This suggests that IHF might indirectly affect the expression of many genes via its effect on the expression of regulatory genes. At this time, we are aware of a previously demonstrated direct effect of IHF on the regulation of only one of these regulatory genes, arcA. ArcA is a global regulator protein for genes involved in anaerobic carbon metabolism (48). Reporter enzyme assays and nested deletion experiments have suggested the presence of an IHF site in the arcA promoter regulatory region and shown that arcA gene expression is elevated more than 3-fold in an IHF− E. coli strain growing in glucose minimal medium under aerobic conditions.⁵ We have identified an IHF-binding site in the arcA promoter regulatory region, and our data show a 3.9-fold increase in arcA mRNA expression in strain IH105 growing under similar conditions. Gunsalus and co-workers (49-52) also use reporter enzyme assays to examine the expression of several tricarboxylic acid cycle (fumA, gltA, icdA, mdh, mur, and sucA) and respiratory genes (atp, cydA, and cyoA) in IHF+ and IHF− strains. In each case the small mRNA expression level differences (often less than 2-fold) between our IHF+ and IHF− strains agree well with their reporter enzyme assay data. However, because the expression of each of these genes is also affected by ArcA protein, it is unclear whether or not these small IHF effects are direct or indirect. On the other hand, ndh is an example of a gene that is known to be regulated by both ArcA and IHF. In this case, a direct role for ndh repression by IHF, consistent with its increased expression in strain IH105 (Table II), has been reported by Green et al. (41).
Other studies in Salmonella typhimurium establish that the arcA gene product activates the expression of the cob operon required for cobalamin synthesis (53). Therefore, it is again unclear whether or not the increases in the cobU and cobT genes reported in Table II are the result of direct or indirect effects of IHF. However, the fact that we find a putative IHF site near the promoter of the cob operon suggests a direct role for IHF in the regulation of this operon.
The remaining genes in Fig. 6 are genes identified by this work that are differentially expressed between strains IH100 and IH105 with a p value less than 0.005 and contain putative high affinity IHF-binding sites. The remaining genes in Table II that do not contain either documented or putative IHF-binding sites may represent genes that are indirectly affected by IHF.