 |
INTRODUCTION |
During the last 50 years, a great deal of knowledge about the
regulation of gene expression in Escherichia coli has been
obtained. We now know that the expression of genetic information is
regulated at three hierarchical levels: global control of basal level
gene expression by chromosome structure, control of regulons and
stimulons by global regulatory proteins, and operon-specific controls
(1, 2). At the most general level, the expression of all genes is
regulated by DNA supercoiling-dependent mechanisms that
affect the topology of the entire chromosome (3). At the next level, large groups of genes are regulated by abundant regulatory proteins with rather degenerate binding site specificity that, in cooperation with operon-specific controls, regulate often-overlapping groups of
metabolically related operons, called regulons and stimulons, in
response to environmental or metabolic signals. At the most basic
level, individual genes or operons are regulated by less abundant
proteins that bind in a site-specific manner to one or a few sites to
regulate single genes or operons. Isolated examples of each level of
control have been described. However, the definition of these
hierarchical control levels in a depth sufficient to understand genetic
regulatory networks on a global scale, all the way from specific
circuits up to the complete regulatory network of the cell, remains to
be elucidated (4). Before we can infer and model these regulatory
networks, individual components at each hierarchical level must be
identified. In other words, a more complete definition of the genes of
specific regulons and stimulons must be obtained. It is now possible to
obtain much of this information using high-throughput technologies such
as DNA microarrays.
The purpose of the work presented here is to identify the network of
genes that are differentially regulated by the global E. coli regulatory protein, leucine-responsive regulatory protein (Lrp),1 during steady state
growth in a glucose supplemented minimal salts medium. Lrp is a
DNA-binding protein that has been reported to affect the expression of
approximately 55 genes.2 In most
cases, Lrp has been reported to activate operons that encode genes for
biosynthetic enzymes and repress operons that encode genes for
catabolic enzymes (5, 6). The intermediary metabolite,
L-leucine, is required for the binding of Lrp at some of
its DNA target sites; however, at other sites L-leucine
inhibits DNA binding, and at still other sites it exerts no effect at
all. Although the physiological purpose of Lrp-mediated gene regulation remains unclear, it has been suggested that it might function to
coordinate cellular metabolism with the nutritional state of the
environment by monitoring the levels of free L-leucine in the cell. The experiments reported here were carried out in the absence
of exogenous L-leucine.
Although many data analysis techniques have been applied to DNA
microarray data, this field is still evolving and has not yet reached a
level of maturity. Therefore, much of the emphasis of the work reported
here is directed toward the assessment of methods to identify
differentially expressed genes with a high level of confidence. In
particular, we apply a Bayesian statistical framework to derive a
regularized estimate of the standard deviation of the level of
expression of each gene in each condition based on a limited number of
replications, and an algorithm to compute a posterior estimate of
differential expression for each gene to estimate the global false
positive rate specific for each DNA microarray experiment.
 |
MATERIALS AND METHODS |
Chemicals and Reagents--
Avian myeloblastosis virus reverse
transcriptase, ATP
S, glycogen, and Sephadex G-25 Quickspin columns
were obtained from Roche Molecular Biochemicals. Ribonuclease inhibitor
III was from Panvera/Takara. Ultrapure deoxynucleoside triphosphates
and DNase I were from Amersham Biosciences. Random hexamer
oligonucleotides and T4 polynucleotide kinase were from New England
Biolabs. [
33P]dCTP (2-3000 Ci/mmol) was from
PerkinElmer Life Sciences. DNA filter arrays (Panorama E. coli gene arrays) were from Sigma-Genosys Biotechnologies.
DNA-free kit and 5 M NaCl RNase-free, DNase-free solution
were from Ambion, Inc. 16 S rRNA-specific primers, 23 S rRNA-specific
primers, and Biotin-Oligo 948 (high performance liquid
chromatography-purified) oligonucleotides were from Operon. MMLV
reverse transcriptase, dithiothreitol, and ribonuclease H (RNase H)
were from Epicentre Technologies. RNeasy total RNA isolation kit and
RNA/DNA mini column kit were from Qiagen. Polyethylene oxide-iodoacetyl-biotin, ImmunoPure NeutrAvidin,
streptavidin, and 10% Tween 20 were from Pierce. Novex XCell
SureLockTM MiniCell and 4-20% TBE gel were from
Invitrogen. 5× sucrose gel loading dye was from Amresco. SYBR Gold and
R-phycoerythrin streptavidin were from Molecular Probes. Acetylated
bovine serum albumin (BSA) solution and phosphate-buffered saline (pH
7.2) were from Invitrogen. All other chemicals were obtained from Sigma.
Bacteria Strains and Growth Conditions--
Strain IH-G2490
(ilvPG::lacZYA) was
constructed by ligating a 515-bp EcoRI-BamHI DNA
fragment containing a 494-bp ilvGMEDA-derived HinFI fragment (base pair position
245 to +249) into the
EcoRI and BamHI sites of the
lacZ-truncated pRS551
(yielding the reporter plasmid
pRSG2490) and integrating this reporter plasmid construct into the
bacterial chromosome of the polA-deficient strain, NO3434, as described previously (7). An isogenic lrp derivative of strain IH-G2490 was created by generalized P1 transduction of the
lrp-35::Tn10 allele into this strain according to
the methods of Miller (8) to yield strain IH-G2491
(ilvPG::lacZYA,
lrp::Tn10). The genes of the chromosomal
lac operon are transcribed from the ilvPG promoter in both strains, which is
repressed by the binding of Lrp in the leader-attenuator region
upstream of the ilvG translational start site. Cells were
grown in 50 ml of MOPS medium (9) containing 0.4% glucose in 250-ml
Erlenmeyer flasks at 37 °C as described previously (10).
Isolation of Total RNA--
Total RNA was isolated from cells at
an A600 of 0.5-0.6. Ten-ml samples of
log phase cells were pipetted directly into 10 ml of boiling lysis
buffer (1% SDS, 0.1 M NaCl, 8 mM EDTA) and mixed at 100 °C for 1.5 min. These samples were transferred to 125-ml Erlenmeyer flasks, mixed with an equal volume of hot acid phenol
(pH 4.3), and shaken vigorously for 6 min at 64 °C. After centrifugation, the aqueous phase was transferred to a fresh Erlenmeyer flask and the hot acid phenol extraction procedure was repeated. The
second aqueous phase was extracted with phenol-chloroform-isoamyl alcohol (25:24:1; pH 4.3) at room temperature, and twice with chloroform-isoamyl alcohol (24:1). Total RNA was precipitated with two
volumes of ethanol in 0.3 M NaOAc (pH 5.3), washed with 70% ethanol, and redissolved in a 10 mM Tris, 1 mM EDTA solution (pH 8.0). Residual genomic DNA was removed
with the DNA-free kit of Ambion Inc. according to the instructions from
the manufacturer. The RNA concentration was determined by absorption at
260 nm. In all cases, independent 10-ml samples from three separate
cultures were processed in parallel.
cDNA Synthesis and Target Labeling Conditions for the Nylon
Array Experiments--
For random hexamer-primed cDNA synthesis,
20 µg of total RNA and 37.5 ng of random hexamer primers were heated
at 70 °C for 3 min and quickly cooled on ice. cDNA synthesis was
performed at 42 °C for 3 h in a 60-µl reaction mixture
containing: RNA and primer mixture; reverse transcriptase buffer
(Roche); 1 mM amounts each of dATP, dGTP, and dTTP; 50 µCi of [
-33P]dCTP; 20 units of ribonuclease
inhibitor III; and 4 µl (88 units) of avian myeloblastosis virus
reverse transcriptase. Labeled cDNA targets were separated from
unincorporated nucleotides on Sephadex G-25 Quickspin columns.
mRNA Enrichment and Target Labeling Conditions for the
Affymetrix GeneChipTM Experiments--
To enrich the
proportion of mRNA in the total RNA preparation, 300 µg of total
RNA from IH-G2490 (lrp+) or IH-G2491
(lrp
) was prepared as described above. Each
300-µg total RNA preparation was split into 12 aliquots to increase
the efficiency of the enrichment procedure. All reactions were
performed in PCR tubes in a thermocycler. For each reaction, 25 µg of
total RNA were mixed with 70 pmol of a rRNA-specific primer mix in a
final volume of 30 µl. Each specific primer mix included three
specific primers for 16 S rRNA (5'-CCTACGGTTACCTTGTT-3',
5'-TTAACCTTGCGGCCGTACTC-3', and 5'-TCCGATTAACGCTTGCACCC-3') and five
specific primers for 23 S rRNA (5'-cctcacggttcattagt-3', 5'-CTATAGTAAAGGTTCACGGG-3', 5'-TCGTCATCACGCCTCAGCCT-3',
5'-TCCCACATCGTTTCCCAC-3', and 5'-CATGGAAAACATATTACC-3'). This
mixture was heated to 70 °C for 5 min and quickly cooled to 4 °C.
10 µl of 10× MMLV reverse transcriptase buffer (0.5 M
Tris-HCl (pH 8.3), 0.1 M MgCl2, and 0.75 M KCl), 5 µl of 10 mM dithiothreitol, 2 µl
of 25 mM dNTPs mix, 3.5 µl of 20 units/µl SuperRNasin,
6 µl of 50 units/µl MMLV reverse transcriptase, and water were
added to each tube to a final volume of 100 µl. The reactions were
incubated at 42 °C for 25 min, and incubation was continued at
45 °C for 20 min for cDNA synthesis. To remove the rRNA moiety
from the rRNA/cDNA hybrid, 5 µl of 10 units/µl RNase H was
added and the mixture was incubated at 37 °C for 45 min. RNase H was
inactivated by heating at 65 °C for 5 min. Newly synthesized
cDNA was removed by incubation with 4 µl of 2 units/µl DNase I
and 1.2 µl of 20 units/µl SuperRNasin at 37 °C for 2 h.
Four reactions were combined for RNA cleanup with a single Qiagen
RNeasy mini column. The quantity of enriched mRNA was measured by
absorbance at 260 nm. A typical yield is 10-20 µg of RNA from 300 µg of total RNA constituting a 10-20-fold enrichment of mRNA to rRNA.
For the RNA fragmentation, a maximum of 20 µg of RNA was added to a
PCR tube containing 10 µl of 10× NEB buffer for T4 polynucleotide kinase in a final volume of 88 µl. The tube was incubated at 95 °C
for 30 min and cooled to 4 °C.
For the RNA 5'-thiolation and biotin-labeling reaction, 2 µl of 5 mM ATP
S and 10 µl of 10 units/µl T4 polynucleotide
kinase were incubated with the fragmented RNA at 37 °C for 50 min.
The reaction was inactivated by heating to 65 °C for 10 min and
cooled to 4 °C. Excess ATP
S was removed by ethanol precipitation.
Fragmented thiolated RNA was collected by centrifugation in the
presence of glycogen (0.25 µg/µl) and resuspended in 90 µl of
distilled water (dH2O). 6 µl of 500 mM
MOPS (pH 7.5) and 4.0 µl of 50 mM polyethylene
oxide-iodoacetyl-biotin were added to the fragmented thiolated
RNA and incubated at 37 °C for 1 h. The biotin-labeled RNA was
isolated by ethanol precipitation, washed twice with 70% ethanol, and
dried and dissolved in 20-30 µl of molecular biology grade water.
The quantity of the biotin-labeled RNA was measured by absorbance at
260 nm. The total yield for the entire procedure is typically 2-4 µg
of biotin-labeled RNA from 300 µg of total RNA. The efficiency of RNA
fragmentation and biotin labeling can be monitored with a gel shift
assay where the biotin-labeled RNA is pre-incubated with avidin prior
to electrophoresis. Biotin-labeled RNA is retarded during
electrophoresis because of the avidin-biotin interaction. The position
of the RNA in the gel addresses the fragmentation efficiency. The
amount of shifted RNA indicates the efficiency of the biotin labeling.
Inefficiencies in either of these parameters should be addressed before
proceeding to the hybridization step.
Hybridization to Nylon Filters--
The nylon filters were
soaked in 2× SSPE (20× SSPE contains 3 M NaCl, 0.2 M NaH2PO4, and 25 mM
EDTA) for 10 min and prehybridized in 10 ml of prehybridization
solution (5× SSPE, 2% SDS), 1× Denhardt's solution (50×
Denhardt's solution contains 5 g of Ficoll, 5 g of
polyvinylpyrrolidone, 5 g of bovine serum albumin, and
H2O to 500 ml), and 0.1 mg/ml sheared herring sperm DNA)
for at least 1 h at 65 °C. 5-7 × 107 cpm of
cDNA targets in 500 µl of prehybridization solution were heated
at 95 °C for 10 min, rapidly cooled on ice, and added to 5.5 ml of
prehybridization solution. The prehybridization solution was removed
and replaced with the hybridization solution. Hybridization was carried
out for 15-18 h at 65 °C. Following hybridization each filter was
rinsed with 50 ml of 0.5× SSPE containing 0.2% SDS at room
temperature for 3 min, followed by three washes in the same wash
solution at 65 °C for 20 min each. The filters were partially air
dried, wrapped in Saran Wrap, and exposed to a phosphor screen for
15-30 h. Filters were stripped by microwaving at 30% maximal power
(1400 watts) in 500 ml of 10 mM Tris solution (pH 8.0)
containing 1 mM EDTA and 1% SDS for 20 min. Stripped
filters were wrapped in Saran Wrap and stored in the presence of damp paper towels in sealed plastic bags at 4 °C.
Hybridization to Affymetrix GeneChips--
For hybridization of
biotinylated RNA targets to the Affymetrix GeneChips, 2-4 µg of
fragmented biotin-labeled RNA of IH-G2490 (lrp+)
and IH-G2491 (lrp
) were used for each
GeneChip. The hybridization solution for each array was prepared with
100 µl of 2× MES hybridization buffer (200 mM MES, 2 M NaCl, 40 mM EDTA, and 0.02% Tween 20), 1 µl of 100 nM Biotin-Oligo 948 (5'-biotin-GTCAAGATGCTACCGTTCAG-3'), 2 µl of 10 mg/ml herring sperm
DNA, 2 µl of 50 mg/ml BSA, and 2-4 µg of fragmented biotin-labeled
RNA and brought final volume to 200 µl with molecular biology grade
water. The GeneChip arrays were equilibrated to room temperature
immediately before use. The hybridization solution prepared above was
added to each GeneChip and incubated in a GeneChip hybridization oven
(Affymetrix) at 45 °C for 16 h at a rotation rate of 60 rpm.
Following hybridization, the stain and wash procedures were carried out
in an Affymetrix GeneChip Fluidics Station 400 using the ProKGE-WS2
fluidics script to run the machine. Streptavidin solution mix (300 µl
of 2× MES stain buffer, 24 µl of 50 mg/ml BSA, 6 µl of 1 mg/ml
streptavidin, and 270 µl of dH2O), antibody solution (300 µl 2× MES stain buffer, 24 µl of 50 mg/ml BSA, 6 µl of 10 mg/ml
normal goat IgG, 6 µl of 0.5 mg/ml biotin anti-streptavidin, and 264 µl of dH2O) and SAPE solution (300 µl of 2× MES stain buffer, 24 µl of 50 mg/ml BSA, 6 µl of 1 mg/ml
streptavidin-phycoerythrin, and 270 µl of dH2O) were
prepared in amber tubes for the staining of each probe array. After
hybridization, the hybridization solution was removed and kept at
4 °C. Each GeneChipTM was filled with 300 µl of
nonstringent wash buffer (6× SSPE, 0.01% Tween 20, 0.005% Antifoam).
The GeneChips were inserted into the fluidics station, and the
ProKGE-WS2 protocol was selected to control the staining and washing of
the probe arrays. After the procedure was complete, the GeneChips were
removed from the fluidics station and checked for large bubbles or air
pockets before scanning. The buffer in the GeneChips were drained and refilled with nonstringent buffer if bubbles were present.
Experimental Design for Nylon Filter DNA Array
Experiments--
The experimental design for the nylon filter DNA
array experiments reported here is diagrammed in Fig.
1. In experiment 1, filters 1 and 2 were
hybridized with 33P-labeled, random hexamer-generated
cDNA targets complementary to each of three RNA preparations (RNA
1-3) obtained from the cells of three individual cultures of the
lrp+ strain (IH-G2490). These three
33P-labeled cDNA target preparations were pooled prior
to hybridization to the full-length ORF probes on the filters
(experiment 1). Following PhosphorImager analysis, these filters were
stripped and again hybridized with pooled, 33P-labeled
cDNA targets complementary to each of another three independently
prepared RNA preparations (RNA 1-3) from the
lrp
(IH-G2491) (experiment 1). This procedure
was repeated two more times with filters 1 and 2 using two more
independently prepared pools of cDNA targets (experiment 2, RNA
4-6). Another set of filters, filters 3 and 4, were used for
experiments 3 and 4 as described for experiments 1 and 2. This protocol
results in duplicate filter data for four experiments performed with
cDNA targets complementary to four independent prepared sets of
pooled RNA. Thus, because each filter contains duplicate spots for each
ORF and duplicate filters were used for each experiment, four
measurements for each ORF from each experiment were obtained. These
four measurements for each experiment were averaged for further
statistical analysis.

View larger version (24K):
[in this window]
[in a new window]
|
Fig. 1.
Experimental design for nylon filter DNA
array experiments. See "Materials and Methods" for
description.
|
|
Experimental Design for Affymetrix GeneChip Experiments--
The
experimental design for the Affymetrix GeneChip experiments reported
here is diagrammed in Fig. 2. The same 24 total RNA preparations used for the nylon filter experiments were
pooled into sets of 3 and used for the preparation of biotin-labeled RNA targets for hybridization to Affymetrix GeneChips. For
experiments 1-4, four GeneChips were hybridized with biotin-labeled
RNA pools 1-3, 4-6, 7-9, and 10-12 prepared from
lrp+ cells, and four GeneChips were hybridized
with biotin-labeled RNA pools 1-3, 4-6, 7-9, and 10-12 prepared
from lrp
cells, respectively. One average
difference measurement for each gene probe set on each GeneChip was
obtained for subsequent data processing and analysis.

View larger version (16K):
[in this window]
[in a new window]
|
Fig. 2.
Experimental design of the Affymetrix
GeneChip experiments. See "Materials and Methods" for
description.
|
|
Data Acquisition from the Nylon Filter DNA Array--
A
commercial software package obtained from Research Imaging Inc. (DNA
ArrayVision) was used to grid the 16-bit image file obtained from the
PhosphorImager, to record the pixel density of each of the 18,432 addresses on each filter, and to perform the background subtractions.
8,580 of the addresses on each filter are spotted with duplicate copies
of each of the 4,290 E. coli ORFs. The remaining 9,852 empty
addresses were used for background measurements. Because the
backgrounds were quite constant, a global average background
measurement was subtracted from each experimental measurement, although
local background calculations are possible. Greater than 4 logs of
linearity for the PhosphorImager-derived data were observed.
Data Acquisition of Affymetrix GeneChips--
Each
GeneChip array was scanned twice with an HP GeneArray confocal laser at
a 3 µM resolution, and the intensities at each perfect
match (PM) and mismatch (MM) probe cell from both scans were averaged
and saved as a *.DAT file. The average intensity of each GeneChip was
globally scaled to 2500 and saved as a *.CEL. These probe pair
measurements for each probe set were used for subsequent data
processing and statistical analysis.
Model-based Oligonucleotide GeneChip Analysis--
Gene
expression values from Affymetrix GeneChips are based on the average
difference (AD) between hybridization signals of PM and MM
oligonucleotide probe sets for each gene as described in the expression
analysis technical manual from Affymetrix. The AD value of each probe
set is calculated as AD =
(PM-MM)/number of probe pairs.
Algorithms incorporated into Affymetrix software remove probe pairs
that are out of a given range when calculating AD values for each probe
set. In this process, the mean and standard deviation are calculated
for intensity differences (PM
MM) across the entire probe set
(excluding the highest and lowest values), and values within a set
number of standard deviations (3 as default) are not included in the
calculation. The advantage is that this process minimizes the variance
introduced by experimental or biological error by removing the outliers
present in each probe set. The disadvantage is this that this process
does not always remove the same probe pairs for the calculation of the
AD values among GeneChips. This can lead to the misinterpretation of
the gene expression profiles obtained from GeneChip experiments. To
alleviate this problem, a model-based method incorporated into a
program called dChip has been described by Li and Wang (11). This
method maintains constant probe pair set identities across all
GeneChips while excluding outliers because of cross-hybridization,
contamination during hybridization, or manufacturing defects that
affect probe set measurements. For all of the GeneChip experiments
reported here, each probe pair set from the *.CEL files was modeled by the dChip software prior to statistical analysis.
Statistical Methods--
As described above, the experimental
design employed in this study consists of 33P-labeled
cDNA target preparations for each of two genotypes hybridized to
nylon filters, or three biotinylated mRNA target preparations hybridized to Affymetrix GeneChips. The designs for these experiments are depicted in Figs. 1 and 2. For each measurement, a background subtracted estimate of expression level for each gene was obtained and
scaled to total counts by dividing each individual gene expression value by the total of all values on the filter or GeneChip. Thus, each
normalized gene level is expressed as a fraction of the total mRNA
hybridized to each DNA array.
For any given measurement, a value greater than zero (indicating an
expression level) or a zero (indicating an expression level lower than
background) is obtained. Only those genes exhibiting an expression
level greater than zero in all experiments were used for statistical
analysis. Gene measurements containing zero expression values were
set aside. Among this set of genes, those with zero
expression values for all measurements in one genotype, and all values
greater than zero for all measurements of another genotype for each
experiment were identified. The significance of these results was
analyzed by ranking these genes in ascending order according to their
coefficients of variance of the four greater than zero measurements.
The remaining genes were analyzed both by a simple t test
and a regularized t test based on a Bayesian statistical
framework described under "Results and Discussion."
Data Accession--
All of the raw and processed data for the
experimental results reported here are available in tabular format as
supplemental data in the on-line version of this article.
 |
RESULTS AND DISCUSSION |
An ad Hoc Method for the Estimation of Global False Positive
Levels--
To interpret the results of a high dimensional DNA array
experiment, it is necessary to determine the global false positive level inherent in the data set being analyzed. The global false positive level reflects all sources of experimental and biological variation inherent in a DNA array experiment. The basic idea is to
infer the false positive level in the control versus
treatment situation from the false positive level observed with the
control versus control (and/or treatment versus,
treatment) comparison. With this information, a global level of
confidence can be calculated for differentially expressed genes
measured at any given statistical significance level. For example,
consider an experiment comparing the gene expression profiles of two
genotypes, where an average of 10 genes are observed to be
differentially expressed with a p value less than 0.0001 when gene expression profiles from one genotype are compared with data
of the same genotype (e.g. lrp+
versus lrp+ or
lrp
versus
lrp
). Because no differential expression is
expected in these comparisons, these 10 genes are clearly false
positives generated by chance occurrences driven by experimental errors
and biological variance. Now, if 100 genes are differentially expressed
with a p value less than 0.0001 when the data from one
genotype (lrp+) are compared with the data from
the other genotype (lrp
), it is reasonable to
infer that we can be only 90% confident that the differential
expression of any one of these 100 genes is biologically meaningful
because 10 false positives are expected from this data set. This
example demonstrates that, although the confidence level based on the
measurement for an individual gene may exceed 99.99% for two treatment
conditions (local confidence of 0.0001), the confidence that this gene
is differentially expressed might be only 90% (global confidence of
0.9). This example defines an ad hoc method of comparing
control to control data to derive an estimate of an experiment-wide
false positive level.
We applied this ad hoc method for the estimation
of false positive levels of the experiments described here by averaging
the four measurements for each gene from the duplicate control filters of each experiment hybridized with labeled targets from the control strain IH-G2490 (lrp+) and comparing these
averaged values of control data from experiments 1 and 3 to the
averaged values of control data from experiments 2 and 4 (Fig. 1). In
another analysis, we compared control data from experiments 1 and 4 to
the averaged values of control data from experiments 2 and 3. Equivalent comparisons were performed with filters hybridized with
labeled targets from the experimental strain (IH-G2491
(lrp
)). These particular two-by-two (control
versus control or experimental versus
experimental) comparisons were chosen because they average across
experimental errors and biological differences both among filters and
RNA preparations. The results of a simple t test analysis of
these data were ranked in ascending order of the p values
for each gene measurement based on the t test distribution.
The results of these statistical analyses are shown in Table
I.
The data in Table I show that, among the control versus
control or experimental versus experimental comparisons, no
genes exhibited a p value less than 0.0001. However, an
examination of the p values observed when the control data
were compared with the experimental data shows that 12 genes were
differentially expressed with a p value less than 0.0001. Thus, we can be fairly certain that these 12 genes are differentially
expressed because of biological reasons and not by chance occurrences
driven by experimental error and biological variance. On the other
hand, we know from the literature that more than 12 genes are regulated by Lrp (5, 6). This demonstrates that, given the experimental errors
inherent in this experiment, the differentially expressed levels of most genes cannot pass this stringent statistical test. Therefore, to identify other differentially expressed genes, we must
lower the stringency of our statistical criterion. The data in Table I
show that, as the p value is raised to 0.005, we observe an
additional 122 genes that are differentially expressed at this threshold level. At the same time, raising the statistical threshold to
0.005 reveals an average of 3.75 genes that appear differentially expressed with a p value equal to or less than 0.005 when
the control or experimental data sets are compared with themselves. This means that, given this complete data set from four replicate experiments, we expect at least 3.75 false positives among the 134 genes differentially expressed with a p value equal to or less than 0.005. Therefore, our global confidence in the identification of any one of these 134 genes as differentially expressed genes is
estimated to be 97%.
It should be emphasized that relaxing the p value threshold
rapidly increases the average number of false positives in the control
(lrp+ versus
lrp+ or lrp
versus lrp
) data sets relative to
the number of genes differentially expressed at the same p
value in the experimental (lrp+
versus lrp
) data set and,
therefore, decreases the confidence with which differentially expressed
genes can be identified.
Improved Statistical Inference from DNA Array Data Using a Bayesian
Statistical Framework--
A simple t test evaluates the
distance between the means of two groups normalized in terms of the
within-group standard deviations. The result is that large differences
between genotypes for any given ORF will be declared nonsignificant if
the expression level of that ORF is unreplicable within experimental
treatments. Conversely, small differences in expression
will be determined to be statistically significant for a given ORF if
expression levels for that ORF are replicable within treatments. In
short, the t test statistic is constructed by scaling the
difference in gene expression levels between genotypes relative to the
observed variances within genotypes. p values based on the
t test statistic range from 1.0 for gene expression levels
with identical values associated with the null hypothesis to very small
p values for differential gene expression levels that are
highly significant.
In a perfect world, all DNA microarray experiments would be highly
replicated. Such replication would allow accurate estimates of the
variance within experimental treatments to be obtained, and the
t test would perform well, i.e. the variance for
each gene measurement would be based on many measurements for that gene. However, DNA microarray experiments are expensive and
time-consuming to carry out. As a result, the level of replication
within experimental treatments is often low. This results in poor
estimates of variance and a correspondingly poor performance of the
t test itself. On the other hand, we have shown that the
confidence in the interpretation of DNA microarray data with a low
number of replicates can be improved by using a Bayesian statistical
approach that incorporates information of within treatment measurements
(12, 13). This results in a more consistent set of differentially
expressed genes identified with fewer replicates. The Bayesian approach
is based on the observation that genes of similar expression levels
exhibit similar variance. Thus, more robust estimates of the variance of a gene can be derived by pooling neighboring genes with comparable expression levels. For the analysis of the data reported here, we
ranked the mean gene expression levels of the replicate experiments in
ascending order, used a sliding window of 101 genes, and assigned the
average standard deviation of the 50 genes ranked below and above each
gene as the background standard deviation for that gene. The variance
of any gene within any given treatment then can be estimated by the
weighted average of the treatment-specific background variance and the
treatment-specific empirical variance across experimental replicates.
In the Bayesian approach employed in this study, the weight
given to the within experiment gene variance estimate is a function of
the number of experimental replicates. This leads to the desirable
property that the Bayesian approach employing such a regularized
t test converges on the same set of differentially expressed
genes as the simple t test but with fewer replicates
(12).
A comparison of the results of statistical analyses employing a simple
t test and a regularized t test is shown in Table
II. Here, the simple ad hoc
method of comparing controls to controls was used to demonstrate that
the number of false positives expected at a given p value is
lower when the Bayesian statistical framework is employed. For example,
only 2 false positives are expected at a p value threshold
less than 0.005 with the Bayesian regularization, whereas 3.75 false
positives are expected at this same p value threshold with
the t test alone. At the same time, 188 differentially expressed genes with a p value less than 0.005 are observed
with the regularized t test, whereas only 134 genes are
identified at this same threshold with the simple t test
(Table II). Thus, more genes are identified with a lower false positive
level and a higher global confidence level. In other words,
complementing the empirical variance of the four experimental
measurements for each gene with the corresponding background variance
within an experiment improves our confidence in the identification of
differentially expressed genes and the number of genes that can be
identified at a given p value threshold based on a
t test distribution.
Although the data in Table II show that the Bayesian statistical
approach using a regularized t test identifies more genes with a higher level of global confidence than the simple t
test, the natural question that arises is whether these genes are true positives, i.e. whether these are Lrp-regulated genes. This
question is addressed by the data shown in Fig.
3. For example, of the 44 genes
differentially expressed between lrp+ and
lrp
strains with a p value less
than 0.001 identified by a simple t test, 10 are known to be
Lrp-regulated (Table III). However, among
the 39 genes differentially expressed between
lrp+ and lrp
strains
with a p value less than 0.0001 identified by the Bayesian approach, 17 are known to be Lrp-regulated (Table
IV).

View larger version (23K):
[in this window]
[in a new window]
|
Fig. 3.
Scatter plot showing the mean of the
fractional mRNA levels obtained from eight filters hybridized with
33P-labeled cDNA targets prepared from three pooled RNA
preparations extracted from Escherichia coli K12
strains IH-G2490 (lrp+) and IH-G2491
(lrp ). A, the
larger black dots identify 100 genes
differentially expressed between strains IH-G2490 and IH-G2491 with
p values less than 0.0034 based on a simple t
test distribution. The circled black
dots identify genes known to be regulated by Lrp. The
gray spots represent the relative expression
levels of each of the 2,758 genes expressed at a level above background
in all experiments. The dashed lines demarcate
the limits of 2-fold differences in expression levels. B,
the larger black dots identify 100 genes differentially expressed between strains IH-G2490 and IH-G2491
with p values less than 0.0014 based on a regularized
t test. The circled black
dots identify genes known to be regulated by Lrp. The
gray spots represent the relative expression
levels of each of the 2,758 genes expressed at a level above background
in all experiments. The dashed lines demarcate
the limits of 2-fold differences in expression levels.
|
|
View this table:
[in this window]
[in a new window]
|
Table III
Genes differentially expressed between lrp+ and lrp
(control vs. experimental) E. coli strains with a p value less than
0.001 identified with a simple t test
The data are presented as the average (mean) and S.D. of four
independent gene expression measurements expressed as a fraction of the
total hybridization signal (total mRNA) on each DNA microarray
filter.
|
|
View this table:
[in this window]
[in a new window]
|
Table IV
Genes differentially expressed between lrp+ and lrp
(control vs. experimental) E. coli strains with a p value less than
0.0001 identified with a regularized t test.
The data are presented as the average (mean) and S.D. of four
independent gene expression measurements expressed as a fraction of the
total hybridization signal (total mRNA) on each DNA microarray
filter.
|
|
Why does the regularized t test identify more
Lrp-regulated genes? The answer to this question lies in the fact that
all of the genes identified to be differentially expressed with a
p value less than 0.005 with the regularized t
test exhibit -fold changes greater than ~1.7-fold (Fig.
3B). However, many genes identified to be differentially
expressed with a p value less than 0.005 with the simple
t test exhibit -fold changes as small as ~1.2-fold (Fig.
3A). Furthermore, the 100 genes with the lowest p
values identified as differentially expressed by both methods contain only 43 genes in common. Thus, many of the genes identified by the
simple t test that are excluded by the Bayesian approach are genes that show small -fold changes. In general, these genes with small
-fold changes identified by the simple t test are associated with "too good to be true" within treatment variance estimates, reflecting underestimates of the within treatment variance when the
number of observations is small. The elimination of this source of
false positives by the Bayesian approach improves the
identification of true positives. However, although this is desired,
genes that are truly differentially expressed with small -fold changes
in the range of ~1.2-1.7-fold will also be eliminated by the
Bayesian approach. For example, of the 16 genes of the top 100 with the lowest p values identified by the simple t test
that are known to be regulated by Lrp, one was not identified by the
Bayesian method. This Lrp-regulated gene that did not pass the
regularized t test was the sdaC gene, previously
reported to be regulated by Lrp 3-fold (14, 15) and measured to be
regulated 1.9-fold in the experiment performed with the DNA arrays.
Nevertheless, although this gene is lost, the overall performance of
the regularized t test surpasses that of the simple
t test.
At first glance it might appear that the Bayesian approach validates
the often-used 2-fold rule for the identification of differentially
expressed genes (16), i.e. the identification of genes
differentially expressed between two experimental treatments with a
-fold change greater than 2 in, for example, three of four experiments.
This type of reasoning is based on the intuition that larger observed
-fold changes can be more confidently interpreted as a stronger
response to the experimental treatment than smaller observed -fold changes, which of course is not necessarily the case. An
implicit assumption of this reasoning is that the variance among
replicates within treatments is the same for every gene. In reality,
the variance varies among genes and it is critical to incorporate this
information into a statistical test (12). Clearly, with a background
standard deviation of, for example, 50, differential expression
measurements of 200/100 and 20,000/10,000 have different significance.
This point is further emphasized by simply examining the scatter plots
in Fig. 3. Here, many genes that appear differentially expressed
greater than 2-fold do not exhibit p values less than 0.005 and a global confidence level of at least 97%. This does not mean that
these might not be Lrp-regulated genes; it simply means that they are
false negatives that cannot be identified at this level of confidence.
Commonly used software packages do not possess algorithms for
implementing Bayesian statistical methods. Therefore, we developed a
statistical program, CyberT, which does accommodate this approach. We
use the statistical tools incorporated into CyberT to compare and
analyze the gene expression data from the experiments described here.
CyberT is available for on-line use on the genomics web site of the
University of California, Irvine.
A Computational Method to Determine False Positive
Levels--
A computational version of our ad hoc method
for estimating false-positive levels has been recently described (17).
The basic idea is to consider the p values as a new data set
and to build a probabilistic model for this new data. When control data sets are compared with one another (i.e., no differential
gene expression), it is easy to see that the p values ought
to have a uniform distribution between 0 and 1. In contrast, when data sets from different genotypes or treatment conditions are compared with
one another, the distribution of p values will tend to
cluster more closely to 0 than 1, i.e., there will be a
subset of differentially expressed genes with "significant"
p values. One can use a mixture of
distributions to
model this distribution of p values in the form shown in
Equation 1.
|
(Eq. 1)
|
For i = 0, we use r0 = s0 = 1 to implement the uniform distribution as
a special case of a
distribution. Thus, K + 1 is the
number of components in the mixture and the mixture coefficients
i represent the prior probability of each
component. In many cases, two components (K = 1) are
sufficient but sometimes additional components are needed. In general,
the mixture model can be fit to the p values using the
Expectation Maximization algorithm or other iterative
optimization methods to determine the values of the
, r,
and s parameters (17). From the mixture mode given
n genes, the estimate of the number of genes for which there
is a true difference is n(1
0).
In the case of the data reported here, the parameters of the
mixture model of Equation 1 with two
components are given by the
following:
0 = 0.56,
i = 0.44, r0 = 1, s0 = 1, r1 = 0.45, s1 = 3.01. For any
p value threshold T, the mixture model allows us
to estimate the rate of true positives (TP), true negatives (TN), false
positives (FP), and false negatives (FN) consistent with the
statistical assumptions made to derive the original set of p
values. More precisely, in a general mixture of
models, we have as
follows.
|
(Eq. 2)
|
|
(Eq. 3)
|
|
(Eq. 4)
|
|
(Eq. 5)
|
If we set a threshold T below which p values
are considered significant and representative of change, we can
estimate the rates of false positives and false negatives. The
posterior probability for differential expression (PPDE) then can be
calculated for each gene in the experiment with p value
p as PPDE(p) according to Equation 6.
|
(Eq. 6)
|
Alternatively, one can calculate a posterior probability of
differential expression PPDE(< p) for values below a
certain threshold p according to Equation 7.
|
(Eq. 7)
|
The distribution of p values from our
lrp+ versus lrp
data is
shown in Fig. 4, and a plot of
PPDE(p) and PPDE (< p) values versus
p values is shown in Fig. 5. A
comparison of the ad hoc method for determining the global
significance for the differential expression of a given gene and the
computational method is presented in Table
V. It is satisfying to see that these
data compare well.

View larger version (14K):
[in this window]
[in a new window]
|
Fig. 4.
Distribution of the p values
from the lrp+ versus
lrp data. The fitted model
(dashed curve) is a mixture of a and the
uniform distribution (dotted line).
|
|

View larger version (14K):
[in this window]
[in a new window]
|
Fig. 5.
Relationship between PPDE and p
value. PPDE (< p), gray points;
PPDE(p), black points. The dotted line correlates
the number of genes differentially expressed with PPDE (< p) of 0.97 that are measured with p < 0.0014.
|
|
It is clear from the data of Fig. 5 that for each p value
threshold T, there is a tradeoff between the rates of true
and false positives. A low conservative p value threshold
leads to few FP but may also reduce the TP rate. A large p
value threshold ultimately allows one to recover all the TP but at the
cost of increasing the FP rate. This fundamental tradeoff is usually
captured in statistics by using a receiver operating characteristic
curve obtained by plotting the true hit rate (or sensitivity) defined by TP/(TP + FN) versus the false hit rate, FP/(FP + TN)
(86). In the mixture model above, a simple calculation shows that for a
given p value threshold T,
|
(Eq. 8)
|
With two components in the mixture (K = 1), the last
expression reduces to the following.
|
(Eq. 9)
|
Thus, for our Lrp data the receiver operating characteristic curve
in Fig. 6 is simply the distribution
function of the second
component in the mixture. For instance, this
curve demonstrates that with a 76% true hit rate we can expect a 20%
false hit rate.

View larger version (16K):
[in this window]
[in a new window]
|
Fig. 6.
Receiver operating characteristic curve.
This plot correlates the fraction of correctly identified
differentially expressed genes (y axis) with the
fraction of falsely identified differentially expressed genes (x
axis).
|
|
The Functional Classes of Genes Differentially Expressed in
lrp+ and lrp
E. coli Strains--
To
facilitate the following discussions, we limit our considerations to
the 100 genes differentially expressed with the lowest p
value based on a regularized t test. The 100 genes
differentially expressed between lrp+ and
lrp
E. coli strains with a
p value less than 0.0014 and a PPDE greater than 0.98 are
listed in Table VI. In the text we simply
refer to the -fold change for each gene. However, as emphasized above, it should be kept in mind that reporting -fold changes is incomplete and can be misleading. For this reason, mean expression levels, standard deviations, p values, and PPDE values for the 39 genes with p values less than 0.0001 are reported in Table
IV. Additional statistical data for the remaining 61 genes with a
p value less than 0.0014 as well as all genes expressed at a
level of above background in all four experiments can be found in the
supplemental data (available in the on-line version of this
article).
Because the physiological purpose of Lrp is presumed to be the
coordination of gene expression levels with the nutritional and
environmental conditions of the cell (5, 6), it was pleasing to
discover that most of the genes affected by Lrp are ones that encode
products involved in small molecular and macromolecule synthesis or
degradation, as well as gene systems involved in small molecule
transport and environmental stress responses.
These genes can be sorted into the
functional groups shown in Fig. 7; they also are listed in
Table VI and discussed below.

View larger version (16K):
[in this window]
[in a new window]
|
Fig. 7.
Distribution of functions for genes
differentially expressed between lrp+ and
lrp Escherichia
coli strains.
|
|
View this table:
[in this window]
[in a new window]
|
Table VII
Genes differentially expressed between lrp+ and lrp
(control vs. experimental) E. coli strains with four measurements below
background obtained from lrp+ or lrp strains
The data are presented as the average (mean) and S.D. of four
independent gene expression measurements expressed as a fraction of the
total hybridization signal (total mRNA) on each DNA microarray
filter. NA, not available.
|
|
Small Molecule Biosynthesis--
Among the genes differentially
expressed between lrp+ and
lrp
strains, 11 are genes required for amino
acid biosynthesis. Of these, the ilvG, ilvM,
leuB, and serA genes are members of operons previously reported to be regulated by Lrp.
The ilvG and ilvM genes, the first two genes of
the ilvGMEDA operon, encode the two subunits of acetohydroxy
acid synthase II, one of three isoenzymes catalyzing the first step of
the parallel pathway for L-valine and
L-isoleucine biosynthesis. We have previously used an
ilvPG::lacZ construct to
measure
-galactosidase activities in the same isogenic
lrp+ and lrp
strains
employed in this study (7). These results showed that
-galactosidase
was increased 2.5-fold in the lrp
mutant
strain. The data reported here are consistent with this earlier report.
We have also described the presence of a constitutive internal
promoter, ilvPE, located between the
ilvM and ilvE genes of this operon (18). This
affect of the internal promoter is apparent in our DNA microarray data;
the expression of the operon distal ilvEDA genes is
decreased only 1.2-fold.
It is interesting that two other genes of the aspartate family of amino
acids previously unknown to be regulated by Lrp appear in this list
(19). These are thrL, the leader polypeptide of the
threonine operon (20-23), and asd, the structural gene for aspartyl-semialdehyde dehydrogenase (24, 25). This enzyme is involved
in the conversion of oxaloacetate to homoserine, a precursor of
threonine and isoleucine. These findings suggest the possibility that
all of the genes of the aspartate family that convert the TCA
cycle intermediate, oxaloacetate, to amino acids might be sensitive to
Lrp-mediated regulatory effects.
The serA gene encodes phosphoglycerate dehydrogenase, the
first enzyme specific for serine biosynthesis. Newman and colleagues (6, 15) have reported that the transcriptional level of serA is decreased 6-fold in a lrp
strain. Our
results exhibit a similar transcriptional regulation. Newman and
colleagues (15, 26) also have reported that the expression level of the
leu operon is decreased 11-fold in a
lrp
strain and showed that the growth rate of
a lrp
strain is increased by adding leucine to
the growth medium. On the other hand, Landgraf et al. (27)
have suggested that Lrp-mediated regulation of the leu
operon is indirect and reported a much smaller effect (1.4-fold). Our
studies agree with those of Landgraf et al.
Because these known Lrp-regulated genes identified by our experiments
are detected with a high level of measurement accuracy and confidence,
we can be similarly confident that the expression of other genes in
this group are also members of the Lrp regulon. However, the
differential expression of these newly identified genes could be the
consequences of either primary or secondary Lrp effects. An obvious way
to discern whether or not the operons containing these genes are
directly regulated by Lrp would be to search for Lrp binding sites in
their promoter-regulatory regions (10). Unfortunately, because of the
degeneracy of the consensus Lrp binding sequence, this is not possible.
Even when a 3 of 15 mismatch is allowed, 60% of all regions 500 base
pairs upstream of all E. coli ORFs contain at least one
putative Lrp binding site. Thus, it is difficult to determine at this
time whether the differential expression of these genes is directly or
only indirectly affected by Lrp.
Small Molecule Transport--
22 of the 100 genes listed in Table
VI are involved in small molecule transport. Of these, 11 have been
documented to be regulated by Lrp. Products of the livJ and
livKHMGF genes are components of two transport systems with
high affinity for leucine. The livJ gene product binds
leucine, isoleucine, and valine, whereas the livK gene
product is specific for leucine alone. These two systems share a set of
membrane components, products of the livHMGF genes. Haney
et al. (28) have reported that both of these operons are repressed by Lrp in the presence of high concentration of leucine. Bhagwat et al. (29) have reported that in the absence of
leucine the expression of the livJ gene is unaffected by Lrp
and that the expression of the livKHMGH operons is activated
approximately 15-fold. Because the experiments reported here were also
performed in the absence of leucine, we would expect similar results
and, in fact, we observe a 2.5-10-fold activation of the genes for the
livKHMGF operon. However, our results suggest that Lrp is also responsible for a 2-fold repression of livJ under these
growth conditions.
The oppABCDF operon contains genes encoding a periplasmic
binding protein and transport permease proteins for a wide range of
tripeptide transport systems. Austin et al. (30) have
reported that this operon exhibits high constitutive expression in a
lrp
strain. Accordingly, our results show that
the expression of the oppA and oppB genes is
increased 15- and 20-fold, respectively, but that the expression of the
oppC, oppD, and oppF genes are increased to a lesser extent. These data suggest the possibility of an
unidentified internal promoter between oppB and
oppC.
Four proteins, the malEFG and -K gene
products, are required for maltose uptake in E. coli. These
four genes are arranged in two operons, malEFG and
malK-lamB-malM. Tchetina et al.
(31) have reported that transcription of both of these operons are decreased 50-70% in a lrp
strain grown in
glycerol. Our results demonstrate that the transcription level of both
operons also is decreased approximately 80% in a lrp
strain grown in a glucose-supplemented
minimal MOPS medium. However, because malE is the only gene
of either operon that passed our statistical cut-off, we cannot be as
confident that Lrp also affects the expression of the other genes of
these operons (see supplemental data, available in on-line version of
this article).
Of the remaining newly identified genes of this class, we find two
examples of two genes in the same operon, artP and
artI of the artPIQMJ operon and potH
and potG of the potFGHI operon. artP
and artI are involved in arginine transport (32). The
expression of these genes is increased in the
lrp
strain 6.3- and 4.6-fold, respectively.
The potH and potG genes are members of the
potFGHI operon involved in the transport of putrescine (33).
The expression of these genes is decreased ~3-fold in the
lrp
strain. In addition to these previously
documented systems, our results suggest that transport systems involved
in the transport of various dipeptides, carbohydrates, organic acids,
alcohols, and inorganic compounds are also influenced by Lrp
(34-42).
Carbon and Energy Metabolism--
Besides its influence on the
transport of organic acids, Lrp also influences the expression of genes
involved in the metabolism of these compounds. For example, the levels
of expression of the structural genes for malP and
manA are both altered in a lrp
strain (43-45). The remaining genes of this group listed in Table VI
are involved either in the catabolism of glucose under aerobic conditions, or alternative carbon sources under anaerobic conditions (46-50).
Macromolecular Biosynthesis--
8 of the 100 genes listed in
Table VI are involved in macromolecule synthesis. Three of these genes,
hns, hupB, and himD, encode proteins
that influence the structure and DNA topology of the E. coli
chromosome. The remaining five are involved in protein synthesis or
degradation. Of these, only one has been previously identified as a
Lrp-regulated gene. This gene, lysU, encodes one of the two
lysyl-tRNA synthetases in E. coli. Gazeau et al.
(51, 52) have reported that this gene is repressed 9-fold by Lrp. Under
the conditions of our experiments, Lrp represses the expression of the
lysU gene 7-fold. Of the four newly identified genes of this
group, two, clpA and pepD, are proteases involved in protein degradation (53, 54). The remaining two genes, rpmI and rmf, encode ribosome-associated proteins
(55-58).
The hns, hupB, and himD genes are the
structural genes for the H-NS, HU, and IHF proteins that are important
for the condensation of the chromosome into a nucleoid structure, for
restraining negative supercoils, and in several cases for the
regulation of gene expression (3, 59-61). In each case, these genes
are repressed by Lrp. It is likely that the effect of Lrp-mediated
effects on the expression levels of these global regulatory gene
products might be responsible for many secondary changes in gene
expression levels observed in lrp
strains.
Regulatory Proteins of Stress Responses--
The expression levels
of several proteins involved in cellular adaptation to nutritional and
environmental assaults are increased 2-3-fold in the
lrp
strain. These include the alternative
sigma factors rpoS and rpoE. The rpoS
sigma factor,
38, is a central regulator of many
stationary phase-responsive genes. Although it is induced to high
levels in early stationary phase cells, it also is expressed, albeit at
a lower level, during the exponential growth phase, where it functions
as a general stress response element essential for prolonged cell
survival (62, 63). It is involved in the induction of several genes
important for osmotic, oxidative, heat, and DNA damage stress responses (64). The rpoE gene encodes another sigma element,
24, that also is expressed at a higher level in the
lrp
strain. Although the major functions of
coping with thermal stress are encoded by genes transcribed by
32, genes transcribed by
24 are necessary
for survival under extreme temperature stress conditions (65).
Interestingly, although the expression level of the rpoH gene for 