If you don't remember your password, you can reset it by entering your email address and clicking the Reset Password button. You will then receive an email that contains a secure link for resetting your password
If the address matches a valid account an email will be sent to __email__ with instructions for resetting your password
§ These authors contributed equally to this study. * This work was supported by Grants R01 CA18689 and P20 GM66401 and fellowship F32 GM067392 (to J. W. T.) from the National Institutes of Health, and D90-9870710 and KDI-9980088 from the National Science Foundation. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact. The on-line version of this article (available at http://www.jbc.org) contains Supplementary Data.
We have taken an integrated approach in which expression profiling has been combined with the use of small molecule inhibitors and computational analysis of transcription factor binding sites to characterize regulatory sequences of genes that are targets of specific signaling pathways in growth factor-stimulated human cells. T98G cells were stimulated with platelet-derived growth factor (PDGF) and analyzed by DNA microarrays, which identified 74 immediate-early gene transcripts. Cells were then treated with inhibitors to identify subsets of genes that are targets of the phosphatidylinositol 3-kinase (PI3K) and MEK/ERK signaling pathways. Four groups of PDGF-induced genes were defined: independent of PI3K and MEK/ERK signaling, dependent on PI3K signaling, dependent on MEK/ERK signaling, and dependent on both pathways. The upstream regions of all genes in the four groups were scanned using TRANSFAC for putative cis-elements as compared with a background set of non-induced genes. Binding sites for 18 computationally predicted transcription factors were over-represented in the four groups of co-expressed genes compared with the background sequences (p < 0.01). Many of the cis-elements identified were conserved in orthologous mouse genes, and many of the predicted elements and their cognate transcription factors were consistent with previous experimental data. In addition, chromatin immunoprecipitation assays experimentally verified nine predicted SRF binding sites in T98G cells, including a previously unknown SRF site upstream of DUSP5. These results indicate that groups of human genes regulated by discrete intracellular signaling pathways share common cis-regulatory elements.
The identification of regulatory elements that control gene expression is one of the paramount problems in genomics and systems biology. However, computational identification of transcription factor binding sites is difficult because they consist of short, degenerate sequences that occur frequently by chance (
). An alternative strategy limits searches for these elements to the upstream regions of genes that might be expected to be regulated by common transcription factors because they are functionally related (
In the present study, we have taken an integrated approach in which microarray expression profiling has been combined with the use of small molecule inhibitors to identify candidate transcription factor binding sites in groups of genes that are regulated by specific signaling pathways in growth factor-stimulated human cells. Many growth factors stimulate receptor-protein tyrosine kinases, leading to activation of intracellular signaling pathways that modulate gene expression by altering the activity of transcription factors (
). A primary response to growth factor stimulation of mammalian cells is the transcriptional induction of ∼100 immediate-early genes, whose induction results directly from the post-translational modification of pre-existing transcription factors (
/ERK, and phosphatidylinositol 3-kinase (PI3K) pathways. We used microarray analysis to identify immediate-early genes induced by the MEK/ERK and PI3K pathways, which play critical roles in cell proliferation and survival. Activation of the MEK/ERK pathway is mediated by the Raf protein kinases, which are coupled to growth factor receptors by Ras proteins (
). Once activated, ERK phosphorylates a variety of targets, including transcription factors and the protein kinase Rsk. Stimulation of growth factor receptors also results in activation of PI3K, leading to formation of the membrane phospholipid PIP3. PIP3 activates several downstream targets, including the protein kinase Akt, which plays a critical role in cell survival (
). Like ERK, Akt and other targets of PI3K signaling phosphorylate and activate transcription factors, leading to the rapid induction of immediate early genes.
Since induction of immediate-early genes is directly linked to signaling pathways that target transcription factors, genes that are responsive to a common signaling pathway might be expected to share transcription factor binding sites. We therefore sought to identify regulatory elements of genes induced by PI3K and MEK/ERK signaling, using a statistical analysis to identify transcription factor binding sites that were over-represented in the genomic regions upstream of groups of co-expressed genes. This approach identified binding sites for a limited number of transcription factors that were present at a high frequency upstream of genes regulated by specific signaling pathways. Many of the transcription factors predicted as regulators of immediate-early genes were established targets of the appropriate signaling pathways, and many of the predicted transcription factor binding sites were consistent with published experimental data and/or conserved in orthologous mouse genes. In addition, predicted binding sites for serum response factor (SRF) were confirmed directly by chromatin immunoprecipitation. It thus appears that biologically relevant transcription factor binding sites can be identified in groups of genes regulated by common signaling pathways in mammalian cells.
Cell Culture and Treatments—T98G human glioblastoma cells were grown in Minimal Essential Medium (Invitrogen) supplemented with fetal calf serum (10%). For growth factor/inhibitor treatments, cells were incubated in serum-free medium for 72 h, and either left unstimulated, or stimulated for 30 min with human PDGF-BB (50 ng/ml) (Sigma). U0126 (10 μm) (Cell Signaling Technology) and LY294002 (50 μm) (BioMol) were added 60 min prior to PDGF addition.
Immunoblots—In parallel to all microarray experiments, the activities of PI3K and MEK/ERK signaling pathways were assessed by immunoblotting cell lysates. Proteins were separated by electrophoresis in 8% SDS-polyacrylamide gels, electroblotted to nitrocellulose membranes, and probed with anti-phospho-Akt or anti-phospho-ERK antibodies (Cell Signaling Technologies) as recommended by the manufacturer. Blots were visualized using horseradish peroxidase-linked secondary antibody, and chemiluminescence (Amersham Biosciences).
RNA Preparations and Microarray Processing—Agilent Human I cDNA microarrays, containing PCR-amplified cDNA clones, were processed per manufacturer's guidelines. Briefly, RNA was isolated from multiple harvests of unstimulated and stimulated cells using TRIzol (Invitrogen) and RNeasy (Qiagen) protocols. Total RNA was oligo(dT) primed and reverse-transcribed in the presence of cyanine-coupled dCTP (PerkinElmer Life Sciences). Cyanine 3-dCTP and cyanine 5-dCTP dye-swap hybridizations were performed. Dye-swap determinations compared PDGF-stimulated cells in the presence or absence of inhibitor versus unstimulated cells. Arrays were scanned with a Gene-Pix 4000B scanner (Axon Instruments) with photomultiplier tube settings adjusted to eliminate signal saturation and provide an average Cyanine 3/Cyanine 5 intensity ratio of 1 across each array. GenePix Pro software (version 3.0) (Axon Instruments) was used to determine the Cyanine 3 and Cyanine 5 intensities for each array feature and the surrounding background. Following local background subtraction, the median intensities for each dye-swap pair were used to calculate the average log2 ratio for each feature (
Quantitative RT-PCR—Total RNA preparations for the microarray hybridizations were used in quantitative reverse transcription polymerase chain reactions (RT-PCR). Reverse transcription of 0.25 μg of total RNA was performed in 20 μl using SYBR green RT-PCR reagents and random hexamer primers (Applied Biosystems) as recommended by the manufacturer. Following a 95 °C incubation for 10 min, forty cycles of PCR (95 °C/15 s; 60 °C/1 m), were then performed on an ABI Prism 7900HT Sequence Detection System with 1 μl of the RT reaction, 100 nm PCR primers (see Supplementary Table I for primer sequences), and SYBR Green PCR Master Mix in 10-μl reactions. Threshold cycles (CT) for four replicate reactions were determined using Sequence Detection System software (version 2.0, release 4) and relative transcript abundance calculated following normalization with an 18 S ribosomal PCR amplicon. Amplification of only a single species was verified by a dissociation curve for each reaction.
Identification of Upstream Sequences—Transcription start sites relative to the human genome sequence were obtained for 64 of the 74 PDGF-induced genes from the LocusLink data base (www.ncbi.nlm.nih.gov/LocusLink/). The 5′ annotations for 13 of these transcripts were extended an average of 124 bases using the Data base of Transcription Start Sites (March 11, 2002 release) (
). Human genomic BLAST (www.ncbi.nlm.nih.gov/BLAST/) was then used to verify the position of each transcript in the genome and 1-kb upstream sequences were extracted from the corresponding GenBank™ contig records (www.ncbi.nlm.nih.gov/Entrez/). This work was based on build 29 of the human genome assembly maintained by the National Center for Biotechnology Information.
Identification of Transcription Factor Binding Sites—The computer program Match (version 1.4.1), distributed with the TRANSFAC Professional data base (Biobase Biological Databases), was used to identify putative transcription factor binding sites within each upstream sequence (
). The 400 vertebrate position weight matrices in TRANSFAC (version 6.1) were used to score every position along each promoter sequence. In order to identify the maximum number of candidate transcription factor binding sites, all positions with scores greater than predefined Match thresholds that minimize false negatives (minFN14.prf; false negative rate of 10%) were considered matches in the subsequent analysis. To prevent a bias introduced by palindromic or internally repetitive cis-regulatory elements, overlapping matches, including on opposite DNA strands, were defined as a single match.
Statistical Analysis of the Site Frequencies—The statistical significance of the frequency of a cis-regulatory element in each of the four groups of co-expressed genes was assessed by comparison against the average frequency in 194 genes expressed in both PDGF-treated samples and controls. This background set of upstream regions consisted of genes not induced by PDGF, with average log2 ratios limited to between -0.005 and 0.005 and standard deviations less than 0.25 following PDGF treatment. The upstream sequences for each gene were obtained in the same manner as the induced genes. To identify statistically over-represented binding sites in the PDGF-induced co-expressed gene groups, the mean number of sites identified per upstream region in each co-expressed gene group was compared with the mean per upstream region in the background group with a one-tailed two-sample Student's t test. In addition, a non-parametric permutation test, which does not assume a normal distribution, was used to ensure the validity of the Student's t test for the analysis. For each matrix, a permutation test was employed by randomly permuting the group labels of the background and promoter upstream sequences, and a t-value generated from the mean number of sites identified in the shuffled groups (
). After 10,000 permutations, the t-values were sorted, and a p value determined based on relative rank of the unpermuted t-value among the ordered list of t-values from the permuted groups.
Comparison with Orthologous Mouse Sequences—We identified mouse orthologs for 65 PDGF-induced genes using the mouse homology map information found in LocusLink. A 1-kb nucleotide sequence upstream of the reported mouse transcription start site was used as input to the previously described Match program. The human and mouse sequences were then aligned using the Needleman-Wunsch global alignment tool found in version 2.5.0 of The European Molecular Biology Open Software Suite (
). The gap open and extension penalties were set at 50.0 and 3.0, respectively, and the nucleotide-scoring scheme of match 10, mismatch -9 was used. The positions of each site identified in the human sequence were mapped to positions in the aligned mouse sequence, and sites occurring in both organisms at the same alignment position were recorded.
Chromatin Immunoprecipitation—Chromatin immunoprecipitations were performed as described (
), with the following modifications. T98G cells were scraped and formaldehyde fixed at 37 °C for 10 min. Shearing was performed to yield 500–1500 bp chromatin fragments with a Branson Sonifier 250, using four 30-s pulses at 25% output. Samples were precleared with sonicated salmon sperm DNA/Protein A agarose (50% slurry) and immunoprecipitated overnight at 4 °C using 4 μg/ml anti-SRF antibody (Santa Cruz Biotechnology, sc-335) (
). Complexes were then washed successively in low salt wash (0.01% SDS, 1% Triton X-100, 2 mm EDTA, 20 mm Tris-HCl, 150 mm NaCl, pH 8.1), high salt wash (0.01% SDS, 1% Triton X-100, 2 mm EDTA, 20 mm Tris-HCl, 500 mm NaCl, pH 8.1), LiCl wash (0.25 m LiCl, 1% IGEPAL-Ca 630, 1% deoxycholic acid, 1 mm EDTA, 10 mm Tris-HCl, pH 8.1), and twice in 10 mm Tris-HCl, 1 mm EDTA pH 8.0. Cross-links were reversed for 6 h at 65 °C, and samples were proteinase K treated for 2 h at 45 °C, followed by purification using a Qiagen Gel Extraction kit (Qiagen). Immunoprecipitated chromatin was quantified with real-time PCR as described above, using primers that either flanked the predicted site or amplified a fragment within 134 bp of the predicted site (see Supplementary Table I for primer sequences). Each PCR reaction was carried out in quadruplicate and results for each promoter region are derived from at least two independent chromatin immunoprecipitations. Data were normalized to input and are presented as fold increase over GAPDH, a standard negative control for SRF chromatin immunoprecipitations (
Identification of Immediate-Early Genes Induced by the PI3K and MEK/ERK Pathways—Microarray analysis was used to identify immediate-early genes induced by platelet-derived growth factor (PDGF) stimulation of quiescent T98G human glioblastoma cells, which were chosen for these experiments because they undergo reversible cell cycle arrest upon serum deprivation (
). Seventy-four genes were reproducibly induced >2-fold following 30 min of PDGF stimulation, the optimal time for induction of the immediate-early genes fos and jun (Table I). Gene inductions ranged from 2-fold to more than 80-fold (26.4) upon growth factor treatment, and were highly reproducible as evidenced by the standard deviations. Further, analysis of several representative genes by quantitative RTPCR confirmed the array data (Fig. 1). The number of genes induced was in good agreement with other studies examining immediate early gene induction, and included expected genes such as fos, jun, myc, and mcl1 (
Table IGenes induced by PDGF T98G cells were rendered quiescent and then stimulated by treatment with human PDGF-BB for 30 minutes. The values for each gene represent the mean average log2 ratio and standard deviation for dye-swap normalized determinations (N) comparing five independent cultures of PDGF-stimulated versus non-stimulated cells. Some genes were represented more than once on the array and thus have more than five determinations. Only genes induced >2-fold are presented and were used in subsequent analysis. Each gene is represented by the Unigene gene name and GenBank™ accession number provided with the microarrays.
). As expected, LY294002 inhibited phosphorylation of Akt, whereas U0126 inhibited phosphorylation of ERK (Fig. 2A). Furthermore, U0126 did not affect Akt phosphorylation and LY294002 had no effect on ERK phosphorylation, demonstrating the specificity of the inhibitors for each pathway. Gene expression profiles were then determined by analysis of PDGF-stimulated cells pretreated with inhibitors. Representative gene targets in inhibitor-treated cells (and appropriate vehicle controls) were validated using quantitative RT-PCR (data not shown).
Some genes were primarily inhibited by LY294002 or U0126, indicating that they were induced principally by either PI3K or MEK/ERK signaling, respectively, whereas others were affected by both of these pathways (Fig. 2B). In contrast, the induction of some genes was not significantly inhibited by either LY294002 or UO126 alone. Although these genes could be induced by a distinct PDGF-stimulated pathway, it is also possible that they could be responsive to both PI3K and MEK/ERK signaling, with either pathway alone being sufficient to induce gene expression. These alternatives were distinguished by treatment of cells with both LY294002 and UO126 in combination (Fig. 2C), which identified seven genes that were significantly inhibited (>2-fold) by both inhibitors in combination but not by either inhibitor alone. Induction of these genes can therefore be interpreted as being controlled by both PI3K and MEK/ERK signaling, with either pathway alone being sufficient for transcriptional activation.
Identification of Transcription Factor Binding Sites in PDGF-induced Genes—To test for common transcription factor binding sites, the PDGF-induced genes were divided into four groups (quadrants of Fig. 2B): PI3K- and MEK/ERK-independent (12 genes), PI3K-dependent (16 genes), MEK/ERK-dependent (21 genes), and dependent on both pathways (25 genes). Assignment was based on 50% inhibition by the appropriate inhibitors, which correlated with significant inhibition (p < 0.05). The seven genes that were not inhibited by LY294002 or U0126 alone, but were inhibited by both in combination, were classified as dependent on both pathways.
Sequences upstream of each transcription start site were obtained for 64 of 74 PDGF-induced genes from GenBank™ (PI3K- and MEK/ERK-independent, 10 genes; PI3K-dependent, 11 genes; MEK/ERK-dependent, 20 genes; dependent on both pathways, 23 genes), and each group of genes was analyzed using 400 vertebrate transcription factor binding site matrices from TRANSFAC (
). We limited the analysis to 1 kb to reduce detection of randomly occurring sequences. Although cis-regulatory elements are widely distributed throughout mammalian genomes, high concentrations of these elements often occur in proximal promoter regions. Based on published data in TRANSFAC, 82% of cis-regulatory elements that have been identified upstream of human genes occur within this 1-kb window.
To determine whether a transcription factor binding site was over-represented within a group of genes induced by a specific pathway (PI3K- and MEK/ERK-independent, PI3K-dependent, MEK/ERK-dependent, and PI3K- and MEK/ERK-dependent), we compared the frequency of sites within each group of upstream sequences to the background frequency in upstream sequences of 194 genes that were expressed in T98G cells, but were not induced by PDGF. The analysis was restricted to 230 matrices that detected no more than one site per kilobase in these background sequences, in order to focus on the most informative matrices. To identify a collection of sites that were statistically over-represented in the groups of PDGF-induced genes, the mean number of sites for each matrix per upstream region in each of the 4 groups of co-expressed genes was compared with the mean number of sites per upstream region in the background set of non-induced genes. The distribution of predicted transcription factor binding sites in the background set of upstream regions was approximately normal (see Supplementary Fig. 1), so a one-tailed two-sample Student's t test was used to identify transcription factor binding sites that occurred more frequently on average in each set of co-expressed genes compared with the background (p ≤ 0.01). To independently validate the results of the t test, the analysis was compared with a more stringent non-parametric statistical method based on permutation testing. Following 10,000 iterations, ranked results from a permutation test revealed a set of statistically significant matrices that were similar to the Student's t test results. A comparison of the transcription factors identified by these two tests is discussed below (see Table II).
Table IITranscription factors with over-represented binding sites in the upstream sequences of PDGF-induced genes Summary of transcription factors with statistically over-represented (p ≤ 0.01) binding sites upstream of each group of co-expressed genes as assessed by the Student's t-test and the corresponding p values from the permutation test. Related transcription factors with similar binding sites are presented as a single family (for example, ATF/CREB). The component matrices represented by each factor can be found in Supplementary Table III. Transcription factors with binding sites limited to one group of co-expressed genes are indicated in bold.
The distribution of the transcription factor binding sites identified in each group of co-expressed genes is presented in Fig. 3. For each matrix, the average frequency of sites identified relative to background is plotted on the x-axis, and the percentage of genes containing at least one site on the y-axis. For most matrices, the average frequency of sites in the induced genes did not differ significantly from background. However, some matrices identified sites with high frequencies above background, generally in a substantial fraction of genes. The average frequency of sites identified by 40 matrices indicated statistical over-representation (p ≤ 0.01) in one or more groups (14 in the PI3K- and MEK/ERK-independent group, 25 in the PI3K-dependent group, 8 in the MEK/ERK-dependent group, and 13 in the PI3K- and MEK/ERK-dependent group). With a Student's t test p value threshold of 0.01, we expect one false positive (Type I) error in 100 such tests. Multiple hypothesis testing with the 230 matrices used in our analysis would thus be expected to yield 2.3 false-positives in the statistically significant matrices from each group of co-expressed genes. Therefore, the number of matrices identified in each group of co-expressed genes is substantially greater than would be expected by chance.
Confirmation of SRF Binding Sites—Several approaches were used to assess the validity of the computational predictions. First, the predicted transcription factor binding sites were compared with published experimental data. Second, predicted sites were analyzed for conservation in the mouse, as physiologically relevant transcription factor binding sites are frequently conserved in the non-coding regions of orthologous genes (
). Next, we asked whether the transcription factor(s) deduced from the predicted binding sites were known to be regulated by the relevant signaling pathway. In addition, the sites predicted by matrices that represent serum response factor (SRF) binding sites were further tested by chromatin immunoprecipitation.
A detailed example of the verifications for the well-studied transcription factor, SRF, is presented in Fig. 4. Consistent with activation of SRF by both PI3K and MEK/ERK pathways (
), the SRF matrix, V$SRF_C, detected a significant number of sites in genes induced by these pathways. Sixteen SRF binding sites (serum response elements, or SREs) were found in 10 promoter regions. Thirteen of these had previously been identified, verifying the computational predictions (Fig. 4A). In addition, there were 3 genes (CYR61, JUNB, and ETR101) reportedly regulated by SRF for which we did not identify a SRE. This was because the SRE for CYR61 occurs immediately outside the 1-kb window used for our analysis, while the SRE for JUNB is downstream of the gene (
); this site also occurs outside the 1-kb analysis window in both the mouse and human sequences.
Sites identified by the SRF matrix were further evaluated by comparison with orthologous mouse sequences. These sequences were available for seven of the ten SRE-containing human genes. Although these aligned sequences had low overall percent identities, 13 of the SREs were conserved: 8 were identical, 4 differed in a single matrix position, and one in two positions of low weight (Fig. 4B). In addition to 12 of 13 experimentally verified sites, we also identified an unreported, but conserved, SRE in the promoter of the RhoE (ARHE) gene.
In addition to these validations, SRF cis-element predictions were tested by chromatin immunoprecipitation to obtain direct experimental verification of the computational predictions within the cell system used. Chromatin from T98G cells was immunoprecipitated using an anti-SRF antibody, and quantitative PCR was used to detect enrichment for specific upstream regions (Fig. 5). GAPDH, a gene not regulated by SRF, was used as a negative control (
). In addition, four genes with no SRF binding sites detected were selected from the background set (not induced by PDGF) as predicted negatives.
The genes tested included those with cis-elements predicted by two SRF matrices, V$SRF_C (shown in Fig. 4) as well as a second SRF matrix in the TRANSFAC data base, V$SRF_Q6. It is noteworthy that V$SRF_Q6 was a less stringent matrix, and predicted SRF binding sites in the background set of promoter sequences at approximately a 5-fold greater frequency than V$SRF_C (0.19 per kb for V$SRF_Q6 compared with 0.04 per kb for V$SRF_C; see Supplementary Table II).
SRF binding to 8 of the 10 genes predicted by the V$SRF_C matrix was confirmed by the chromatin immunoprecipitation assays (Fig. 5). The promoter regions of each of these genes (EGR1, EGR2, FOSB, FOS, MCL1, SRF, NR4A1, and DUSP5) were significantly enriched (10- to >100-fold) in chromatin immunoprecipitates with anti-SRF antibody in comparison to GAPDH. As expected, the highest fold enrichment was obtained with EGR1, which contains 6 SRF binding sites. In contrast, the 4 predicted negative genes from the background set did not show any significant enrichment over GAPDH in anti-SRF chromatin immunoprecipitates. The genes for which SRF binding sites were demonstrated by this analysis in T98G cells included all 7 genes in which SRF binding sites had been previously observed in other systems (EGR1, EGR2, FOSB, FOS, MCL1, SRF, and NR4A1) as well as DUSP5, in which SRF binding had not been previously described. Despite the prediction of a conserved SRE in ARHE, we were unable to confirm this site experimentally.
The less stringent V$SRF_Q6 matrix detected all of the sites predicted by V$SRF_C, as well as additional sites in ETR101, CCL8, RGS2, SLC21A3, and TIEG. In contrast to the sites predicted by V$SRF_C, none of the additional sites predicted by V$SRF_Q6 demonstrated enrichment in chromatin immunoprecipitations (Fig. 5). Although ETR101 was clearly enriched in anti-SRF chromatin immunoprecipitates, these experiments cannot distinguish between SRF occupancy at the position computationally predicted by V$SRF_Q6 (-884) and the previously demonstrated site in the mouse ortholog outside of the 1 kb window (-1188), which is recognized by V$SRF_C. Because of the proximity of these sites, we think it is more likely that the positive chromatin immunoprecipitations reflect binding to the -1188 site, rather than to the -884 site predicted by V$SRF_Q6. It thus appears that the V$SRF_Q6 matrix predicted a higher number of false positive binding sites than V$SRF_C, consistent with the higher frequency of V$SRF_Q6 sites in the background set of promoters.
Networks of Regulated Gene Expression—We next sought to integrate the experimental data and our computational predictions into a transcriptional regulatory network. To generate this network, we combined the computational results from TRANSFAC matrices that were redundant or represented sites for families of related transcription factors. Thus, the 40 significant binding sites matrices identified in Fig. 3 corresponded to 18 unique transcription factors or families (Table II). For each of these factors, Table II indicates the p value of the most significant matrix as determined by both the Student's t test and the permutation test. 14 of 18 factors identified as highly significant by the t test were also scored as significant (p < 0.05) by the permutation test. However, 4 factors (CDP/Cut, OCT7, ROAZ, and RORα2) identified as significant by the Student's t test were not statistically significant by the permutation test. As discussed further below, it is noteworthy that the binding sites predicted for these factors were identified in only 1 or 2 target genes and were not supported by experimental evidence, suggesting that they may represent false positives in the Student's t test.
The network of genes regulated by all 18 factors is presented in Fig. 6. All genes identified as having binding sites predicted by any of the TRANSFAC matrices for these factors are included, although (as discussed above for V$SRF_Q6) some are expected to represent false positives corresponding to the frequency of sites predicted by each matrix in the background set of promoter sequences (see Supplementary Table II). In addition to SRF, predicted binding sites for STAT5, NF-κB, and ATF/CREB have been demonstrated experimentally (orange lines). At an additional level of confirmation, orthologous mouse sequences were obtained and aligned with 45 of the 64 human promoter regions (Supplementary Table IV). Within these regions, 50% of the predicted human binding sites were conserved in the mouse (green lines). For example, 36 ATF/CREB sites were detected in 23 human sequences for which a mouse ortholog was available. Twenty-three of these sites were conserved, 6 of which have been experimentally verified, supporting the role of ATF/CREB as a regulator of these genes.
Several of the predicted transcription factors are known targets of relevant signaling pathways (black lines). Binding sites for Forkhead (FOX) family members were over-represented among PI3K-dependent genes, consistent with regulation of Forkhead family members by PI3K/Akt signaling (
), were over-represented in the PI3K- and MEK/ERK-independent genes. Other factors, including SRF, were over-represented in multiple groups. For example, binding sites for ATF/CREB were over-represented in all 4 groups of genes, consistent with activation of CREB by cAMP/PKA signaling, as well as by PI3K/Akt and MEK/ERK/Rsk-2 (
). Overall, the regulation of 7 of the 18 predicted transcription factors was consistent with previous experimental data.
In combination, the conservation of predicted human regulatory elements in orthologous mouse genes and previous experimental verification of either predicted transcription factor binding sites or their cognate transcription factors provided validation for 11 of the 18 transcription factors that were predicted by our analysis (ATF/CREB, NF-1/myogenin, STAT1/5, MEF2, NFκB, SRF, C/EBPα, Forkhead, Nkx2–5, OCT1/2, and PBX1). Predicted binding sites for most of these factors were identified in upstream sequences of multiple genes in each co-expressed group (Fig. 6), consistent with the hypothesis that common transcription factor binding sites would be shared among co-expressed immediate early genes. Of the 18 unique predictions, 14 were confirmed by the permutation test (Table II). It is noteworthy that the 4 factors not confirmed by the permutation test (CDP/Cut, OCT7, ROAZ, and RORα2) were also not validated by either experimental data or conservation in the mouse. Moreover, binding sites for 3 of these factors (OCT7, CDP/Cut, and ROAZ) were predicted in only a single gene and binding sites for RORα2 in only two genes. These factors may thus represent false positives, in contrast to the physiologically significant factors that have predicted binding sites in a number of co-expressed genes.
The agreement of many of our predictions with previous experimental data, the conservation of predicted sites in the mouse, and the direct validation of SRF binding sites by chromatin immunoprecipitation demonstrates the presence of common cis-regulatory elements in groups of co-expressed human genes. A critical element of this analysis was the experimental grouping of genes based on their regulation by specific signaling pathways that directly target transcription factors. By focusing on the specific induction of immediate early genes, we were able to establish a direct relationship between groups of genes and their transcriptional regulators. This allowed statistical analysis of the frequencies of regulatory elements in groups of co-expressed genes, addressing the problem of frequently occurring sequences that resemble transcription factor binding sites in genomic DNA. The accuracy of the identification of transcription factor binding sites in groups of co-expressed genes is coupled to both the stringency of the statistical analysis and the results of phylogenetic footprinting. Although we expect false positives in the cis-elements identified in individual genes, corresponding to the background associated with each matrix, the high frequencies of particular transcription factor binding sites in the co-expressed gene groups substantiates these factors as likely targets of the relevant signaling pathways. Additional computational improvements would be expected to further enhance the power of this approach. Such improvements might include the development of better-defined matrices for identification of transcription factor binding sites, as indicated by the false positives revealed by the experimental validations of the V$SRF_C and V$SRF_Q6 predictions, as well as analysis of clustered transcription factor binding sites (