J. Biol. Chem., Vol. 279, Issue 45, 46779-46786, November 5, 2004


From the Section of Developmental Pharmacology and Experimental Therapeutics, Division of Pediatric Clinical Pharmacology and Medical Toxicology and ¶Laboratory of Human Molecular Genetics, Children's Mercy Hospital and Clinics, Kansas City, Missouri 64108
Received for publication, July 26, 2004, and in revised form, August 13, 2004.
| ABSTRACT |
|---|
|
|
|---|
| INTRODUCTION |
|---|
Traditionally, computational identification of protein binding sites in DNA has relied on the use of consensus sequences initially defined experimentally. The strengths of binding sites can be ranked by principal component analysis, and their relative affinity can be estimated relative to a consensus reference sequence on a logarithmic scale (10). However, consensus sequences occur infrequently (11), and because they represent sites with strong binding affinity, they may not accurately measure the strengths of weak and intermediate sites, which represent the majority of protein recognition elements. A significant and generally unappreciated problem of identifying protein binding sites in DNA is this underlying bias toward consensus sequences that has hampered identification of functional, naturally occurring binding sites with lower affinities for cognate proteins.
An alternative approach to identifying and characterizing protein-DNA interactions is a computational method based on information theory (12–14). Information theory-based models of protein-DNA interactions show which nucleotides are permissible at both highly conserved and variable positions of binding sites (14–16). The average information of a set of related sequences, Rsequence, connotes the overall conservation of a set of DNA sites bound by the same recognizing protein(s) (17), whereas specific binding sites that are members of this set can be ranked by their corresponding distribution of individual information contents (Ri, measured in bits of information) (16). Strong sites have Ri values > Rsequence, and weak sites are those with Ri < Rsequence. For models based on large numbers of binding sites, the distribution of Ri values is approximately Gaussian (15). Both information theory and experimental studies demonstrate that nonfunctional sites have Ri values ≤ 0 bits of information (18, 19). The zero coordinate on the Ri distribution can also be understood from a thermodynamic viewpoint. Ri values > 0 bits correspond to binding sites, because as entropy increases, energy is dissipated upon protein binding to the nucleic acid sequence. From the second law of thermodynamics, a 1-bit change in information corresponds to at least a 2-fold change in binding strength. Selection of the most frequent base at each position of the Ri(b,l) weight matrix produces the consensus sequence, which therefore has the largest Ri value and represents the upper bound of the distribution of Ri values. Through its relationship to thermodynamic entropy, the information measure reflects binding energy and has been extremely useful in understanding the consequences of allelic variants, which affect splicing of genes related to hereditary nonpolyposis colon cancer (19) and atherosclerosis (18).
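The quantities introduced above can be illustrated with a minimal sketch. It assumes the commonly used form Ri(b,l) = 2 + log2 f(b,l), where f(b,l) is the frequency of base b at position l, and it replaces the small-sample correction with a simple pseudocount; both the pseudocount and the four 6-bp example sites are illustrative assumptions and are not taken from the PXR models described here.

```python
import math
from collections import Counter

BASES = "ACGT"

# Hypothetical aligned sites for illustration only (not PXR model data).
SITES = ["AGTTCA", "AGGTCA", "AGTTCA", "GGTTCA"]


def weight_matrix(sites, pseudocount=0.5):
    """Build an Ri(b,l) matrix: 2 + log2 f(b,l) at each aligned position.

    The pseudocount is an assumption standing in for a small-sample
    correction; it also keeps unobserved bases from scoring -infinity.
    """
    total = len(sites) + 4 * pseudocount
    matrix = []
    for position in range(len(sites[0])):
        counts = Counter(site[position] for site in sites)
        matrix.append(
            {b: 2 + math.log2((counts[b] + pseudocount) / total) for b in BASES}
        )
    return matrix


def ri(site, matrix):
    """Individual information content (bits) of one site."""
    return sum(column[base] for column, base in zip(matrix, site))


def consensus(matrix):
    """Highest-scoring (most frequent) base at every position."""
    return "".join(max(column, key=column.get) for column in matrix)


def fold_change(delta_bits):
    """Minimum fold change in binding strength for a difference in Ri."""
    return 2 ** delta_bits


MATRIX = weight_matrix(SITES)
# Rsequence is the mean Ri over the sites used to build the model.
R_SEQUENCE = sum(ri(site, MATRIX) for site in SITES) / len(SITES)
```

In this sketch `ri(consensus(MATRIX), MATRIX)` is the upper bound of the Ri distribution, and `fold_change` converts a difference in bits into the minimum corresponding change in binding strength (1 bit corresponds to at least 2-fold).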
In this study, we have employed information theory to develop and refine a model of PXR binding elements. This model was validated by scanning the 5' regulatory regions of genes regulated by PXR ligands and by testing novel PXR response elements (PXREs) for binding and transcriptional activation. Using this model, we have accurately and comprehensively identified PXREs in genomic sequences and have predicted, in silico, several novel human genes that are potentially regulated by PXR through these enhancer elements and thereby provided novel insights into the origin and scope of the PXR response.
| MATERIALS AND METHODS |
|---|
![]() | (Eq. 1) |
|
![]() | (Eq. 2) |
Scanning with PXR/RXR Models. Ten kb of the 5' promoter regions of genes shown previously to be activated by PXR ligands in microarray experiments (21) were evaluated for PXR/RXR binding sites. For analyses of specific promoters, sequences were retrieved from GenBank™ or from the genome draft, and both strands were scanned with the PXR/RXR information model to identify putative binding sites with Ri > 0 bits.
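The scanning step described above can be sketched as a sliding-window search over both strands. This is only an illustration of the idea, not the software used in this study: the 3-position matrix, its values, and the test sequence are hypothetical, and the input is assumed to contain only uppercase A, C, G, and T.

```python
# Toy 3-position weight matrix in bits; values are hypothetical and are
# not taken from any PXR/RXR model.
TOY_MATRIX = [
    {"A": -4.0, "C": -4.0, "G": -4.0, "T": 1.5},
    {"A": -4.0, "C": 1.5, "G": -4.0, "T": -4.0},
    {"A": 1.5, "C": -4.0, "G": -4.0, "T": -4.0},
]

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_both_strands(sequence, matrix, min_bits=0.0):
    """Report windows on either strand whose summed score exceeds min_bits.

    Each hit is (left coordinate on the forward strand, strand,
    site sequence as read on its own strand, score in bits).
    """
    width = len(matrix)
    hits = []
    for strand, seq in (("+", sequence), ("-", reverse_complement(sequence))):
        for start in range(len(seq) - width + 1):
            window = seq[start:start + width]
            bits = sum(column[base] for column, base in zip(matrix, window))
            if bits > min_bits:
                # Map minus-strand offsets back to forward-strand coordinates.
                left = start if strand == "+" else len(sequence) - start - width
                hits.append((left, strand, window, bits))
    return sorted(hits)


HITS = scan_both_strands("AATCAGGTGACC", TOY_MATRIX)
```

Reporting minus-strand hits in forward-strand coordinates keeps sites from both strands comparable when they are later mapped relative to a transcription start site.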
For genome-wide analysis, the April 2003 human genome reference sequence was scanned with the Delila-Genome system (22) to identify putative binding sites within 10 kb (upstream and downstream) of the start sites of both known and novel PXR gene targets. Rfrequency represents the amount of information required to find a site in the genome (17),
$R_{frequency} = -\log_2(\gamma / G)$ | (Eq. 3) |
where γ is the number of sites in the genome, and G is the size of the genome. To characterize the results of this scan, accession numbers of mRNAs corresponding to genes with predicted PXR binding sites with Ri values exceeding Rsequence were parsed from the Genome Browser custom track and then batch-imported into MatchMiner (23) to obtain a list of Human Genome Organization-approved gene symbols (n = 1825). After removing duplicate loci, the global PXR response was categorized by examining the ontologies of those symbols that are annotated in UniProt (n = 579) (24).
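As a brief illustration of the Rfrequency definition, the sketch below assumes the standard form log2(G / number of sites), which matches the description of the information needed to locate a given number of sites in a genome of size G. The function name and the example values are hypothetical.

```python
import math


def r_frequency(genome_size, n_sites):
    """Bits of information needed to locate n_sites in genome_size positions."""
    return math.log2(genome_size / n_sites)


# Hypothetical example: 1,024 sites among 1,048,576 possible positions.
EXAMPLE_BITS = r_frequency(1_048_576, 1_024)
```

Each halving of the number of sites, or doubling of the genome size, raises Rfrequency by exactly 1 bit.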
Plasmids. Expression plasmids containing the cDNA for hRXRα (pSG5-hRXRα) and hPXR (pSG5-hPXRΔATG) (4) were kindly provided by Steve Kliewer (University of Texas, Southwestern) and Linda Moore (GlaxoSmithKline). Oligonucleotides containing the indicated response elements were annealed, phosphorylated with T4 polynucleotide kinase, and ligated into pGL3-promoter (Promega, Madison, WI) to generate firefly luciferase reporter constructs used for transient transfections. The plasmids were sequenced to confirm the presence and sequence of insertions.
Electrophoretic Mobility Shift Assays (EMSAs). hPXR and hRXRα were translated from pSG5-hPXRΔATG and pSG5-hRXRα with the TNT transcription/translation system (Promega) according to the manufacturer's directions. Binding reactions (20 µl) contained 10 mM HEPES, pH 7.8, 60 mM KCl, 0.2% Nonidet P-40, 6% glycerol, 10 ng/ml poly(dI·dC), 1 mM dithiothreitol, and 1 µl each of hRXRα and hPXR or 2 µl of unprogrammed lysate. Reactions including the competitor oligonucleotides (0.1–30 pmol) were incubated on ice for 10 min before the addition of 0.01 pmol 32P-labeled CYP3A4 proximal PXRE oligonucleotide. The reactions were incubated on ice for an additional 15 min. Bound and unbound DNA were separated on a 5% polyacrylamide gel, which was run in 0.5x TBE (50 mM Tris, 45 mM boric acid, 0.5 mM EDTA). DNA-protein complexes were visualized by autoradiography and quantitated by densitometry on a Kodak Digital Sciences Image Station 440CF. Oligonucleotides used in this study can be found in Supplementary Table 1.
Transient Transfection Assays. HepG2 cells, obtained from ATCC (Manassas, VA), were plated in 12-well plates in Dulbecco's modified Eagle's medium, without phenol red (Invitrogen), supplemented with 5% fetal bovine serum 24 h before transfection. The cells were transfected with 100 ng of the indicated firefly luciferase reporter construct, 25 ng of pSG5-hPXRΔATG, and 5 ng of pRL-CMV with LipofectAMINE Plus (Invitrogen) according to the manufacturer's recommendations. Three hours after transfection, the cells were exposed to 0.1% Me2SO or 10 µM rifampin (Sigma) in Dulbecco's modified Eagle's medium without phenol red, supplemented with 2.5% fetal bovine serum for 24–48 h. The cells were harvested in passive lysis buffer (Promega), and luciferase activity was detected using the Dual-Luciferase reporter assay system from Promega in 10 µl of cell lysate with a Lumat LB 9507 (EG & G Berthold).
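The text does not state how the reporter readings were normalized. A minimal sketch of one conventional treatment of dual-luciferase data follows, assuming that firefly activity is divided by the co-transfected control (pRL-CMV) activity in each well and that induction is expressed relative to vehicle (Me2SO). The function names and all readings are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def normalized_activity(wells):
    """Mean firefly/control ratio over replicate wells.

    wells: list of (firefly_reading, control_reading) tuples.
    """
    return mean([firefly / control for firefly, control in wells])


def fold_induction(treated_wells, vehicle_wells):
    """Normalized activity with inducer relative to vehicle alone."""
    return normalized_activity(treated_wells) / normalized_activity(vehicle_wells)


# Hypothetical relative light units for two replicate wells per condition.
RIFAMPIN = [(400.0, 100.0), (800.0, 200.0)]
VEHICLE = [(100.0, 100.0), (200.0, 200.0)]
INDUCTION = fold_induction(RIFAMPIN, VEHICLE)
```

Dividing by the control reporter first removes well-to-well differences in transfection efficiency before treated and vehicle conditions are compared.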
Quantitative Real-time PCR. RNA was isolated from HepG2 cells treated for the indicated times with 10 µM rifampin or Me2SO using TRIzol reagent (Invitrogen) according to the manufacturer's recommendations. Quantitative RT-PCR reactions were performed with 20 ng of total RNA using the QuantiTect™ SYBR® Green RT-PCR kit (Qiagen, Valencia, CA) using CASP10-specific primers C614 (5'-ccg agt cgt atc aag gag agg aag aac-3') and 3BRIDGE (5'-tat atg cac tgt gaa ccc aag cca-3') (25).
Statistics and Curve Fitting. Validation set data were fitted using nonlinear logistic regression with Ri (estimate of predicted affinity) as the independent variable and IC50 values (surrogate measure of observed affinity) as the dependent variable in SigmaPlot 2000. To include sites for which an IC50 value could not be determined at concentrations of oligonucleotide tested in EMSAs, IC50 values were rank-ordered 1–12 from lowest to highest, and a rank of 14 was assigned as the average rank for all nonbinding sites. Spearman's ρ correlation coefficients were determined in SPSS (version 12.0; SPSS Inc., Chicago, IL) between predicted and observed binding affinities.
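The ranking scheme described above can be sketched in plain Python. This is not the SPSS procedure itself; it assumes that Spearman's ρ is computed as the Pearson correlation of average ranks, and it represents an undetermined IC50 as infinity so that all nonbinding sites tie for the highest rank. With twelve binding sites and three nonbinding sites, the tied sites occupy ranks 13, 14, and 15 and therefore each receive rank 14.

```python
import math


def average_ranks(values):
    """Rank values from smallest to largest; ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Twelve measurable IC50 values (placeholders) plus three nonbinding sites.
IC50_PLACEHOLDER = list(range(1, 13)) + [float("inf")] * 3
IC50_RANKS = average_ranks(IC50_PLACEHOLDER)
```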
| RESULTS |
|---|
The three novel PXR/RXR binding sites identified by Model 1 and the recently described distal PXRE from the CYP2B6 gene promoter (26) were aligned with the original set of PXREs to generate Model 2. The resulting sequence logo (Fig. 2b) has an Rsequence of 17.00 ± 3.52 bits. Model 2 is still biased toward sites with higher information content but begins to approach a normal distribution over the range of sites aligned (Fig. 2b). The novel PXREs identified with Model 1 were re-evaluated with Model 2. The information content for all three sites is increased in the revised model, and the minimum theoretical changes in binding affinity are no longer overestimated for the sites detected in the UGT1A3 and UGT1A6 gene promoters (Table I).
We developed a strategy to eliminate consensus sequence bias by introducing weaker sites into the model. The promoters of additional genes known from published literature and microarray studies to be induced by PXR/RXR ligands were scanned for potential binding sites using Model 2. Thirty-five sites (Supplementary Table 1) that had at least 6 bits of information were identified in thirteen gene promoters and were evaluated with EMSAs. Candidate sequences that were able to competitively inhibit the binding of PXR/RXR to the CYP3A4 pPXRE in a concentration-dependent manner were added to those in Model 2. Model 3 represents the alignment of 32 PXR/RXR binding sites (19 from Model 2 plus 13 novel sites) and reduced Rsequence to 14.94 ± 4.04 bits (Fig. 2c). The distribution of Ri values for sites in Model 3 approaches a normal distribution between 5.7 and 21.7 bits of information (Fig. 2c), indicating the presence of weaker sites in Model 3 relative to Models 1 and 2.
An element from the NOS2A gene promoter previously characterized as a strong PXRE (7) has an information content of 5.68 bits when evaluated with Model 3, suggesting it is a weak binding site for PXR/RXR. This PXRE has base changes at positions 0 and 10 compared with the optimal PXRE predicted by Model 3 (G instead of A; GGTTCAtgccGGTTCA). Both positions show a high degree of conservation (1–2 bits of information), and adenine occurs most frequently (A on top of stack) in Model 3 (Fig. 2c). To test the requirement of adenine at positions 0 and 10, competition EMSAs were performed with 14 oligonucleotides containing PXREs aligned in Model 3 with guanine substituted for adenine at positions 0 and 10. All of the synthetic sequences were capable of competing for binding of PXR/RXR to the pPXRE from the CYP3A4 gene promoter (data not shown). Model 4 was generated by the addition of these synthetic sequences to the alignment. Model 4 has an Rsequence of 14.43 ± 3.21 bits (Fig. 2d) and contains 48 sites ranging between 7.2 and 20.7 bits of information (Fig. 2d). The Ri value for the PXRE from NOS2A increased from 5.78 to 13.99 bits of information (or a 294-fold increase in predicted affinity), better predicting the strong affinity of PXR/RXR for this site.
An independent validation set of fifteen predicted binding sites in the human genome was developed to determine the ability of Model 4 to predict PXR/RXR binding site affinity accurately. Competition EMSAs were performed with labeled pPXRE from the CYP3A4 gene promoter to assess the relative binding affinities of sites in the validation set. Twelve of the fifteen predicted sites competed for PXR/RXR binding at the concentrations tested, and IC50 values were determined as described under "Materials and Methods." The validation set data were fitted using nonlinear logistic regression. Ri values derived from Model 4 provided better prediction of binding affinity (r² = 0.50, p < 0.05) versus Model 3 (r² = 0.19), which was not significantly different from Models 2 (r² = 0.06) or 1 (r² = 0.26). However, this analysis disregards three sites that did not compete for PXR/RXR binding. To incorporate the nonbinding sites, IC50 values were ranked with binding sites ordered 1–12, and the three nonbinding sites were assigned the average rank of 14, because relative binding affinities could not be determined. Using Spearman's ρ, a significant correlation was observed between Ri and IC50 for Models 3 (ρ = 0.649, p = 0.01) and 4 (ρ = 0.628, p = 0.01) but not for Models 1 or 2 (ρ = 0.355, p = 0.194). The lack of superiority of Model 4 over Model 3 with this analysis may be due to the fact that Model 4 was the result of adding synthetic sequences to Model 3, in which positions 0 and 10 were converted from adenine to guanine to assess the requirement of adenine at these positions. The validation set contained only two sites where improvements in the model could be tested (guanosine at positions 0 and 10).
Model 4 was used to scan the human genome draft to identify PXR/RXR binding sites with >10 bits of information within 10 kb of transcribed genes. The scan of Build 33 (April, 2003 release) resulted in 355,831 PXR/RXR binding sites with Ri values >10 bits, 190,490 of which were unique sequences. Fig. 4 displays a frequency histogram of Ri values of sites predicted by Model 4. As expected from information theory (28), as Ri approaches the individual information content of the optimal PXR/RXR binding site predicted by the model, the number of sites predicted in the genome decreases. In fact, the strongest site in the human genome has an Ri value of 22.95 bits, 2.29 bits less than the 25.24 bits of information in the optimal, or "consensus," PXR/RXR binding site predicted by the model. This translates into a binding affinity of at least 4.86-fold less than the PXR/RXR binding to the optimal binding site sequence.
A scan of promoters in the human genome with Model 4 identifies several sites within the regulatory regions of genes not previously recognized to be regulated by PXR. One example is the CASP10 gene promoter, which contains a cluster of three sites with Ri values of 14.13, 14.66, and 18.69 bits between 7 and 8 kb upstream (Fig. 3d). Caspase 10 is an initiator caspase that initiates apoptosis following proteolytic cleavage in response to the binding of death receptors to ligands and is regulated transcriptionally as well as post-translationally (29). The 18-bit site (Table I) was tested to determine whether (and how well) it was bound by PXR/RXR heterodimers. This site competed effectively for binding of PXR/RXR; however, it had an apparent affinity 6.2-fold less than that observed with the CYP3A4 pPXRE (Table I).
To determine whether the novel PXR/RXR binding sites identified with the model, as well as the predicted optimal PXR/RXR binding site, could modulate transcription, luciferase activity was measured in HepG2 cells transiently transfected with reporter constructs containing two copies of the PXREs upstream of the SV40 basal promoter of the pGL3-promoter. Rifampin (10 µM) induced luciferase activity in cells transfected with constructs containing the pPXRE from the CYP3A4 gene promoter, the distal PXRE from the CYP2B6 gene promoter, the 18-bit site from the CASP10 gene promoter, and the optimal PXR/RXR binding site (Fig. 5a). Rifampin did not affect luciferase activity in cells transfected with the pGL3-promoter vector in the absence of any upstream PXREs. Induction of transcription through the PXR/RXR binding site from the CASP10 gene promoter was independent of orientation (sense or antisense) relative to the start of transcription, suggesting that it may act as a true enhancer element. In contrast, rifampin had no effect on luciferase activity in cells transiently transfected with reporter constructs containing two copies of the PXR/RXR binding sites from the UGT1A3 or UGT1A6 gene promoters, implying that PXR/RXR binding to these sequences alone is not sufficient to drive induction of these genes.
| DISCUSSION |
|---|
As a result, many existing sources of transcription factor binding site data do not adequately describe the range of natural variation present in such sites. Commonly used promoter prediction software does not typically reveal this underlying deficiency (35), whereas this bias is clearly evident from information theory-based models. Furthermore, boundaries of sites developed from consensus sequences are often arbitrarily defined (e.g. TRANSFAC, www.biobase.de) and tend to ignore the information from weakly conserved positions adjacent to core recognition sequences. By contrast, information analysis optimally aligns sets of binding sites by minimizing uncertainty, and the contributions of all nucleotide positions where information exceeds background noise are represented.
We have developed an information theory-based model of the PXR responsive element beginning with fifteen previously reported binding sites for PXR/RXR and subjecting this initial model to three rounds of iterative refinement by adding naturally occurring and synthetic binding sites (Figs. 1 and 2). Model 4 is the result of the alignment of 48 binding sites, and although some bias toward strong sites remains, the strengths of sites in the model are normally distributed across the range of Ri values represented. Furthermore, Model 4 is biased toward a DR4, most likely because of the prevalence of DR4 sites versus DR3 or ER6 binding sites in the initial alignment of fifteen sites. It is conceivable that certain DR3, ER6, or IR0 sites may not be identified as PXR/RXR binding sites or that their strengths may be incorrectly estimated, a limitation of the current model. A bipartite model of PXR/RXR binding sites is currently under development to accommodate variable spacing between the two half-sites (36).
In Model 4 of the PXR/RXR binding site, the 5' half-site is less well conserved than the 3' half-site. Previous studies of other RXR heterodimers have shown that RXR occupies the upstream half-site, whereas its partner occupies the downstream half-site on repeats separated by 2–5 bp (37–40). The increased information of the 3' half-site may direct, in part, specificity of the element to PXR/RXR heterodimers, whereas the decreased information of the 5' half-site may reflect the requirement for RXR to recognize a wider variety of sequence motifs through its dimerization with numerous other partners.
Surprisingly, validated moderate to high affinity sites in the promoters of the UGT1A gene family members did not confer responsiveness to rifampin in transient transfection assays in the HepG2 cell line. Although we cannot exclude the possibility that these sites may actually be recognized by the closely related constitutive androstane receptor (NR1I3) that also heterodimerizes with RXR, these observations may instead indicate that binding alone is insufficient for transactivation by PXR/RXR. Ligand-induced interactions with coactivators, such as SRC-1, or other transcription factors may be required (4, 41, 42). For example, interactions between PXR or the constitutive androstane receptor and hepatocyte nuclear factor 4α are required for basal and inducible expression of CYP3A4 (43). This is consistent with the possibility that PXR binding sites with moderate amounts of information may require such interaction(s) with coactivators or other trans-acting factors to initiate transcription.
Rfrequency reflects the amount of information required to distinguish a binding site from all possible sites in the genome and is determined from the size of and number of sites in the genome (44, 45). For Model 4, Rfrequency (28) is 13.1 bits, which falls within the model variance. However, the Rsequence value for Model 4 (14.43 bits) predicts 139,038 sites to be found in the human genome. Rfrequency is often similar to Rsequence (28), because the transcription factor state space is constrained for DNA binding proteins that operate on the genome in which they are encoded (46). However, the genome scan with Model 4 detected 355,831 potential PXR/RXR binding sites, with Ri > 10 bits within 10 kb (upstream and downstream) of the transcription start sites. Possible explanations for this discordant result include: (a) lack of specificity for detecting functional PXR/RXR binding sites, (b) potential overlap and co-recognition with other related nuclear receptor recognition sequences, or (c) Rfrequency is underestimated, because only a subset of the genome is in fact accessible to PXR/RXR.
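The figures quoted above can be checked with a short worked calculation. It assumes the standard relation between Rfrequency, genome size G, and the number of sites γ, and it assumes G ≈ 3.07 × 10⁹ bp; the genome size is not stated in the text and is an assumption chosen for illustration.

```latex
\begin{align*}
R_{frequency} &= \log_2\!\left(\frac{G}{\gamma}\right)
  \quad\Longrightarrow\quad \gamma = G \cdot 2^{-R} \\[4pt]
\gamma\,(R = 14.43\ \text{bits}) &\approx \frac{3.07 \times 10^{9}}{2^{14.43}}
  \approx \frac{3.07 \times 10^{9}}{2.21 \times 10^{4}}
  \approx 1.39 \times 10^{5}\ \text{sites} \\[4pt]
R_{frequency}\,(\gamma = 355{,}831) &\approx \log_2\!\left(\frac{3.07 \times 10^{9}}{3.56 \times 10^{5}}\right)
  \approx 13.1\ \text{bits}
\end{align*}
```

Under this assumed genome size, the difference of roughly 1.3 bits between Rsequence and Rfrequency corresponds to the approximately 2.6-fold excess of detected over predicted sites.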
The optimal or consensus PXR/RXR binding site predicted by the model does not occur in the human genome, consistent with previous genome-wide analyses of other types of binding sites (22). The strongest PXR/RXR binding site in the human genome identified by Model 4 contains 22.96 bits of information, less than the 25.24 bits of information in the optimal or consensus site predicted by information theory. This may be biologically significant for the regulation of gene expression following xenobiotic exposure and, indeed, more relevant for transcription factor-DNA interactions in general. For example, PXR rapidly induces gene expression in the presence of a xenobiotic challenge but equally important must be the ability to switch off the response after the challenge is resolved. PXR/RXR heterodimers could bind and dissociate from their cognate binding sites through intermediate strength protein-DNA interactions of PXREs, with Ri
Rsequence. By contrast, it is less plausible that the protein would be easily displaced by basal transcription factors to relieve strong, high affinity interactions at PXREs resembling consensus sequences (Ri » Rsequence). This contention is supported by the difference in binding affinity between the strongest identified binding site in the genome and the consensus sequence.
The most frequently occurring PXR/RXR binding site in the human genome is contained within an L1 retrotransposon sequence. It has been proposed that plant-animal warfare fueled a dramatic increase in the number of cytochrome P450 isoforms approximately 400 million years ago (47), and it is tempting to speculate that random insertion of retrotransposons containing PXRE-like binding motifs throughout the genome may have conferred a selective advantage to individuals in the population when insertion occurred upstream of genes capable of minimizing the adverse consequences of potentially harmful xenobiotics from dietary sources (i.e. via biotransformation and enhanced elimination). Genomic insertion of transposable elements conferring a selective advantage for the host has been demonstrated in the apolipoprotein(a) gene promoter, which contains an enhancer derived from L1 sequences (48).
Excluding the PXREs found in L1 elements, the human genome scan also identified 1825 elements within the 5' regions of genes not previously recognized as PXR targets as well as within promoters of unannotated genes deduced from mRNAs mapped onto the genome sequence. Some potential targets of PXR identified by the model, including coproporphyrinogen oxidase and aldehyde dehydrogenase 1A1, have recently been described to be regulated in mice overexpressing human PXR (49). This illustrates the potential of transcription factor binding site models to provide information complementary to data from microarray studies. In addition, the genome scan with the PXR binding site model suggests that pathways affected by the PXR-mediated drug response may extend beyond drug biotransformation and transport to include transcription, apoptosis, signal transduction, and cell-cycle control, as shown in Table II (for all gene ontologies, see Supplementary Table 2).
Information analysis identified presumptive PXREs (with Ri > Rsequence) in the promoters of 1236 unannotated genes. As an example, a 19-bit site (chrX:118,291,236–118,291,259) was associated with a transcript that did not appear to be in close proximity to a previously described human gene. Upon closer inspection, genes with strong similarity to the human sequence were present in several mammalian species, the closest reported ortholog being from Saguinus oedipus, a New World monkey. The putative PXR/RXR binding site is located 48 bp upstream of the matching mRNA sequence (GenBank™ accession number AF142571), which is likely to be an intracellular vitamin D binding protein, a particularly intriguing finding given the close similarity between PXR (NR1I2) and the vitamin D receptor (NR1I1).
Transcription factor binding site models using information theory, such as the PXR binding site model described here, can be used to predict response elements in established, regulated genes as well as novel transcription factor targets. Furthermore, development of additional transcription factor binding site models will aid in the identification of cooperative interactions between trans-acting transcription factors and identify signatures of coordinately regulated genes. Integration of in silico approaches with the results of gene expression array analyses has the potential to dissect hierarchies of genes that are targets of the PXR genomic response as well as other trans-acting elements that modulate gene expression.
| FOOTNOTES |
|---|
The on-line version of this article (available at http://www.jbc.org) contains Supplementary Tables 1 and 2.
These authors contributed equally to this study.
|| To whom correspondence should be addressed: Section of Developmental Pharmacology and Experimental Therapeutics, Div. of Pediatric Clinical Pharmacology and Medical Toxicology, Children's Mercy Hospital and Clinics, 2401 Gillham Rd., Kansas City, MO 64108. Tel.: 816-234-3059; Fax: 816-855-1958; E-mail: sleeder{at}cmh.edu.
1 The abbreviations used are: PXR, pregnane X receptor; PXRE, PXR response element; DR3, direct repeat separated by 3 base pairs; DR4, direct repeat separated by 4 base pairs; EMSA, electrophoretic mobility shift assay; ER6, everted repeat separated by 6 base pairs; IR0, inverted repeat separated by 0 base pairs; Rfrequency, information required to find a site in the genome; Ri, information content of an individual member of a set of sequences; Rsequence, average information content of a set of related sequences; RXR, 9-cis retinoic acid receptor; RT, reverse transcription.
| ACKNOWLEDGMENTS |
|---|
| REFERENCES |
|---|
|
|
|---|