Development and Refinement of Pregnane X Receptor (PXR) DNA Binding Site Model Using Information Theory: Insights into PXR-Mediated Gene Regulation

The pregnane X receptor (PXR) acts as a receptor to induce gene expression in response to structurally diverse xenobiotics through binding as a heterodimer with the 9-cis retinoic acid receptor (RXR) to enhancers in target gene promoters. We identified and estimated the affinities of novel PXR/RXR binding sites in regulated genes and additional genomic targets of PXR with an information theory-based model of the PXR/RXR binding site. Our initial PXR/RXR model, the result of the alignment of 15 previously characterized binding sites, was used to scan the promoters of known PXR target genes. Sites from these genes, with information contents of >8 bits bound by PXR/RXR in vitro, were used to revise the information weight matrix; this procedure was repeated by screening for progressively weaker binding sites. After three iterations of refinement, the model was based on 48 validated PXR/RXR binding sites and has an average information content (R_sequence) of 14.43 ± 3.21 bits. A scan of the human genome predicted novel PXR/RXR binding sites in the promoters of UGT1A3 (19.78 bits at −8040 and 16.37 bits at −6930) and UGT1A6 (12.74 bits

The pregnane X receptor (PXR¹; also referred to as SXR, PAR, NR1I2) is a ligand-activated transcription factor that heterodimerizes with the 9-cis retinoic acid receptor (RXR, NR2B) and binds response elements in the promoters of regulated genes to induce gene expression (for review, see Kliewer et al. (1)). Human PXR is activated by numerous endogenous and xenobiotic compounds and induces the expression of several genes, such as CYP3A4, CYP2B6, UGT1A1, ABCB1, and MRP2, leading to the suggestion that PXR plays a significant role as a xenobiotic sensor that limits exposure following exogenous challenge. To mediate this adaptive response, PXR/RXR heterodimers bind to repeats of the hexanucleotide AG(G/T)TCA (2), as do related RXR partners, the constitutive androstane receptor, peroxisome proliferator-activated receptor, and the vitamin D receptor. For example, human PXR/RXR heterodimers interact with an everted repeat separated by 6 bp (ER6) from the CYP3A4 proximal promoter, but maximal induction of CYP3A4 gene expression apparently requires an additional DR3 (direct repeat with 3-bp spacer) element in a distal xenobiotic responsive enhancer module 8 kb upstream of the transcription start site (3-5). DR4 (direct repeat with 4-bp spacer) motifs have been identified in the ABCB1, NOS2A, and CYP2B6 gene promoters (6-8), whereas binding to an IR0 (inverted repeat without spacer nucleotides) has also been reported (9). The ability of PXR/RXR heterodimers to interact with a diverse set of response elements has hindered identification of the full complement of target genes that mediate the transcriptional response to PXR ligands.
Traditionally, computational identification of protein binding sites in DNA has relied on the use of consensus sequences initially defined experimentally. The strengths of binding sites can be ranked by principal component analysis, and their relative affinity can be estimated relative to a consensus reference sequence on a logarithmic scale (10). However, consensus sequences occur infrequently (11), and because they represent sites with strong binding affinity, they may not accurately measure the strengths of weak and intermediate sites, which represent the majority of protein recognition elements. A significant and generally unappreciated problem of identifying protein binding sites in DNA is this underlying bias toward consensus sequences that has hampered identification of functional, naturally occurring binding sites with lower affinities for cognate proteins.
An alternative approach to identifying and characterizing protein-DNA interactions is a computational method based on information theory (12-14). Information theory-based models of protein-DNA interactions show which nucleotides are permissible at both highly conserved and variable positions of binding sites (14-16). The average information of a set of related sequences, R_sequence, connotes the overall conservation of a set of DNA sites bound by the same recognizing protein(s) (17), whereas specific binding sites that are members of this set can be ranked by their corresponding distribution of individual information contents (R_i, measured in bits of information) (16). Strong sites have R_i values > R_sequence, and weak sites are those with R_i < R_sequence. For models based on large numbers of binding sites, the distribution of R_i values is approximately Gaussian (15). Both information theory and experimental studies demonstrate that nonfunctional sites have R_i values ≤ 0 bits of information (18,19). The zero coordinate on the R_i distribution can also be understood from a thermodynamic viewpoint. R_i values > 0 bits correspond to binding sites, because as entropy increases, energy is dissipated upon protein binding to the nucleic acid sequence. From the second law of thermodynamics, a 1-bit change in information corresponds to at least a 2-fold change in binding strength. Selection of the most frequent base at each position of the R_i(b,l) weight matrix produces the consensus sequence, which therefore has the largest R_i value and represents the upper bound of the distribution of R_i values. Through its relationship to thermodynamic entropy, the information measure reflects binding energy and has been extremely useful in understanding the consequences of allelic variants, which affect splicing of genes related to hereditary nonpolyposis colon cancer (19) and atherosclerosis (18).
* This work was funded by Grant R01 ES10855-04 from the National Institutes of Health. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
The on-line version of this article (available at http://www.jbc.org) contains Supplementary Tables 1 and 2.
§ These authors contributed equally to this study.
‖ To whom correspondence should be addressed: Section of Developmental Pharmacology and Experimental Therapeutics, Div. of Pediatric Clinical Pharmacology and Medical Toxicology, Children's Mercy Hospital and Clinics, 2401 Gillham Rd., Kansas City, MO 64108. Tel.: 816-234-3059; Fax: 816-855-1958; E-mail: sleeder@cmh.edu.
¹ The abbreviations used are: PXR, pregnane X receptor; PXRE, PXR response element; DR3, direct repeat separated by 3 base pairs; DR4, direct repeat separated by 4 base pairs; EMSA, electrophoretic mobility shift assay; ER6, everted repeat separated by 6 base pairs; IR0, inverted repeat separated by 0 base pairs; R_frequency, information required to find a site in the genome; R_i, information content of an individual member of a set of sequences; R_sequence, average information content of a set of related sequences; RXR, 9-cis retinoic acid receptor; RT, reverse transcription.
In this study, we have employed information theory to develop and refine a model of PXR binding elements. This model was validated by scanning the 5′ regulatory regions of genes regulated by PXR ligands and by testing novel PXR response elements (PXREs) for binding and transcriptional activation. Using this model, we have accurately and comprehensively identified PXREs in genomic sequences and have predicted, in silico, several novel human genes that are potentially regulated by PXR through these enhancer elements and thereby provided novel insights into the origin and scope of the PXR response.

MATERIALS AND METHODS
Information Theory-based Model Building for PXR/RXR Binding Sites-A binding site model for the PXR/RXR heterodimer was built from 15 previously reported PXR-response elements. Sequences were retrieved from GenBank™ and encompassed 20 bp upstream and 40 bp downstream of the putative 5′ nucleotide of the enhancer element. The sequences were aligned to minimize uncertainty (20), from which information weight matrices and individual information contents were computed (14,16). Subsequently, the model was refined by the addition of validated binding site sequences found in scans of promoters of the established PXR/RXR gene targets, synthetic sites predicted from the information theory-based model, and novel gene targets (Fig. 1). The average information (in bits) of a related set of sequences, R_sequence, represents the total sequence conservation (17),
\[ R_{sequence} = \sum_{l} \left\{ 2 - \left[ -\sum_{b} f(b,l) \log_2 f(b,l) + e(n(l)) \right] \right\} \]
where f(b,l) is the frequency of each base b at position l, and e(n(l)) is a correction for the small sample size n at position l.
The individual information, R_i, of a single member of a sequence family j is the dot product of its unitary sequence vector, s(b,l,j), and a weight matrix, R_iw(b,l), based on the base frequencies at each position of the sequence (16),
\[ R_i(j) = \sum_{l} \sum_{b} s(b,l,j) \, R_{iw}(b,l) \]
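The two quantities defined above can be illustrated with a short Python sketch. It is not the software used in this study; it assumes the weight takes the form R_iw(b,l) = 2 − (−log2 f(b,l) + e(n(l))), approximates e(n) as 3/(2 ln 2 · n), and assigns an unobserved base the frequency 1/(n + 2) so that its weight stays finite. The toy alignment and all function names are hypothetical.

```python
import math

BASES = "ACGT"

def small_sample_correction(n):
    # Assumed approximation of e(n) for a 4-letter alphabet: (4 - 1) / (2 ln 2 * n).
    return 3.0 / (2.0 * math.log(2) * n)

def weight_matrix(sites):
    """Return R_iw as a list of {base: bits} dicts, one dict per aligned position."""
    n = len(sites)
    e_n = small_sample_correction(n)
    matrix = []
    for l in range(len(sites[0])):
        column = [s[l] for s in sites]
        weights = {}
        for b in BASES:
            count = column.count(b)
            # Assumption: an unobserved base gets frequency 1/(n + 2) so log2 is finite.
            f = count / n if count else 1.0 / (n + 2)
            weights[b] = 2.0 - (-math.log2(f) + e_n)
        matrix.append(weights)
    return matrix

def individual_information(site, matrix):
    """R_i of one site: sum of the weights selected by its bases (a dot product)."""
    return sum(matrix[l][b] for l, b in enumerate(site))

def r_sequence(sites):
    """Average information of the alignment, summed over positions."""
    n = len(sites)
    e_n = small_sample_correction(n)
    total = 0.0
    for l in range(len(sites[0])):
        column = [s[l] for s in sites]
        uncertainty = 0.0
        for b in BASES:
            f = column.count(b) / n
            if f > 0:
                uncertainty -= f * math.log2(f)
        total += 2.0 - (uncertainty + e_n)
    return total

# Hypothetical toy alignment of half-site-like hexamers (not sites from this study).
toy_sites = ["AGTTCA", "AGGTCA", "AGTTCA", "GGTTCA", "AGTTCT"]
matrix = weight_matrix(toy_sites)
r_i = [individual_information(s, matrix) for s in toy_sites]
```

Under these assumptions the mean of the R_i values of the aligned sites equals R_sequence, and the sequence built from the most frequent base at each position scores highest, which is the relationship between R_i, R_sequence, and the consensus described in the Introduction.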
Scanning with PXR/RXR Models-Ten kb of the 5′ promoter regions of genes shown previously to be activated by PXR ligands in microarray experiments (21) were evaluated for PXR/RXR binding sites. For analyses of specific promoters, sequences were retrieved from GenBank™ or from the genome draft, and both strands were scanned with the PXR/RXR information model to identify putative binding sites with R_i > 0 bits.
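A scan of both strands of this kind can be sketched as follows. This is only an illustration of the idea, not the scanning program used here: the three-position weight matrix holds made-up values in bits, the function names are hypothetical, and the input is assumed to contain only uppercase A, C, G, and T.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def scan(sequence, matrix, threshold=0.0):
    """Return (position, strand, R_i) for every window scoring above threshold.

    Positions are 0-based coordinates of the window start on the forward strand.
    """
    width = len(matrix)
    hits = []
    for strand, seq in (("+", sequence), ("-", reverse_complement(sequence))):
        for start in range(len(seq) - width + 1):
            window = seq[start:start + width]
            score = sum(matrix[l][b] for l, b in enumerate(window))
            if score > threshold:
                # Map a reverse-strand window back to forward-strand coordinates.
                pos = start if strand == "+" else len(seq) - start - width
                hits.append((pos, strand, round(score, 2)))
    return hits

# Hypothetical weight matrix in bits (illustrative values only).
toy_matrix = [
    {"A": 1.5, "C": -2.0, "G": -1.0, "T": -2.0},
    {"A": -2.0, "C": -2.0, "G": 1.5, "T": -1.0},
    {"A": -1.0, "C": -2.0, "G": -2.0, "T": 1.5},
]
hits = scan("CCAGTCC", toy_matrix)
```

In this toy case the same three-base window passes the R_i > 0 cutoff on both strands, with different scores, which is why each strand is scored separately.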
For genome-wide analysis, the April 2003 human genome reference sequence was scanned with the Delila-Genome system (22) to identify putative binding sites within 10 kb (upstream and downstream) of the start sites of both known and novel PXR gene targets. R_frequency represents the amount of information required to find a site in the genome (17),
\[ R_{frequency} = \log_2 (G/\gamma) \]
where γ is the number of sites in the genome, and G is the size of the genome.
To characterize the results of this scan, accession numbers of mRNAs corresponding to genes with predicted PXR binding sites with R_i values exceeding R_sequence were parsed from the Genome Browser custom track and then batch-imported into MatchMiner (23) to obtain a list of Human Genome Organization-approved gene symbols (n = 1825). After removing duplicate loci, the global PXR response was categorized by examining the ontologies of those symbols that are annotated in UniProt (n = 579) (24).
Plasmids-Expression plasmids containing the cDNA for hRXRα (pSG5-hRXRα) and hPXR (pSG5-hPXRΔATG) (4) were kindly provided by Steve Kliewer (University of Texas, Southwestern) and Linda Moore (GlaxoSmithKline). Oligonucleotides containing the indicated response elements were annealed, phosphorylated with T4 polynucleotide kinase, and ligated into pGL3-promoter (Promega, Madison, WI) to generate firefly luciferase reporter constructs used for transient transfections. The plasmids were sequenced to confirm the presence and sequence of insertions.
Electrophoretic Mobility Shift Assays (EMSAs)-hPXR and hRXRα were translated from pSG5-hPXRΔATG and pSG5-hRXRα with the TNT transcription/translation system (Promega) according to the manufacturer's directions. Binding reactions (20 µl) contained 10 mM HEPES, pH 7.8, 60 mM KCl, 0.2% Nonidet P-40, 6% glycerol, 10 ng/ml poly(dI·dC), 1 mM dithiothreitol, and 1 µl each of hRXRα and hPXR or 2 µl of unprogrammed lysate. Reactions including the competitor oligonucleotides (0.1-30 pmol) were incubated on ice for 10 min before the addition of 0.01 pmol 32P-labeled CYP3A4 proximal PXRE oligonucleotide. The reactions were incubated on ice for an additional 15 min. Bound and unbound DNA were separated on a 5% polyacrylamide gel, which was run in 0.5× TBE (50 mM Tris, 45 mM boric acid, 0.5 mM EDTA). DNA-protein complexes were visualized by autoradiography and quantitated by densitometry on a Kodak Digital Sciences Image Station 440CF. Oligonucleotides used in this study can be found in Supplementary Table 1.
Transient Transfection Assays-HepG2 cells, obtained from ATCC (Manassas, VA), were plated in 12-well plates in Dulbecco's modified Eagle's medium, without phenol red (Invitrogen), supplemented with 5% fetal bovine serum 24 h before transfection. The cells were transfected with 100 ng of the indicated firefly luciferase reporter construct, 25 ng of pSG5-hPXRΔATG, and 5 ng of pRL-CMV with LipofectAMINE Plus (Invitrogen) according to the manufacturer's recommendations. Three hours after transfection, the cells were exposed to 0.1% Me2SO or 10 µM rifampin (Sigma) in Dulbecco's modified Eagle's medium without phenol red, supplemented with 2.5% fetal bovine serum for 24-48 h. The cells were harvested in passive lysis buffer (Promega), and luciferase activity was detected using the Dual-Luciferase reporter assay system from Promega in 10 µl of cell lysate with a Lumat LB 9507 (EG & G Berthold).
Quantitative Real-time PCR-RNA was isolated from HepG2 cells treated for the indicated times with 10 µM rifampin or Me2SO using TRIzol reagent (Invitrogen) according to the manufacturer's recommendations. Quantitative RT-PCR reactions were performed with 20 ng of total RNA using the QuantiTect™ SYBR® Green RT-PCR kit (Qiagen, Valencia, CA) and the CASP10-specific primers C614 (5′-ccg agt cgt atc aag gag agg aag aac-3′) and 3BRIDGE (5′-tat atg cac tgt gaa ccc aag cca-3′) (25).
Statistics and Curve Fitting-Validation set data were fitted using nonlinear logistic regression with R_i (estimate of predicted affinity) as the independent variable and IC50 values (surrogate measure of observed affinity) as the dependent variable in SigmaPlot 2000. To include sites for which an IC50 value could not be determined at concentrations of oligonucleotide tested in EMSAs, IC50 values were rank-ordered 1-12 from lowest to highest, and a rank of 14 was assigned as the average rank for all nonbinding sites. Spearman's correlation coefficients were determined in SPSS (version 12.0; SPSS Inc., Chicago, IL) between predicted and observed binding affinities.
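The ranking scheme can be reproduced in a few lines of Python, shown here as a sketch rather than the SPSS procedure itself. Nonbinding sites are coded as an infinite IC50 so that the three of them tie for ranks 13-15 and each receives the average rank of 14. All R_i and IC50 values below are hypothetical, and the function names are illustrative.

```python
def average_ranks(values):
    """Rank values from lowest to highest; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at sorted position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's rho computed as the Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical validation set: 12 measurable IC50 values (pmol) and 3 nonbinders.
ic50 = [0.2, 0.4, 0.7, 1.1, 1.6, 2.3, 3.0, 4.2, 5.5, 7.9, 12.0, 20.0] + [float("inf")] * 3
r_i = [20.1, 18.3, 17.5, 16.0, 15.2, 14.8, 13.9, 12.5, 11.7, 10.9, 10.1, 9.4, 8.8, 8.0, 7.5]

ic50_ranks = average_ranks(ic50)
rho = spearman(r_i, ic50)
```

A negative rho is the expected direction: sites with higher R_i should need less competitor, and therefore have lower IC50 ranks.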

RESULTS
The PXR binding site model was developed and refined as depicted in Fig. 1. The initial PXR/RXR binding site weight matrix was based on PXREs from the CYP3A4, CYP3A7, ABCB1, CYP2B6, and CYP2C9 gene promoters. As described under "Materials and Methods," the length of aligned sequences was set to 61 bp to identify the potential variation adjacent to the core binding site. The resulting sequence logo for Model 1 (Fig. 2a) has an average information content, R_sequence, of 17.06 ± 4.62 bits and is suggestive of two half-sites comprising an imperfect direct repeat. The 3′ half-site contains more information than the 5′ repeat. Additional information is contained in the positions flanking the two half-sites as well as the intervening sequences, and the site window was set to 23 bp to incorporate this information. The lower information content in the sequence logo of the 5′ half-site may indicate that the heterodimer is more tolerant of sequence variability, or it may reflect the multiple alignment of variably spaced half-sites, specifically ER6, DR3, and DR4 elements. The histogram in Fig. 2a shows the relative strength and frequency of the individual PXREs that contributed to Model 1 and reveals a bias toward sites with high individual information content (R_i; i.e. strong binding sites).
Model 1 was initially validated by scanning 10 kb of the CYP3A4 and CYP2B6 gene promoters to determine the strengths of previously established PXR binding sites. Fig. 3a shows the results of the scan and demonstrates that the proximal PXRE and xenobiotic responsive enhancer module (highlighted) of CYP3A4 were identified by Model 1. In addition, numerous weaker sites were identified, several of which formed a cluster of sites around the xenobiotic responsive enhancer module. Model 1 also identified the phenobarbital-responsive enhancer module and the recently described distal PXRE from the CYP2B6 gene promoter (26) (data not shown). The PXR/RXR Model 1 was also used to scan 10 kb of the 5′ regulatory regions of the UGT1A3 and UGT1A6 genes (Fig. 3, b and c), which were shown in microarray experiments to be induced by rifampin, a potent agonist for human PXR (21,27). Model 1 identified potential PXREs in both gene promoters with R_i values >9.0 bits of information. Sequence walkers of novel PXREs identified in the gene promoters of UGT1A3 (10.86 bits, 10.66 bits) and UGT1A6 (9.88 bits) are shown in Table I.
Competition EMSAs were performed to test the binding affinities of PXREs identified in the UGT1A3 and UGT1A6 promoters relative to the proximal PXRE (pPXRE) from the CYP3A4 gene promoter. All three PXREs were able to compete for binding with the CYP3A4 pPXRE (Fig. 1; data not shown). To quantify the relative binding affinity of PXR/RXR for novel sequences, the intensity of shifted complexes was measured as described under "Materials and Methods" and plotted versus concentration of the unlabeled competitor. We compared the concentration of competitor required to deplete binding by 50% (IC50) to a similar competition with excess, unlabeled CYP3A4 pPXRE (Table I). Model 1 overestimates the minimum theoretical change in binding affinity for the PXREs from the UGT1A3 and UGT1A6 gene promoters, relative to the pPXRE from the CYP3A4 gene promoter (2^(R_i(CYP3A4) − R_i(PXRE))). This is most likely because of the bias of Model 1 toward detection of strong binding sites and the low abundance of sites in Model 1 with R_i values between 8.0 and 15.0 bits of information, the range of R_i values for the PXREs evaluated in this study (Fig. 2a).
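One simple way to read an IC50 from such a competition curve is log-linear interpolation between the two competitor amounts that bracket 50% binding. The sketch below is an assumption for illustration and is not the curve-fitting procedure used in this study; the data points and the function name are hypothetical. Returning None marks a competitor that never reduced binding to 50% over the range tested.

```python
import math

def ic50_by_interpolation(amounts, fraction_bound):
    """Estimate IC50 from competitor amounts (ascending, pmol) and the fraction
    of labeled probe still bound relative to the no-competitor control."""
    points = list(zip(amounts, fraction_bound))
    for (c_lo, f_lo), (c_hi, f_hi) in zip(points, points[1:]):
        if f_lo >= 0.5 >= f_hi and f_lo != f_hi:
            # Interpolate linearly in log10(amount) between the bracketing points.
            t = (f_lo - 0.5) / (f_lo - f_hi)
            log_ic50 = math.log10(c_lo) + t * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_ic50
    return None  # binding never dropped to 50% in the tested range

# Hypothetical competition curves.
binder = ic50_by_interpolation([0.1, 1.0, 10.0], [0.9, 0.5, 0.1])
nonbinder = ic50_by_interpolation([0.1, 1.0, 10.0], [0.95, 0.9, 0.8])
```

The ratio of two such IC50 values gives the observed fold difference in affinity that can be compared with the predicted minimum of 2 raised to the difference in R_i.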
The three novel PXR/RXR binding sites identified by Model 1 and the recently described distal PXRE from the CYP2B6 gene promoter (26) were aligned with the original set of PXREs to generate Model 2. The resulting sequence logo (Fig. 2b) has an R_sequence of 17.00 ± 3.52 bits. Model 2 is still biased toward sites with higher information content but begins to approach a normal distribution over the range of sites aligned (Fig. 2b). The novel PXREs identified with Model 1 were re-evaluated with Model 2. The information content for all three sites is increased in the revised model, and the minimum theoretical changes in binding affinity are no longer overestimated for the sites detected in the UGT1A3 and UGT1A6 gene promoters (Table I).
We developed a strategy to eliminate consensus sequence bias by introducing weaker sites into the model. The promoters of additional genes known from published literature and microarray studies to be induced by PXR/RXR ligands were scanned for potential binding sites using Model 2. Thirty-five sites (Supplementary Table 1) that had at least 6 bits of information were identified in thirteen gene promoters and were evaluated with EMSAs. Candidate sequences that were able to competitively inhibit the binding of PXR/RXR to the CYP3A4 pPXRE in a concentration-dependent manner were added to those in Model 2. Model 3 represents the alignment of 32 PXR/RXR binding sites (19 from Model 2 plus 13 novel sites) and reduced R_sequence to 14.94 ± 4.04 bits (Fig. 2c). The distribution of R_i values for sites in Model 3 approaches a normal distribution between 5.7 and 21.7 bits of information (Fig. 2c), indicating the presence of weaker sites in Model 3 relative to Models 1 and 2.
An element from the NOS2A gene promoter previously characterized as a strong PXRE (7) [...] tion), and adenine occurs most frequently (A on top of stack) in Model 3 (Fig. 2c). To test the requirement of adenine at positions 0 and 10, competition EMSAs were performed with 14 oligonucleotides containing PXREs aligned in Model 3 with guanine substituted for adenine at positions 0 and 10. All of the synthetic sequences were capable of competing for binding of PXR/RXR to the pPXRE from the CYP3A4 gene promoter (data not shown). Model 4 was generated by the addition of these synthetic sequences to the alignment. Model 4 has an R_sequence of 14.43 ± 3.21 bits (Fig. 2d) and contains 48 sites ranging between 7.2 and 20.7 bits of information (Fig. 2d). The R_i value for the PXRE from NOS2A increased from 5.78 to 13.99 bits of information (or a 294-fold increase in predicted affinity), better predicting the strong affinity of PXR/RXR for this site.
An independent validation set of fifteen predicted binding sites in the human genome was developed to determine the ability of Model 4 to predict PXR/RXR binding site affinity accurately. Competition EMSAs were performed with labeled pPXRE from the CYP3A4 gene promoter to assess the relative binding affinities of sites in the validation set. Twelve of the fifteen predicted sites competed for PXR/RXR binding at the concentrations tested, and IC50 values were determined as described under "Materials and Methods." The validation set data were fitted using nonlinear logistic regression. R_i values derived from Model 4 provided better prediction of binding affinity (r² = 0.50, p < 0.05) versus Model 3 (r² = 0.19), which was not significantly different from Models 2 (r² = 0.06) or 1 (r² = 0.26). However, this analysis disregards three sites that did not compete for PXR/RXR binding. To incorporate the nonbinding sites, IC50 values were ranked with binding sites ordered 1-12, and the three nonbinding sites were assigned the average rank of 14, because relative binding affinities could not be determined. Using Spearman's ρ, a significant correlation was observed between R_i and IC50 for Models 3 (ρ = −0.649, p = 0.01) and 4 (ρ = −0.628, p = 0.01) but not for Models 1 or 2 (ρ = −0.355, p = 0.194). The lack of superiority of Model 4 over Model 3 with this analysis may be due to the fact that Model 4 was the result of adding synthetic sequences to Model 3, in which positions 0 and 10 were converted from adenine to guanine to assess the requirement of adenine at this position. The validation set contained only two sites where improvements in the model could be tested (guanine at positions 0 and 10).
Model 4 was used to scan the human genome draft to identify PXR/RXR binding sites with >10 bits of information within 10 kb of transcribed genes. The scan of Build 33 (April, 2003 release) resulted in 355,831 PXR/RXR binding sites with R_i values >10 bits, 190,490 of which were unique sequences. Fig. 4 displays a frequency histogram of R_i values of sites predicted by Model 4. As expected from information theory (28), as R_i approaches the individual information content of the optimal PXR/RXR binding site predicted by the model, the number of sites predicted in the genome decreases. In fact, the strongest site in the human genome has an R_i value of 22.96 bits, less than the 25.24 bits of information in the optimal, or "consensus," PXR/RXR binding site predicted by the model. This translates into a binding affinity at least 4.86-fold lower than that of PXR/RXR binding to the optimal binding site sequence.
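The fold differences quoted in this section follow from the rule that each bit of information corresponds to at least a 2-fold change in binding strength, so the minimum fold change between two sites is 2 raised to the difference in their R_i values. The short Python check below uses only R_i values stated in the text (22.96 bits for the strongest genomic site is the value given in the Discussion); the function name is illustrative.

```python
def minimum_fold_change(r_i_stronger, r_i_weaker):
    """Minimum fold difference in binding implied by a difference in R_i (bits)."""
    return 2.0 ** (r_i_stronger - r_i_weaker)

# Consensus site (25.24 bits) versus strongest genomic site (22.96 bits).
consensus_vs_genome = minimum_fold_change(25.24, 22.96)

# NOS2A PXRE under Model 4 (13.99 bits) versus Model 3 (5.78 bits).
nos2a_model4_vs_model3 = minimum_fold_change(13.99, 5.78)
```

The first value comes to about 4.86, matching the text. The second comes to roughly 296 with the unrounded difference of 8.21 bits; the reported 294-fold figure is what a difference rounded to 8.2 bits would give, which may explain the small discrepancy.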
The binding site most commonly identified by Model 4 with >10 bits of information was a site with 14.51 bits of information, very close to the average binding strength. This PXR/RXR binding sequence appears as a spike in the frequency distribution (Fig. 4) over other sites with this R_i value and is significantly more abundant than predicted by a normal distribution. It occurs 4420 times in the genome with all examples occurring in long interspersed nuclear elements 1 (L1) retrotransposon sequences. Interestingly, this site is capable of competing for PXR/RXR binding to the CYP3A4 pPXRE in EMSAs but is unable to activate transcription in transient transfection assays (data not shown).
A scan of promoters in the human genome with Model 4 identifies several sites within the regulatory regions of genes not previously recognized to be regulated by PXR. One example is the CASP10 gene promoter, which contains a cluster of three sites with R_i values of 14.13, 14.66, and 18.69 bits between 7 and 8 kb upstream (Fig. 3d). Caspase 10 is an initiator caspase that initiates apoptosis following proteolytic cleavage in response to the binding of death receptors to ligands and is regulated transcriptionally as well as post-translationally (29). The 18-bit site (Table I) was tested to determine whether (and how well) it was bound by PXR/RXR heterodimers. This site competed effectively for binding of PXR/RXR; however, it had an apparent affinity 6.2-fold less than that observed with the CYP3A4 pPXRE (Table I).
To determine whether the novel PXR/RXR binding sites identified with the model, as well as the predicted optimal PXR/RXR binding site, could modulate transcription, luciferase activity was measured in HepG2 cells transiently transfected with reporter constructs containing two copies of the PXREs upstream of the SV40 basal promoter of the pGL3-promoter. Rifampin (10 µM) induced luciferase activity in cells transfected with constructs containing the pPXRE from the CYP3A4 gene promoter, the distal PXRE from the CYP2B6 gene promoter, the 18-bit site from the CASP10 gene promoter, and the optimal PXR/RXR binding site (Fig. 5a). Rifampin did not affect luciferase activity in cells transfected with the pGL3-promoter vector in the absence of any upstream PXREs. Induction of transcription through the PXR/RXR binding site from the CASP10 gene promoter was independent of orientation (sense or antisense) relative to the start of transcription, suggesting that it may act as a true enhancer element. In contrast, rifampin had no effect on luciferase activity in cells transiently transfected with reporter constructs containing two copies of the PXR/RXR binding sites from the UGT1A3 or UGT1A6 gene promoters, implying that PXR/RXR binding to these sequences alone is not sufficient to drive induction of these genes.
Real-time RT-PCR was used to determine the effects of rifampin on endogenous CASP10 mRNA expression in HepG2 cells. Exposure to 10 µM rifampin (for 0.5 and 6 h) transiently decreased CASP10 mRNA relative to vehicle-treated cells up to 2-fold. However, longer treatment (48 h) increased expression of CASP10 2-fold relative to Me2SO alone (Fig. 5b). These data suggest that CASP10 may be a target for PXR-mediated regulation. Collectively, the EMSA and transient transfection experiments demonstrate the potential of an information theory-based model in identifying regulated genes that have not been previously recognized as targets of PXR/RXR ligands and to locate elements within the gene promoters to which PXR/RXR heterodimers may bind and initiate transcription.

DISCUSSION
Recognition of transcription factor binding sites has relied on identification of conserved genetic elements resembling consensus sequences in promoters. The bias toward consensus sequences in published transcription factor binding data (for review, see Ref. 30) contrasts with the near-Gaussian distributions of R_i values seen in other models, i.e. natural splice sites (31,32). This distinction reflects differences between the processes involved in ascertaining splice sites and transcription factor binding sites. Although exon boundaries objectively define the coordinates of splice sites, discovery of transcription factor binding sites has been deduced by experimental evaluation of considerably smaller numbers of factors that bind upstream of regulated genes. Homology-based evaluations of promoter sequences of genes regulated by the same transcription factors have led to over-representation of consensus sequences in public databases. Furthermore, identification of binding sites by systematic evolution of ligands by exponential enrichment (SELEX) and related techniques inherently selects strong binding sites (33), and weak binding events are often downplayed in the published literature as experimental artifacts (34).
As a result, many existing sources of transcription factor binding site data do not adequately describe the range of natural variation present in such sites. Commonly used promoter prediction software does not typically reveal this underlying deficiency (35), whereas this bias is clearly evident from information theory-based models. Furthermore, boundaries of sites developed from consensus sequences are often arbitrarily defined (e.g. TRANSFAC, www.biobase.de) and tend to ignore the information from weakly conserved positions adjacent to core recognition sequences. By contrast, information analysis optimally aligns sets of binding sites by minimizing uncertainty, and the contributions of all nucleotide positions where information exceeds background noise are represented.
We have developed an information theory-based model of the PXR responsive element beginning with fifteen previously reported binding sites for PXR/RXR and subjecting this initial model to three rounds of iterative refinement by adding naturally occurring and synthetic binding sites (Figs. 1 and 2). Model 4 is the result of the alignment of 48 binding sites, and although some bias toward strong sites remains, the strengths of sites in the model are normally distributed across the range of R_i values represented. Furthermore, Model 4 is biased toward a DR4, most likely because of the prevalence of DR4 sites versus DR3 or ER6 binding sites in the initial alignment of fifteen sites. It is conceivable that certain DR3, ER6, or IR0 sites may not be identified as PXR/RXR binding sites or that their strengths may be incorrectly estimated, a limitation of the current model. A bipartite model of PXR/RXR binding sites is currently under development to accommodate variable spacing between the two half-sites (36).
In Model 4 of the PXR/RXR binding site, the 5′ half-site is less well conserved than the 3′ half-site. Previous studies of other RXR heterodimers have shown that RXR occupies the upstream half-site, whereas its partner occupies the downstream half-site on repeats separated by 2-5 bp (37-40). The increased information of the 3′ half-site may direct, in part, specificity of the element to PXR/RXR heterodimers, whereas the decreased information of the 5′ half-site may reflect the requirement for RXR to recognize a wider variety of sequence motifs through its dimerization with numerous other partners.
Surprisingly, validated moderate to high affinity sites in the promoters of the UGT1A gene family members did not confer responsiveness to rifampin in transient transfection assays in the HepG2 cell line. Although we cannot exclude the possibility that these sites may actually be recognized by the closely related constitutive androstane receptor (NR1I3) that also heterodimerizes with RXR, these observations may instead indicate that binding alone may be insufficient for transactivation by PXR/RXR. Ligand-induced interactions with coactivators, such as SRC-1, or other transcription factors may be required (4,41,42). For example, interactions between PXR or the constitutive androstane receptor and hepatocyte nuclear factor 4α are required for basal and inducible expression of CYP3A4 (43). This is consistent with the possibility that PXR binding sites with moderate amounts of information may require such interaction(s) with coactivators or other trans-acting factors to initiate transcription.
R_frequency reflects the amount of information required to distinguish a binding site from all possible sites in the genome and is determined from the size of and number of sites in the genome (44,45). For Model 4, R_frequency (28) is 13.1 bits, which falls within the model variance. However, the R_sequence value for Model 4 (14.43 bits) predicts 139,038 sites to be found in the human genome. R_frequency is often similar to R_sequence (28), because the transcription factor state space is constrained for DNA binding proteins that operate on the genome in which they are encoded (46). However, the genome scan with Model 4 detected 355,831 potential PXR/RXR binding sites, with R_i > 10 bits within 10 kb (upstream and downstream) of the transcription start sites. Possible explanations for this discordant result include: (a) lack of specificity for detecting functional PXR/RXR binding sites, (b) potential overlap and co-recognition with other related nuclear receptor recognition sequences, or (c) R_frequency is underestimated, because only a subset of the genome is in fact accessible to PXR/RXR.
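The relationship between these figures can be checked numerically. The sketch below assumes R_frequency = log2(G/γ) and, conversely, that an information content of R bits implies G / 2^R expected sites. The genome size G is not stated in this section; the value of 3.07 × 10^9 bp is an assumption chosen because it is consistent with the reported numbers. Function names are illustrative.

```python
import math

def r_frequency(genome_size, n_sites):
    """Bits needed to pick n_sites positions out of genome_size positions."""
    return math.log2(genome_size / n_sites)

def expected_sites(genome_size, information_bits):
    """Number of sites implied if each site carries information_bits of information."""
    return genome_size / 2.0 ** information_bits

GENOME_SIZE = 3.07e9  # assumption: approximate size in bp, not taken from the text

bits_from_scan = r_frequency(GENOME_SIZE, 355_831)   # sites found by the Model 4 scan
sites_from_model = expected_sites(GENOME_SIZE, 14.43)  # R_sequence of Model 4
```

With this assumed G, the scan count of 355,831 sites corresponds to about 13.1 bits and an R_sequence of 14.43 bits corresponds to roughly 139,000 sites, in line with the values quoted above. The 1.3-bit gap between the two is equivalent to the roughly 2.6-fold excess of detected over predicted sites.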
The optimal or consensus PXR/RXR binding site predicted by the model does not occur in the human genome, consistent with previous genome-wide analyses of other types of binding sites (22). The strongest PXR/RXR binding site in the human genome identified by Model 4 contains 22.96 bits of information, less than the 25.24 bits of information in the optimal or consensus site predicted by information theory. This may be biologically significant for the regulation of gene expression following xenobiotic exposure and, indeed, more relevant for transcription factor-DNA interactions in general. For example, PXR rapidly induces gene expression in the presence of a xenobiotic challenge but equally important must be the ability to switch off the response after the challenge is resolved. PXR/RXR heterodimers could bind and dissociate from their cognate binding sites through intermediate strength protein-DNA interactions of PXREs, with R i ≈ R sequence. By contrast, it is less plausible that the protein would be easily displaced by basal transcription factors to relieve strong, high affinity interactions at PXREs resembling consensus sequences (R i >> R sequence). This contention is supported by the difference in binding affinity between the strongest identified binding site in the genome and the consensus sequence.
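The relationship between individual site information (R i), the model average (R sequence), and the consensus site can be illustrated with a minimal sketch. It uses a simplified weight matrix, Riw(b, l) = 2 + log2 f(b, l), with a pseudocount in place of the small-sample correction used in the information theory literature. The six-base toy sequences are invented for illustration and are not the 48 validated PXR/RXR sites; all function names are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"


def weight_matrix(sites, pseudocount=0.5):
    """Simplified individual information weight matrix.

    Each entry is 2 + log2 f(b, l). ASSUMPTION: a pseudocount replaces the
    small-sample correction, so values are illustrative only.
    """
    n = len(sites)
    matrix = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        column = {}
        for base in BASES:
            freq = (counts[base] + pseudocount) / (n + 4 * pseudocount)
            column[base] = 2.0 + math.log2(freq)
        matrix.append(column)
    return matrix


def site_information(matrix, site):
    """R i: sum of the weights selected by each base of the site."""
    return sum(column[base] for column, base in zip(matrix, site))


def consensus(matrix):
    """Highest-scoring base at every position, and the resulting R i."""
    seq = "".join(max(column, key=column.get) for column in matrix)
    return seq, site_information(matrix, seq)


# Toy half-site variants (made up); each differs from AGGTCA at one position.
toy_sites = ["AGTTCA", "AGGTCG", "TGGTCA", "AGGACA", "AGGTTA"]

matrix = weight_matrix(toy_sites)
r_i = [site_information(matrix, s) for s in toy_sites]
r_sequence = sum(r_i) / len(r_i)           # mean information of the sites
consensus_seq, consensus_bits = consensus(matrix)

print("R i per site:", [round(x, 2) for x in r_i])
print("R sequence (mean):", round(r_sequence, 2))
print("Consensus:", consensus_seq, round(consensus_bits, 2), "bits")
```

In this toy set the consensus sequence is absent from the input sites yet carries more information than any of them, mirroring the observation that the strongest genomic site falls short of the theoretical optimum.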
The most frequently occurring PXR/RXR binding site in the human genome is contained within an L1 retrotransposon sequence. It has been proposed that plant-animal warfare fueled a dramatic increase in the number of cytochrome P450 isoforms ~400 million years ago (47), and it is tempting to speculate that random insertion of retrotransposons containing PXRE-like binding motifs throughout the genome may have conferred a selective advantage to individuals in the population when insertion occurred upstream of genes capable of minimizing the adverse consequences of potentially harmful xenobiotics from dietary sources (i.e. via biotransformation and enhanced elimination). Genomic insertion of transposable elements conferring a selective advantage for the host has been demonstrated in the apolipoprotein(a) gene promoter, which contains an enhancer derived from L1 sequences (48). Excluding the PXREs found in L1 elements, the human genome scan also identified 1825 elements within the 5′ regions of genes not previously recognized as PXR targets as well as within promoters of unannotated genes deduced from mRNAs mapped onto the genome sequence. Some potential targets of PXR identified by the model, including coproporphyrinogen oxidase and aldehyde dehydrogenase 1A1, have recently been described to be regulated in mice overexpressing human PXR (49). This illustrates the potential of transcription factor binding site models to provide information complementary to data from microarray studies. In addition, the genome scan with the PXR binding site model suggests that pathways affected by the PXR-mediated drug response may extend beyond drug biotransformation and transport to include transcription, apoptosis, signal transduction, and cell-cycle control, as shown in Table II (for all gene ontologies, see Supplementary Table 2).
Evaluation of an 18-bit candidate PXR/RXR binding site from the CASP10 gene promoter provides the basis for new insights into the scope of PXR/RXR-mediated adaptive responses. This site functioned as an enhancer element in transient transfection assays. The time-dependent changes in CASP10 mRNA expression (a decrease in CASP10 message early after rifampin exposure, followed by induction of CASP10 mRNA after long-term rifampin exposure) are consistent with suppression of apoptosis contributing to the increased liver mass observed in rodents following treatment with chemical inducers. Furthermore, the data suggest that in a genomic context, PXR/RXR-mediated regulation of CASP10 gene expression is likely to be quite complex. Regulation of genes involved in apoptosis suggests a role for PXR beyond transport and metabolism in the cellular response to xenobiotics and may, for example, represent the response to xenobiotic challenge when all other adaptive mechanisms have been overwhelmed.
Information analysis identified presumptive PXREs (with R i > R sequence) in the promoters of 1236 unannotated genes. As an example, a 19-bit site (chrX:118,291,236-118,291,259) was associated with a transcript that did not appear to be in close proximity to a previously described human gene. Upon closer inspection, genes with strong similarity to the human sequence were present in several mammalian species, the closest reported ortholog being from Saguinus oedipus, a New World monkey. The putative PXR/RXR binding site is located 48 bp upstream of the matching mRNA sequence (GenBank™ accession number AF142571), which is likely to encode an intracellular vitamin D binding protein, a particularly intriguing finding given the close similarity between PXR (NR1I2) and the vitamin D receptor (NR1I1).
Transcription factor binding site models using information theory, such as the PXR binding site model described here, can be used to predict response elements in established, regulated genes as well as novel transcription factor targets. Furthermore, development of additional transcription factor binding site models will aid in identifying cooperative interactions between trans-acting transcription factors and signatures of coordinately regulated genes. Integration of in silico approaches with the results of gene expression array analyses has the potential to dissect hierarchies of genes that are targets of the PXR genomic response as well as other trans-acting elements that modulate gene expression.