Remote Control of Gene Expression

The elucidation of a growing number of species' genomes heralds an unprecedented opportunity to ascertain functional attributes of non-coding sequences. In particular, cis regulatory modules (CRMs) controlling gene expression constitute a rich treasure trove of data to be defined and experimentally validated. Such information will provide insight into cell lineage determination and differentiation and the genetic basis of heritable diseases as well as the development of novel tools for restricting the inactivation of genes to specific cell types or conditions. Historically, the study of CRMs and their individual transcription factor binding sites has been limited to proximal regions around gene loci. Two important by-products of the genomics revolution, artificial chromosome vectors and comparative genomics, have fueled efforts to define an increasing number of CRMs acting remotely to control gene expression. Such regulation from a distance has challenged our perspectives of gene expression control and perhaps the very definition of a gene. This review summarizes current approaches to characterize remote control of gene expression in transgenic mice and inherent limitations for accurately interpreting the essential nature of CRM activity.

Development and differentiation of mammalian cells require myriad signal transduction pathways converging on the genome to effect exquisite control of gene expression through nearly 2,000 transcription factors binding hundreds of thousands of cis-acting regulatory elements. Only a subset of transcription factors is enlisted to transcribe a portion of the genome (i.e. a cell's transcriptome) necessary for normal homeostasis in each of the >250 distinct mammalian cell types. Perturbations in a cell's normal transcriptome underlie many diseases, and as promoters are increasingly analyzed, more and more promoter polymorphisms are being linked to an array of human disorders (1). Consequently, there has been tremendous interest in illuminating the transcriptional circuitry governing both normal and abnormal gene expression. Regulatory elements controlling gene expression were identified initially near a gene's start site of transcription, thus serving to define the boundaries of the first mammalian gene promoters and proximal enhancer regions (2-4). These and other studies paved the way for methods of appraising these proximal sequences in transgenic mice using bacterial reporter genes. It has become increasingly clear, however, that control of gene expression may proceed via cis-acting regulatory elements residing considerable distances from a gene's core promoter region (5). Such elements are often embedded within highly conserved non-coding sequences, collectively referred to as cis regulatory modules (CRMs). Finding and testing these remote control modules was virtually impossible until advancements were made to facilitate the sequencing and annotation of the human and mouse genomes. Here, we briefly discuss innovations in genomics for identifying and evaluating otherwise undetectable CRMs and their regulatory elements and the limitations associated with the study of these non-coding sequences.

Remote Control of Gene Expression Revealed by Artificial Chromosome Transgenic Mice
The conventional method for analyzing CRMs involves an "out of genomic context" assay wherein a portion of a gene's promoter or enhancer region is excised from its normal genomic landscape and cloned upstream of a non-mammalian reporter gene (e.g. firefly luciferase) in a plasmid-based vector having limited cloning capacity, typically less than 10 kilobases of DNA. Although these assays offer a facile approach to define the transcription factors that bind regulatory elements and modulate gene expression, they are very limited in terms of revealing physiological activity in a complex animal. Thus, it is often necessary to evaluate the same sequences in a transgenic animal such as the mouse. Note that an important prerequisite for such in vivo studies is a complete understanding of an endogenous gene's expression profile during embryonic and postnatal development. Many conventional in vivo transgenic studies have clearly demonstrated the important concept of separable regulatory elements controlling spatiotemporal patterns of gene expression. For example, in its native genomic context the expression of the endogenous SM22α gene occurs in essentially all smooth muscle cell lineages. When taken out of its normal genomic context, however, the SM22α proximal promoter is active primarily in arterial smooth muscle cells with little or no activity in such smooth muscle-rich tissues as stomach, uterus, and bladder, indicating that elements controlling expression of SM22α in these tissues reside outside of the region analyzed (6,7). When a single regulatory element (CArG box) is mutated, virtually complete loss of reporter gene activity is observed, suggesting a critical function for the CArG box in mediating SM22α promoter activity in arterial smooth muscle (8). Similar incomplete activity profiles have been reported for the Desmin and Csrp1 gene promoters (9,10).
In contrast, some promoter/enhancers documented in vitro display little or no activity when cloned in a conventional LacZ plasmid for transgenic mouse studies (11). These few examples imply that proper spatiotemporal expression of a gene requires modular CRMs that may reside remotely from the core promoter region. One important innovation of the human genome project was the development of artificial chromosomes, large capacity cloning vectors harboring hundreds of kilobases of genomic DNA. Such vectors offer a powerful means of capturing all CRMs and their regulatory elements controlling complete spatiotemporal expression of a gene, thus avoiding the incomplete activity profiles observed with most plasmid-based transgenic constructs (12). Distal regulatory elements within CRMs include not only enhancers but also silencers and an array of so-called boundary elements (e.g. insulators and locus control regions) that establish important points of transcriptional control through the formation of either transcriptional activation or repression complexes (13). Current estimates of the total number of CRMs are in the hundreds of thousands with many occurring in gene-poor regions of the genome (14). This staggering number of potentially important CRMs in mammalian genomes will require more sophisticated methods of analysis than traditional plasmid-based transgenic mouse studies.
Three artificial chromosome vectors have been designed to facilitate the study of gene regulation: yeast artificial chromosomes (YACs), P1 phage artificial chromosomes (PACs), and bacterial artificial chromosomes (BACs) (12). These cloning vectors were used initially to physically map genomes, to complement mutations in the mouse genome, and to model human diseases (12). For purposes of this review, we confine our discussion to artificial chromosomes used to define CRMs and their regulatory elements that control gene expression in the mouse though this technology can certainly be applied to other animals as well (15). As of this writing, ~200 mammalian genes have been studied with artificial chromosomes (supplemental table). The first artificial chromosome mouse utilized a YAC containing the human Tyrosinase gene, which was shown to rescue albinism in mice (16). Because the transgene was of human origin, these and other early YAC transgenic mouse studies designed human probes to distinguish the endogenous mouse gene from the human transgene. The emergence of BACs and PACs, which are much easier to manipulate and handle than YACs, led to their widespread use for the study of gene regulation in transgenic mice (supplemental table) though these vectors are more limited in cloning capacity (100-300 kb). As with YACs, early BAC/PAC studies relied on human-specific probes for analysis and in some cases species-specific antibodies (11). To simplify the analysis of artificial chromosome transgenic mice, genetic engineering studies evolved to integrate a convenient reporter (e.g. LacZ) into the artificial chromosome (17). Dozens of such studies have demonstrated the utility of this approach in defining embryonic and adult patterns of transgene expression (supplemental table).
Limitations of using reporters such as LacZ in artificial chromosomes include their susceptibility to genomic silencing and perturbations in genomic landscapes that could confer spurious promoter/CRM activities. For this reason, the most accurate approximation of a gene's complete set of CRMs is to evaluate the locus in its proper chromosomal landscape with no integration of a reporter gene. Of course, this may not always be possible inasmuch as some genes (e.g. Dystrophin) are too large to be accommodated in any artificial chromosome. Such large genes, however, could be analyzed with clever methods in BAC transgenesis that have been used to define remote CRMs (below).
Artificial chromosomes have been instrumental in demonstrating the existence of distal CRMs that are otherwise impossible to capture by conventional plasmid-based methods. In the examples above, conventional plasmid-based vectors revealed partial or no promoter activity in smooth muscle tissues, whereas subsequent use of BACs carrying human SM22α (18), Desmin (19), or Calponin (11) recapitulated the spatiotemporal expression of each respective endogenous mouse gene. The implication of these findings is that the BAC clones harbor all CRMs controlling complete expression of each gene. In some cases, the distance of the regulatory region from the core promoter has been found to extend beyond the confines of an artificial chromosome. For example, a 350-kb YAC containing the PDGFRα gene fails to completely rescue a null phenotype because the transgene is not expressed in pulmonary bronchiolar smooth muscle (20). This implies the existence of a bronchiolar smooth muscle-specific remote control element that lies outside of the boundaries of the 350-kb YAC clone. YACs of 600 and 350 kb were used to map distal upstream regions controlling expression of the SOX9 transcriptional regulator, deletions of which impart a rare bone defect known as campomelic dysplasia (21). Three distal enhancers controlling Gata2 gene expression in the urogenital system were discovered using a novel "BAC-trapping" approach wherein a series of BACs covering 1 megabase of DNA around the Gata2 locus were systematically analyzed in a LacZ-containing BAC vector (22). A similar approach, referred to as "BAC scanning," has been developed to survey large genomic intervals for functional, tissue-specific CRMs (23). Note that either of these BAC-based methods would be an ideal approach to evaluate CRMs across very large genes.
These powerful approaches do have limitations in that they often employ LacZ as a reporter for the facile detection of CRM activity and/or remove the CRM from its native genomic context. In the latter instance, it is possible that CRM activity may be associated with another gene that could lie remote from the region under study and have nothing to do with the expression of the gene of primary interest. In this context, a CRM controlling sonic hedgehog (Shh) expression is found nearly 1 megabase away from its promoter region in an intron of a gene (LMBR1) 6 genetic loci removed from Shh (24). Another limitation of assaying CRMs out of genomic context is that they may adopt spurious activities based on close proximity to sequences that are ordinarily remote from the CRM site. There is growing support for chromosomal looping models of gene transcription wherein distal CRMs adopt specialized structures involving protein-protein contacts that establish multiple levels of control for transcription, the spatial orientation of the transcribed gene, and its post-transcriptional processing (Fig. 1) (25). Such structures may also confer repressive effects on adjacent genes, the expression of which may be unnecessary in certain contexts. Removing distal CRMs from their native genomic context could alter these and other important contacts, thus yielding inaccurate activities. On the other hand, small genes (less than 100 kb) contained in a single artificial chromosome are amenable to systematic analysis of putative CRMs by artificial chromosome modifications (see below). Notwithstanding the above limitations, there have been great strides in defining distal CRMs driving tissue-specific expression of a growing number of genes, thus expanding our appreciation of the complexities of gene expression control and providing a foundation for further functional analyses.
As artificial chromosome mice were being generated, another important innovation stemming from the genomics revolution was rapidly being deployed to further identify CRMs that remotely control gene expression.

Remote Control of Gene Expression Revealed by Comparative Genomics
An initial hurdle to studying gene regulation with artificial chromosomes was pinpointing the location of CRMs and their associated regulatory elements in a large genomic landscape. In the past, various labor-intensive, wet-lab assays were used to identify candidate CRMs controlling gene expression followed by validations with reporter constructs introduced into cells or mice. Now, with multiple vertebrate genomes in hand, it is preferable to search for CRMs over extended genomic landscapes by comparing orthologous sequences followed by validations with wet-lab assays (26). Numerous algorithms are available on the internet to survey genomes for CRMs (27), which can then be tested either in isolation or in the context of an artificial chromosome transgenic mouse. For example, VISTA (wwwgsd.lbl.gov/vista/) (28) and PIPMaker (bio.cse.psu.edu) (29) were used to compare a 1-megabase YAC containing the human interleukin cluster of genes on 5q31 with the homologous region in the mouse. This analysis revealed a 400-bp regulatory module controlling interleukin-5 gene expression from a distance of 120 kb (30,31). Finding this module, which is two genes removed from the interleukin-5 locus, was not intuitive and might have gone undetected were it not for the comparative genomics approach afforded by VISTA and PIPMaker. The 400-bp interleukin-5 regulatory module provides yet another example of how genes are controlled by regulatory modules residing great distances from core promoters. Fig. 2 illustrates a VISTA plot of Myocardin (32), a cofactor functioning as a molecular switch for smooth muscle differentiation (33) and repressor of the transformed cancer phenotype (34). This cofactor's transcriptional regulation appears to require a distal CRM located ~40 kb upstream of the core promoter (35). Comparative genomics identified a critical MEF2 site necessary for CRM activity in an out of genomic context assay (35). 
It will be of great interest to ascertain whether this MEF2 site is necessary for Myocardin expression in the context of a BAC clone.
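The idea of scanning orthologous sequences for conserved non-coding blocks can be illustrated with a minimal sliding-window sketch. This is not the algorithm used by VISTA or PIPMaker; it is a simplified illustration that assumes the two sequences have already been aligned (gaps written as "-"). The function name `conserved_windows`, the 100-bp window, and the 70% identity cutoff are assumptions chosen for illustration only.

```python
from typing import List, Tuple


def conserved_windows(
    aln_a: str,
    aln_b: str,
    window: int = 100,           # assumed window length, in alignment columns
    min_identity: float = 0.70,  # assumed identity cutoff (fraction of columns)
) -> List[Tuple[int, int, float]]:
    """Return merged (start, end, peak_identity) intervals, in alignment
    coordinates (0-based, end-exclusive), where a sliding window reaches
    at least `min_identity` identity between two pre-aligned sequences."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")

    # 1 where both rows carry the same base; gap columns never count as matches.
    matches = [
        1 if a == b and a != "-" else 0
        for a, b in zip(aln_a.upper(), aln_b.upper())
    ]
    n = len(matches)
    if n < window:
        return []

    hits: List[Tuple[int, int, float]] = []
    running = sum(matches[:window])  # match count in the current window
    for start in range(n - window + 1):
        if start > 0:
            # Slide by one column: add the entering column, drop the leaving one.
            running += matches[start + window - 1] - matches[start - 1]
        identity = running / window
        if identity >= min_identity:
            end = start + window
            if hits and start <= hits[-1][1]:
                # Overlaps or touches the previous hit: extend that interval.
                prev_start, _, prev_peak = hits[-1]
                hits[-1] = (prev_start, end, max(prev_peak, identity))
            else:
                hits.append((start, end, identity))
    return hits
```

Intervals returned by such a scan would only be candidates; as the text emphasizes, conserved blocks still require wet-lab validation, ideally in their native genomic context.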
Recently, Pennacchio and colleagues (26,36) coupled comparative genomics to expression array data and transcription factor binding site analyses to identify some 7,000 candidate CRMs that drive tissue-specific gene expression. These studies are the basis for a large-scale transgenic enhancer activity assay that will systematically test highly conserved blocks of sequence in transgenic mice (see enhancer.lbl.gov) and will provide critical leads for subsequent analysis in artificial chromosomes. Such follow-up is needed because the CRMs are again studied out of their native genomic context and may therefore reveal only a portion of the full spectrum of regulatory elements underlying a gene's expression profile. Individual cis-acting regulatory elements within CRMs may also be identified by computational means though the reliability of finding these requires substantial insight into the transcription factor binding affinities and sequence plasticity across the element of interest (27,37). Limitations of comparative genomics for discovery of CRMs or individual cis regulatory elements include the inability to define coregulators that do not contact DNA directly and the inability to detect species differences in gene expression control that arise from sequence differences in CRMs.
Ultimately, the importance of distal regulatory elements should be determined by knocking them out in their native genomic context. Several proximal regulatory element knockouts have been generated in mice including the Hif1α response element of VEGF-A (38) and a CArG box upstream of the Telokin gene (39). A distal CRM of Shh, located nearly 1 megabase away from its promoter, has also been deleted in its native genomic context and shown to be essential for normal Shh expression (40). On the other hand, there have been some regulatory element knockouts with unpredictable outcomes. For example, an enhancer upstream of the Tal1 transcription factor, shown previously to be critical for expression with transgenic mouse studies, was found to be dispensable for expression when knocked out in its native context (41). An important innovation that has been of significant value in approximating conventional knockouts of regulatory elements is the ability to perform deletions or point mutations of such sequences in the context of an artificial chromosome. Several methods have been developed to introduce a variety of reporters (e.g. green fluorescent protein) or bacterial genes that can facilitate artificial chromosome transgenic analyses (supplemental table) (42). For example, a kanamycin cassette flanked with flp recombinase target sites was used to target an upstream region of the Sclerostin gene in a 170-kb BAC, and deletion of this region led to a severe decrease in Sclerostin expression (43). Similarly, by combining artificial chromosomes with comparative genomics and homologous recombination, researchers have demonstrated a critical role for distal enhancers in the control of a cluster of odorant receptors (44), the expression of interferon γ (45), and the cell-restricted expression of the Tyrosinase gene (46).
Although the incorporation of such alterations in an artificial chromosome is rather straightforward, great care is needed to ensure that any changes are restricted to only the target site of interest. As with conventional embryonic stem cell-based knockouts, deleting a regulatory element in artificial chromosome transgenic mice can yield unanticipated results. For example, a previously demonstrated silencer element for ε-globin gene expression was subsequently shown to have little influence on ε-globin expression when mutated in the context of a 213-kb YAC (47). This finding emphasizes the redundancies that exist for correct gene expression and the importance of combinatorial interactions that must occur across modular CRMs that may be tens or hundreds of kilobases apart. As such, single regulatory elements shown to be essential for gene expression in conventional LacZ reporter mouse studies (e.g. CArG box in SM22α (8) or MEF2 site in Myocardin (35)) may not yield similar results when mutated in the context of an artificial chromosome or their endogenous genomic position.

Conclusion
The genomics revolution has yielded a massive amount of data, with the amount of conserved non-coding sequence estimated to be more than twice that of all coding sequences (27). A significant portion of the non-coding sequence comprises CRMs that control gene expression locally or remotely from their target genes. The development of artificial chromosomes and an array of computational tools for comparative genomics has facilitated the identification and functional validation of many CRMs, including those residing phenomenal distances from their target gene. Given the importance of transcriptional control of gene expression in development (e.g. stem cell fates) and disease (e.g. most undefined functional polymorphisms remaining in the genome are likely associated with defective CRM activity), elucidating all CRMs and regulatory elements will be a major research goal for the foreseeable future. An important consideration as we move forward in this endeavor will be analyzing these non-coding sequences in their proper genomic context.