Modular Organization of Phylogenetically Conserved Domains Controlling Developmental Regulation of the Human Skeletal Myosin Heavy Chain Gene Family*

The mammalian skeletal myosin heavy chain locus is composed of a six-membered family of tandemly linked genes whose complex regulation plays a central role in striated muscle development and diversification. We have used publicly available genomic DNA sequences to provide a theoretical foundation for an experimental analysis of transcriptional regulation among the six promoters at this locus. After reconstruction of annotated drafts of the human and murine loci from fragmented DNA sequences, phylogenetic footprint analysis of each of the six promoters using standard and Bayesian alignment algorithms revealed unexpected patterns of DNA sequence conservation among orthologous and paralogous gene pairs. The conserved domains within 2.0 kilobases of each transcriptional start site are rich in putative muscle-specific transcription factor binding sites. Experiments based on plasmid transfection in vitro and electroporation in vivo validated several predictions of the bioinformatic analysis, yielding a picture of synergistic interaction between proximal and distal promoter elements in controlling developmental stage-specific gene activation. Of particular interest for future studies of heterologous gene expression is a 650-base pair construct containing modules from the proximal and distal human embryonic myosin heavy chain promoter that drives extraordinarily powerful transcription during muscle differentiation in vitro.


From the Department of Surgery, University of Pennsylvania Medical System, Philadelphia, Pennsylvania 19104
Throughout metazoan evolution, the range of molecular mechanisms used to achieve functional diversity in muscle partially reflects the anatomic complexity of the species under consideration. In vertebrates, the maximal contractile speed and energy utilization rate of individual muscle fibers are largely controlled by the structure of the motor protein myosin (reviewed in Ref. 1). The conventional myosins of striated muscle are hexameric proteins composed of paired trimers of myosin heavy chain (which contains the ATP-splitting motor domain) and one each of the essential and regulatory light chains. The human genome has at least 11 distinct striated myosin heavy chain (MyHC) genes (2), six of which are abundantly expressed in skeletal muscle and are tandemly linked at a single locus on chromosome 17 (3-5). The members of the human skeletal MyHC locus have 38 coding exons and two 5′ noncoding exons. Unlike striated MyHC isoform diversification in Drosophila, which reflects a complex pattern of alternative exon splicing (6,7), the process in vertebrates relies on sequential activation of a family of distinct genes, each encoding a single MyHC (8). By virtue of the extraordinary abundance of these proteins (35% of muscle protein), a large body of experimental data on MyHC isoform switching was assimilated before isolation of the encoding genes (9).
During embryonic myogenesis in mammals, the transcriptionally 5′-most gene in the skeletal MyHC locus is activated first. During fetal muscle development, the perinatal gene is selectively activated in a large proportion of the muscle cells with progressive down-regulation of the embryonic gene until birth. During subsequent postnatal growth, the perinatal MyHC is progressively replaced by the three "type II" (adult fast twitch) MyHCs. Throughout adult life, reversible transitions are possible in response to cycles of degeneration and regeneration or alterations in the hormonal environment, innervation, and pattern of locomotive recruitment of individual muscle fibers (1,10). A sixth gene at the skeletal MyHC locus is expressed only in selected extraocular and laryngeal muscle fibers (11). The order of sequential activation during development contrasts with the physical order of the linked genes: 5′-embryonic, IIa, IId/x, IIb, perinatal, extraocular-3′. A seventh MyHC abundantly expressed in type I (slow twitch) skeletal muscle fibers is encoded by the physically unlinked cardiac β MyHC gene. At all times, the stoichiometry of myofibril assembly must be maintained such that inactivation or down-regulation of one skeletal MyHC gene must be precisely offset by activation or up-regulation of another.
Two exceptional attributes of the mammalian skeletal MyHC locus, size and tandem gene number, have limited the study of transcriptional regulatory mechanisms within this gene family. Predating the modern era of general information about transcription factor binding sites, studies of a 1.4-kb region of genomic DNA upstream from the rat embryonic MyHC gene suggested a complex pattern of regulation by counteracting cis-elements (12). Studies of a large number of other single copy genes expressed during myogenesis have more recently defined roles and binding specificities for several transcription factors, providing a basis for general models of promoter and enhancer function during commitment and terminal differentiation. Transcription factors of the MyoD/MRF and MEF-2 families play pivotal and synergistic roles in the activation and maintenance of the myogenic differentiation pathway (13,14). Although the precise mechanisms by which these factors achieve transcriptional activation remain unclear, there is increasing evidence for a central role of histone modification and chromatin remodeling. In addition to MRF and MEF-2, numerous ubiquitous and muscle-specific transcription factors have been found to play important roles in gene up-regulation during terminal myogenesis. A partial list reflecting the extent of characterization includes AP1, AP2, GR, Oct-1, SRF, Sp1, TEF, and TR (15-20). Factors traditionally associated with other cell lineages (e.g. GATA and hematopoiesis) have also been found to play important roles in the expression of some muscle-specific genes (21). For all of the factors listed above, information about binding specificities has been incorporated into publicly available nucleotide weight matrices (22).
* This work was supported in part by grants (to H. S.) from the Association Française contre les Myopathies (AFM), the Muscular Dystrophy Association of America, NIAMS (National Institutes of Health (NIH)), NINDS (NIH), and the Department of Veterans Affairs. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18
The sequence specificity of transcription factors and the recruitment of these proteins into multisubunit complexes dictate intriguing patterns of evolutionary sequence conservation in noncoding DNA. In recent years, a number of World Wide Web-based tools have been developed to facilitate the comparative analysis of DNA sequence emerging from the various genome projects. Although the publicly available murine genome sequence currently consists of highly fragmented files in draft form, islands of synteny with the human genome allow reconstruction of orthologous gene alignments that expedite the prospective identification of transcriptional control regions. As currently implemented, "percentage identity plot" analysis provides a low resolution snapshot of large genomic regions (23), whereas "Bayesian phylogenetic footprinting" allows a detailed look at regions of up to 2 kb (24). The latter approach has been applied to other muscle-specific promoters with the finding that the majority of relevant transcription factor binding sites are concentrated within the small percentage of total sequence that has the highest probability of alignment (i.e. meets the most stringent criteria for conservation). The complementary use of transcription factor matrices to screen phylogenetically conserved DNA sequence for high affinity binding sites has the capacity to yield a detailed annotation of putative cis-acting regulatory domains. Hypotheses generated on the basis of this annotation can provide a useful basis for the experimental study of gene regulation in vitro and in vivo.
In the present study, we use draft murine genome sequence from several fragmented draft files, initially identified on the basis of BLAST (25) comparisons with the human genome, to reconstruct a working model for the complete murine skeletal MyHC locus. Each of the six human MyHC promoter regions is analyzed in detail using bioinformatic tools, as exemplified by those cited above, to yield annotated maps reflecting both the level of sequence conservation and the precise location of matches to transcription factor matrices. Regions of strong conservation between the orthologous human and murine MyHC promoters occur in patches scattered throughout at least 2 kb immediately upstream of the transcriptional start sites. Paralogous promoter comparisons reveal surprisingly strong homology within 300 bp of the transcriptional start sites, providing evidence for a conserved proximal "core" promoter domain. Transfection of myotubes in vitro and electroporation of myofibers in vivo are used to further dissect the relevant domains of four of these promoters with the findings that 1) isolated proximal promoter regions can retain developmental stage specificity despite homology to a core consensus sequence and 2) an E box-rich distal region of the human embryonic MyHC promoter dramatically amplifies the in vitro activity of all of the proximal core promoters tested in cis. These findings suggest several testable hypotheses regarding potential short and long range interactions between transcriptional control elements within the skeletal MyHC tandem gene locus.

EXPERIMENTAL PROCEDURES
Computer Analysis-Deduced full-length cDNA sequences derived from an annotated version of the human skeletal myosin heavy chain locus (2) were used to query the high throughput genomic DNA sequence databases accessible through the National Center for Biotechnology Information Homepage (www.ncbi.nlm.nih.gov/BLAST/). Species-specific repeats in the reconstructed human locus were identified using the REPEATMASKER algorithm as implemented on the Washington University server (ftp.genome.washington.edu/cgi-bin/Repeat-Masker). As soon as draft sequences appeared online to confer complete coverage of all six of the genes at the murine locus, we ordered the draft sequence fragments to correspond to the order of the orthologous human sequences. A reconstructed murine sequence draft spanning the entire locus was annotated, initially by using GENSCAN as implemented on the MIT server (genes.mit.edu/GENSCAN) and later by cross-checking with the human annotated sequence to verify exon splice site predictions. We next used sequence from our full-length cDNA clones to prospectively identify the transcriptional start sites for four of the genes at the human locus. These predictions were confirmed and extended to include all six genes by using the Promoter Prediction by Neural Network program as implemented on the Berkeley Drosophila Genome server (www.fruitfly.org/seq_tools/promoter.html). Orthologous regions were easily identified in the annotated murine MyHC locus. Regions upstream of the predicted human and murine transcriptional start sites were analyzed with the several programs in the MacVector package (version 6.0.1; Oxford Molecular Group) as implemented on the Mac OS platform and as cited throughout. 
In addition, these regions were analyzed using the TESS and Bayesian Phylogenetic Footprint Homepage programs, as implemented on the University of Pennsylvania (www.cbil.upenn.edu/cgi-bin/tess/tess?SEA-FR-Query) and Wadsworth servers (bayesweb.wadsworth.org/cgi-bin/bayes_align12.pl), respectively. The outputs from the latter analyses were merged by importing the results into Microsoft Word files and reproducing the color coding of the nucleotide sequence to identify phylogenetically conserved regions. The complete files are available on request from the corresponding author. Only the general results have been integrated into the figures in this paper.
Plasmids-All of the promoter constructs were made in the pGL3-basic plasmid, a promoterless vector encoding firefly luciferase (Promega). The different promoters were cloned using the NheI and BglII sites of the pGL3-basic plasmid. For this purpose, all of the promoters were amplified by 30 cycles of PCR using the proofreading Pwo polymerase (Roche Molecular Biochemicals) according to the manufacturer's protocol. All primers were designed to have a 58°C annealing temperature and to contain the NheI site at the 5′ end or the BglII site at the 3′ end, as shown in Table I, to allow unidirectional cloning. The plasmid CMV-βGal (CLONTECH) was used for the quantification of transfection efficiency. The hybrid plasmids between the embryonic and the IIb promoters (KNX, KNF) were obtained by the ligation of two PCR fragments through a SalI site introduced into the primer (Table I). The KOF insert was obtained by recombinant PCR to join both independent PCR fragments.
Cell Culture and Transfection-The C2C12 cell line proliferates at low density in Dulbecco's modified Eagle's medium plus 20% fetal calf serum. Differentiation into myotubes was induced by replacing the proliferation medium with Dulbecco's modified Eagle's medium + 0.5% fetal calf serum + ITS (insulin, transferrin, and selenium from Roche Molecular Biochemicals at 5 μg/ml each). These cells were transfected using FuGene6 (Roche Molecular Biochemicals). Briefly, at day 1, myoblasts were plated at 200,000 cells/well in a six-well plate. At day 2, 2 h prior to the transfection, the medium was removed and replaced by differentiation medium. Transfection was achieved by incubating 6 μl of FuGene6 in 94 μl of Dulbecco's modified Eagle's medium at room temperature. After 5 min, 2 μg of the luciferase reporter plasmid and 1 μg of CMV-βGal plasmid were added, and the mixture was incubated for another 15 min. The transfection mixture (100 μl) was added to the cells drop by drop. Cells were collected at day 5 for luciferase assay using 200 μl of 1× lysis buffer per well according to the luciferase assay system (Promega). We used 10 μl of cell extracts for the determination of the luciferase activity. The β-galactosidase activity was measured using the chemiluminescent reporter gene assay system for the detection of β-galactosidase (Galacto-Light, Tropix PE Biosystems) with 10 μl of cell extracts.
Rat in Vivo Electroporation-Electroporation of rat tibialis anterior muscles was performed as per Ref. 30 and in accordance with University of Pennsylvania IACUC protocols. Liquid nitrogen-frozen muscles were pulverized to powder and solubilized with 1× lysis buffer (luciferase assay system; Promega). After a 5-min incubation and one freeze/thaw cycle, samples were centrifuged, and 20 μl of supernatant were processed for luciferase and β-galactosidase assays.
FIG. 1. ... shown. When initially downloaded, each file was composed of ~15 independent contigs in random order and orientation. B and C, raised relief dot matrix analysis of the regions 8 kb upstream of the human and murine embryonic (B) and IIb (C) transcriptional start sites. Dot height ranges from a low of 40% to a high of 100% sequence similarity using the default DNA data base matrix within the MacVector software package. ChemE-1, -2, and -3 correspond to the conserved domains defined under "Results." Open boxes at the bottom of the plot correspond to repeat domains as annotated in the Human Genome Project. D, standard dot matrix analyses covering 2 kb upstream from each of the following promoters, in the following order: IIa, IId/x, IIb, perinatal. Not shown is the slightly lower homology to the embryonic promoter, truncated because of the computational matrix size limitations for this version of the MacVector program. There was no detectable homology with the extraocular promoter.
Normalization of Reporter Gene Activity and Statistical Analysis-The luciferase results are expressed as the ratio between the luciferase activity and the β-galactosidase activity. Furthermore, for each series of transfections, we transfected the promoterless plasmid pGL3-basic for the quantification of background. The normalized luciferase results obtained with the pGL3-basic plasmid are defined as 1. All of the other activities of the series are then expressed as the x-fold trans-activation above the background. Data sets on levels of marker gene expression were analyzed (JMP; SAS Institute, Inc., Cary, NC) to determine arithmetic means, S.D. values, S.E. values of the means, and the probabilities associated with one-way analysis of variance (ANOVA), Student's t test, and the Tukey-Kramer HSD test (i.e. the probability of incorrectly rejecting the null hypothesis about the difference between means among multiple groups within a data set). The Tukey-Kramer HSD provides a conservative calculation of statistical significance in the analysis of intergroup comparisons that minimizes the risk of type I error by increasing the quantile multiplied into the S.E. values to create the least significant difference.
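The normalization just described can be illustrated with a minimal sketch. It assumes paired luciferase and β-galactosidase readings per transfection; the function names and all numerical readings are hypothetical and are not taken from the data sets reported here. The one-way ANOVA F statistic is computed by hand for illustration only and is not a substitute for the JMP analysis.

```python
from math import sqrt
from statistics import mean, stdev


def normalize(luc, bgal):
    """Luciferase/beta-galactosidase ratio for each paired reading."""
    return [l / b for l, b in zip(luc, bgal)]


def fold_induction(sample_ratios, background_ratios):
    """Express ratios as x-fold over the mean pGL3-basic ratio (defined as 1)."""
    background = mean(background_ratios)
    return [r / background for r in sample_ratios]


def summarize(values):
    """Arithmetic mean, S.D., and S.E. of the mean."""
    sd = stdev(values)
    return mean(values), sd, sd / sqrt(len(values))


def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical readings (arbitrary light units), three transfections each.
basic = normalize([100.0, 120.0, 80.0], [1000.0, 1000.0, 1000.0])
test = normalize([5000.0, 6000.0, 4000.0], [1000.0, 1000.0, 1000.0])
basic_fold = fold_induction(basic, basic)
test_fold = fold_induction(test, basic)

print(summarize(test_fold))
print(one_way_anova_f([basic_fold, test_fold]))
```

Dividing by the mean background ratio, rather than by each background replicate, keeps the replicate-to-replicate scatter of the promoterless control visible in the statistics.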

RESULTS
To provide a basis for phylogenetic analysis of putative transcriptional control domains within the skeletal MyHC locus, we periodically screened high throughput DNA sequence databases until a representation of the entire murine locus could be identified in draft form. Three overlapping bacterial artificial chromosomes spanned the entire locus as depicted in Fig. 1A, although the data contained in the relevant accession files were highly fragmented at the time of this analysis. Based on synteny with the human locus, we reconstructed a single file that contained a properly ordered and annotated draft of the murine locus. As expected, coding regions exhibit greater than 90% sequence homology, without gaps. Although the orthologous genes occupy approximately the same relative proportion of the length of the entire locus, the murine locus is only 80% of the size of the human locus. From the standpoint of transcriptional regulation, the regions upstream of the first coding exons of the orthologous and paralogous genes are of greatest interest. We identified transcriptional start sites by comparison with full-length cDNA sequences or by analyzing homologous intergenic regions for TBP binding site and initiator consensus sequences using Promoter Prediction by Neural Network (NNPP; Lawrence Berkeley National Laboratory).
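The synteny-based reconstruction of the murine draft can be sketched as follows. This is an illustrative outline only, not the procedure actually used: it assumes that each contig has already been assigned, from its best BLAST hit, a start coordinate and strand on the human locus. The `Contig` fields, the spacer of N characters, and the toy sequences are all hypothetical.

```python
from dataclasses import dataclass

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string (N is left as N)."""
    return seq.translate(COMPLEMENT)[::-1]


@dataclass
class Contig:
    name: str
    seq: str
    human_start: int  # start of the best hit on the human locus (assumed given)
    strand: str       # "+" or "-" relative to the human sequence


def order_and_orient(contigs, spacer="N" * 10):
    """Sort contigs by human coordinate, flip minus-strand contigs, and join."""
    ordered = sorted(contigs, key=lambda c: c.human_start)
    pieces = [
        c.seq if c.strand == "+" else reverse_complement(c.seq)
        for c in ordered
    ]
    return [c.name for c in ordered], spacer.join(pieces)


# Toy contigs in random order and orientation.
contigs = [
    Contig("c3", "GGGTTT", 9000, "+"),
    Contig("c1", "AAACCC", 100, "+"),
    Contig("c2", "TTTGGA", 4000, "-"),
]
names, draft = order_and_orient(contigs, spacer="NN")
print(names)
print(draft)
```

The spacer marks joins between contigs so that downstream alignment does not treat adjacent fragments as contiguous sequence.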
An initial screen of 8 kb upstream of each transcriptional start site revealed a pattern exemplified by the orthologous embryonic and IIb promoters in which there is strong conservation interrupted by blocks of nonconserved sequence, which in most cases precisely coincide with the boundaries of species-specific repeats as annotated previously using the REPEATMASKER algorithm. There are blocks of >100-bp DNA sequence that attain 60% conservation (note peak height in the raised relief image) out to at least 8 kb upstream of both the human embryonic and IIb MyHC transcriptional start sites (Fig. 1, B and C, embryonic MyHC upstream 8 × 8 kb and IIb MyHC upstream 8 × 8 kb). The conserved regions identified as conserved human embryonic MyHC element (ChemE)-1, -2, and -3 are defined below. With the sole exception of the extraocular MyHC promoter pair, the pattern of conservation within 2 kb of the start sites provided striking evidence for selection against mutation by insertion or deletion. This pattern suggests that the order of and/or the physical distance between transcription factor binding sites spread over at least 2 kb is crucial to the proper developmental regulation of MyHC gene activation/repression. To address the possibility of a shared core promoter structure, we extended the pairwise comparison to include paralogous human MyHC promoters (Fig. 1D). Homology among promoter paralogs is uniquely detected within the most proximal 300 bp, as shown for the IIa, IId/x, IIb, and perinatal genes.
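The principle behind a windowed dot matrix comparison can be sketched in a few lines. This is a simplified illustration that scores ungapped windows by percent identity; it is not the MacVector implementation, which uses a DNA scoring matrix. The window size, the threshold, and the two short sequences are arbitrary assumptions chosen for the example.

```python
def window_identity(a: str, b: str) -> float:
    """Percent identity between two equal-length, ungapped windows."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def dot_matrix(seq1: str, seq2: str, window: int = 10, min_identity: float = 40.0):
    """Return {(i, j): percent identity} for window pairs at or above threshold."""
    dots = {}
    for i in range(len(seq1) - window + 1):
        w1 = seq1[i:i + window]
        for j in range(len(seq2) - window + 1):
            pid = window_identity(w1, seq2[j:j + window])
            if pid >= min_identity:
                dots[(i, j)] = pid
    return dots


# Toy "orthologous" sequences differing at two positions.
human = "ACGTACGTTAGCCTAGGATC"
mouse = "ACGTACGATAGCCTAGGTTC"
dots = dot_matrix(human, mouse, window=10, min_identity=80.0)

# Dots on the main diagonal trace the ungapped alignment.
for (i, j), pid in sorted(dots.items()):
    if i == j:
        print(i, j, pid)
```

An insertion or deletion in one species would shift the conserved run onto a different diagonal, which is why an unbroken diagonal over 2 kb is evidence against indels.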
A limitation of this analysis for the identification of transcriptional regulatory elements is its dependence on user-defined parameters such as window size, DNA scoring matrix, and gap penalty. These biases have been largely eliminated through the recent introduction of a Bayesian statistical method based on a Gibbs sampling algorithm (24,26). The relationship between level of sequence conservation and distance upstream from the transcriptional start sites is quantitatively displayed in Bayesian "phylogenetic footprint" plots for five of the promoter pairs (Fig. 2). The z axis in each of the three-dimensional plots indicates the local probability of mouse-human DNA sequence alignment based on a modification of the Bayes block aligner algorithm (24). The overall pattern of gene conservation as a function of distance from the transcriptional start site is similar for the perinatal, IIa, IId/x, and IIb genes. Relative to these four genes, the embryonic gene has a comparatively larger block of conserved sequence in the interval -600 to -1000 (ChemE-2) and a smaller block of conserved sequence in the interval +1 to -300 (ChemE-1). In addition, 4.1 kb upstream of the human embryonic gene transcriptional start site is a 0.5-kb block of extraordinarily conserved noncoding sequence (ChemE-3), a pattern unique among the six genes at this locus. Finally, sequence conservation among the paralogous genes is typified by the human IIa versus IId/x pairing, with four distinct spikes of homology within 300 bp of the transcriptional start.
FIG. 4. Activity of MyHC promoters in C2C12 myoblasts and myotubes. Luciferase activity was measured in myoblasts maintained in growth medium or in myotubes induced to differentiate for 72 h. The pGL3-basic plasmid is a promoterless plasmid that represents the background. The SV40 enhancer is the plasmid pGL3-control (Promega). Results are expressed as x-fold induction over background. The luciferase activities were normalized with the total amount of protein.
A position weight matrix-based search for binding sites in a transcription factor data base was filtered to identify candidate protein-DNA interactions in the phylogenetically conserved portion of each sequence pair, as shown schematically above the footprints (Fig. 2, lettering above individual plots). Three groups of transcription factors known to participate in the regulation of muscle-specific gene expression are identified as having high concentrations of sites within these conserved domains: MRFs, MEF-2, and SRF. Additional transcription factors identified by this approach include AP-2, AhR, CCAAT enhancer-binding protein, GR, HEB, Oct-1, RREB-1, Sp1, STAT4, TEF-1, and YY1. As previously demonstrated for other muscle-specific promoters, the concentration of predicted sites is higher in the conserved than in the divergent sequence domains (24). The overall distribution of these sites is not appreciably conserved among the paralogous promoters except in the most proximal 300 bp of sequence.
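The logic of scanning with a position weight matrix and then retaining only sites within conserved sequence can be sketched as follows. The count matrix is invented for illustration and merely resembles an E box (CANNTG); it is not a Transfac or TESS matrix. The threshold, the test sequence, and the conserved interval are likewise assumptions, and only the forward strand is scanned.

```python
from math import log2

# Toy count matrix for an E box-like motif (CANNTG); counts are invented.
COUNTS = {
    "A": [1, 17, 5, 5, 1, 1],
    "C": [17, 1, 5, 5, 1, 1],
    "G": [1, 1, 5, 5, 1, 17],
    "T": [1, 1, 5, 5, 17, 1],
}


def log_odds(counts, background=0.25):
    """Convert per-position counts to log2 odds against a uniform background."""
    width = len(counts["A"])
    pwm = []
    for pos in range(width):
        total = sum(counts[b][pos] for b in "ACGT")
        pwm.append({b: log2((counts[b][pos] / total) / background) for b in "ACGT"})
    return pwm


def scan(seq, pwm, threshold):
    """Return (position, site, score) for forward-strand windows at or above threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        score = sum(pwm[k][seq[i + k]] for k in range(width))
        if score >= threshold:
            hits.append((i, seq[i:i + width], score))
    return hits


def in_conserved(hits, conserved, width):
    """Keep hits lying entirely inside a conserved half-open interval [start, end)."""
    return [
        h for h in hits
        if any(start <= h[0] and h[0] + width <= end for start, end in conserved)
    ]


pwm = log_odds(COUNTS)
seq = "TTCAGCTGAAAACACCTGTT"
hits = scan(seq, pwm, threshold=6.0)
kept = in_conserved(hits, conserved=[(0, 10)], width=len(pwm))
print(hits)
print(kept)
```

Filtering after scanning, rather than scanning only conserved blocks, also allows the density of predicted sites in conserved and divergent sequence to be compared.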
We used ClustalW (27) to align all of the proximal promoters for each orthologous gene pair (Fig. 3). This alignment draws attention to four core domains and provides a consensus sequence for these and the intervening regions. A computational analysis of each of the individual sequences and the consensus sequence identifies several "high affinity" matches to the Transfac matrices for MEF-2, MyoD, Oct-1, CCAAT enhancer-binding protein, and TBP, the first of which is outlined on the sequence alignment. In this alignment, the embryonic promoter pair is distinguished from the other four pairs in having the highest affinity MEF-2 site (28) nearest to the transcriptional start site (proximal domain 2 as compared with proximal domain 3 for the others). In fact, this site achieves the highest binding score of any predicted MEF-2 site within 2 kb of an MyHC promoter. Three MEF-2 consensus sequences are identified in the IIa, IId/x, and perinatal promoters. MEF-2 binding is predicted from position weight matrices for the corresponding regions of the other two promoters, but the binding strength falls below the default limits at the most distal site for the embryonic promoter and at the most proximal site for the IIb promoter.
Recapitulating the bioinformatic analysis, five of the six tandemly linked skeletal MyHC promoters exhibit similar overall structures in which focally conserved sequences, rich in putative muscle-specific transcription factor binding sites, are distributed throughout 1.5-2 kilobases upstream of a broadly conserved core promoter domain of 250-300 bp. We chose a convenient nomenclature to describe the three larger blocks of conserved sequence in the embryonic promoter: conserved human embryonic MyHC elements 1, 2, and 3 (ChemE-1, -2, and -3). These observations provide a basis for interpreting the results of an experimental dissection of proximal and distal promoter elements in vitro and in vivo. As an initial screen for transcriptional activity in vitro, we used large constructs based on cosmid-cloned fragments of three human skeletal MyHC genes (Fig. 4). As expected, the SV40 promoter drove high level marker gene expression in C2C12 myoblasts and myotubes (612- and 146-fold increases over background, respectively). Cis-acting sequences spanning at least 4 kb 5′ of the embryonic, IId/x, and IIb MyHC gene translational start sites drove marker gene expression in myoblasts at levels barely above background. In myotubes, however, the embryonic MyHC gene construct uniquely increased marker gene expression to a level 160-fold higher than background, exceeding even that of the SV40 control. In contrast, marker gene expression was the same in myoblasts and myotubes for the IId/x and IIb MyHC gene constructs (i.e. just above background). The surprisingly high marker gene expression from the embryonic construct suggested that the 30-kb fragment encompassed most of the elements necessary for developmental stage-specific transcriptional activation. The relative inactivity of the adult MyHC promoters in this assay recapitulates the activities of the endogenous genes (29) and may reflect the absence of developmental stage-specific activators, the presence of repressors, or a combination thereof.
FIG. 5. The proximal 300 bp of each promoter contains regulatory elements that contribute to transcriptional control of the MyHC genes. The region between -230 and -174 appears to be especially critical for the activity of the embryonic core promoter in myotubes. The conservative Tukey-Kramer HSD test for all data pairs reveals statistically significant differences (p < 0.05) for all intergroup comparisons between two groups of constructs: KNS or 2G-4B versus KOK, KOL, KNT, KOS, or KOR. After induction of differentiation for 72 h, C2C12 myotube extracts were assayed for the luciferase activity. The numbers above all of the proximal promoters are bp from the start site. Normalized results are expressed as x-fold induction over the activity of the pGL3-basic promoter (defined as 1). The results are means of at least three independent experiments. The error bars represent the S.E. By one-way ANOVA, F = 31.7 and p < 0.0001.
To address the possibility that repressive elements distal to the broadly conserved core domains were responsible for the inactivation of the adult MyHC promoters, we assayed marker gene expression in C2C12 myoblasts and myotubes using plasmids containing the proximal 300 bp of the embryonic (ChemE-1), perinatal, IId/x, and IIb promoters. The results of this series of experiments (Fig. 5) again showed strong activation in myotubes of the proximal 300-bp portion of the embryonic MyHC promoter and very low (near background) activity of the nonembryonic promoters. Further truncation of the embryonic promoter resulted in an abrupt loss of activity between -231 and -174 bp. The -174 bp construct is predicted to retain the high affinity MEF-2 site between -140 and -170, but lose a predicted YY1 site between -174 and -200 and a STAT4 site between -210 and -230. There is an apparent 15-fold reduction in the activity of the -300 bp embryonic MyHC promoter construct relative to the 30-kb construct. These results suggest that the extraordinary strength of the intact embryonic MyHC promoter reflects the contribution of cis-acting elements both within and beyond the most proximal 300 bp and that the relative inactivity in myotubes of the larger nonembryonic MyHC promoter constructs at least partially reflects the structure of the most proximal 300 bp.
To further localize the critical elements in transcriptional control of the embryonic MyHC gene, we compared the activities of five additional truncation constructs in vitro. As shown in Fig. 6, there is progressive and dramatic loss of activity with truncation below −2 kb to −500 and −230 bp. The 2-kb construct has the highest marker gene expression in this assay, slightly exceeding that for both the 6.2-kb fragment ending at the transcriptional start site (the only construct containing ChemE-3) and the 4-kb fragment ending at the translational start site (i.e. including introns 1 and 2). These results suggest that putative transcription factor binding sites identified in phylogenetically conserved sequences between −230 bp and −2 kb (within and surrounding ChemE-2) greatly contribute to the overall activity of the endogenous promoter during terminal differentiation. Several candidates are shown in Fig. 2, notably the cluster of E boxes between −700 and −1500. The E boxes have potentially important differences based on both their sequence similarity to consensus sequences identified in other promoters and on their proximity to other sites identified through our bioinformatic analysis. The most proximal two of the six E boxes in this interval are predicted to bind the myogenic helix-loop-helix factors with highest affinity, whereas the one at −860 is adjacent to a predicted MEF-2 site.
FIG. 6. Sequences present in the interval from −2 kb to −500 bp appear to be critical, since the decrement in activity across this deletion is highly significant (p < 0.05) when these data sets are compared by the Tukey-Kramer HSD test. In contrast, p > 0.05 for the pairwise comparisons of data sets for the plasmids 2 kb or greater in length (the first three shown), suggesting that sequences in the first two introns and the ChemE-3 are dispensable for transcriptional up-regulation during terminal differentiation. Reporter constructs with 3′ termini at the beginning of exon 1 (Ex1) or exon 3 (Ex3) were used to assess the regulatory role of the first two introns. C2C12 cells were transfected as described under "Experimental Procedures." For myoblast activities, the C2C12 myoblasts were maintained in growth conditions for 48 h after the transfection. The results are means of at least three independent experiments. Normalized results are expressed as x-fold induction over the activity of the pGL3-basic promoter (defined as 1). By one-way ANOVA, F = 7.43 and p = 0.0001.
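The normalization and the one-way ANOVA cited in the figure legends can be sketched as follows. This is a minimal illustration, not the authors' analysis pipeline: the function names are hypothetical, and it assumes that each luciferase reading is first divided by the β-galactosidase reading from the same extract and then by the corresponding ratio for pGL3-basic, so that pGL3-basic equals 1. All numbers below are invented placeholders, not data from this study.

```python
from statistics import mean

def fold_induction(luc, bgal, basic_luc, basic_bgal):
    """Luciferase normalized to beta-galactosidase, relative to pGL3-basic (= 1).

    Assumption: the normalization is a simple ratio of ratios.
    """
    return (luc / bgal) / (basic_luc / basic_bgal)

def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA over a list of replicate lists."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Placeholder readings (arbitrary units), for illustration only.
example_fold = fold_induction(luc=200.0, bgal=4.0, basic_luc=10.0, basic_bgal=2.0)
example_f = one_way_anova_f([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
```

Converting F to a p value requires the F distribution with (k − 1, N − k) degrees of freedom; in practice a statistics package such as SciPy (`scipy.stats.f_oneway`) would return both values directly.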
To study the role of this phylogenetically conserved region in the muscle-specific activation of this promoter, we constructed a series of internal deletion mutants as shown in Fig. 7. Marker gene expression in a series of C2C12 transfections was only slightly reduced when regions distal to −1.5 kb and between −230 and −700 bp were deleted. The KLX construct retains all six of the E boxes cited above, which we arbitrarily labeled E1 through E6 in order from distal to proximal. When the 800-bp fragment containing these E boxes was further divided into subfragments containing E1-E4 and E3-E6, it was possible to identify E3-E6 as the most important of these cis-acting regions for transcription in C2C12 myotubes. This 512-bp subfragment (−699 to −1211) wholly contains and experimentally defines the boundaries of ChemE-2.
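The coordinates quoted above can be checked with a few lines of arithmetic. The sketch below uses the six E box positions listed in the Fig. 7 legend and the ChemE-2 boundaries given in the text; the helper name is hypothetical, and each E box is treated as a single coordinate, which is a simplification for a 6-bp motif.

```python
# E box positions (bp relative to the transcriptional start site) from the
# Fig. 7 legend, labeled E1..E6 from distal to proximal as in the text.
E_BOXES = {"E1": -1460, "E2": -1382, "E3": -1151,
           "E4": -861, "E5": -785, "E6": -733}

def boxes_within(distal, proximal, boxes=E_BOXES):
    """Return the labels of E boxes positioned within [distal, proximal]."""
    return [name for name, pos in boxes.items() if distal <= pos <= proximal]

CHEME2 = (-1211, -699)                 # subfragment boundaries from the text
cheme2_length = CHEME2[1] - CHEME2[0]  # 512 bp
cheme2_boxes = boxes_within(*CHEME2)   # E3 through E6
```

The result agrees with the text: the interval spans 512 bp and contains E3 through E6, whereas E1 and E2 lie distal to it.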
To further characterize the pattern of transcriptional activation from the ChemE-2 subfragment, we tested its ability to synergize with heterologous minimal promoters (Fig. 8). The SV40 minimal promoter, activated in cis by a wide range of enhancers, was minimally affected by juxtaposition to the ChemE-2 subfragment in C2C12 myotube transfections. In contrast, all of the MyHC proximal promoters previously tested were further activated by ChemE-2. The relative magnitude of this cis-activation was stronger for the IId/x and perinatal promoters than for the embryonic promoter (37.3-, 31.8-, and 14.1-fold, respectively), and was similar for the IIb (12.3-fold) and embryonic promoters. Nevertheless, the absolute activity of the ChemE-2 subfragment fused to the embryonic minimal promoter was highest by almost 10-fold in this assay. We interpret this result to indicate that the overall developmental stage- and fiber type-specific activity of each of the endogenous MyHC promoters reflects the input of multiple DNA-transcription factor interactions in both the highly conserved proximal promoter and the isogene-specific distal promoter regions. Strong activation by distal elements can partially override weak activation or repression from more proximal elements.
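The distinction between relative cis-activation and absolute activity can be made concrete with a small calculation. In this sketch the fold values are those reported above, but the basal activities are hypothetical placeholders chosen only to show how a promoter with the largest fold activation can still have the lower absolute output; they are not measurements from this study.

```python
# (basal activity, fold activation by ChemE-2).
# Fold values are from the text; basal activities are HYPOTHETICAL units.
constructs = {
    "embryonic": (100.0, 14.1),
    "IId/x": (4.0, 37.3),
}

# Absolute activity with ChemE-2 = basal activity x fold activation.
absolute = {name: basal * fold for name, (basal, fold) in constructs.items()}

largest_fold = max(constructs, key=lambda name: constructs[name][1])
largest_absolute = max(absolute, key=absolute.get)
```

With these placeholder basal values, the IId/x construct shows the larger fold activation while the embryonic construct retains the higher absolute activity, which is the pattern described in the text.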
FIG. 7. A highly conserved E box-rich region defining the ChemE-2 subfragment contains positive cis-acting elements. The ChemE-2 element, an experimentally defined region between −699 and −1211 (open box), is able to strongly activate transcription when associated with the proximal promoter (KLX) in myotubes. The deletion of the E boxes E6 and E5 demonstrates their relative importance, since only the results obtained with the KLW plasmid (E6-E3) are statistically different from the others (p < 0.05 by Tukey-Kramer HSD). The numbers in circles represent the six core consensus E boxes (CANNTG) found at −1460, −1382, −1151, −861, −785, and −733, respectively. Cell extracts were treated for luciferase and β-galactosidase as described under "Experimental Procedures." The results are expressed as means ± S.E. Normalized results are expressed as x-fold induction over the activity of the pGL3-basic promoter (defined as 1). By one-way ANOVA, F = 7.79 and p = 0.0002.
To further dissect the proximal promoter regions, we took advantage of the low level homology between the embryonic and IIb promoters in the construction of chimeric reporter plasmids (Fig. 9). The ChemE-2 subfragment was used in juxtaposition to a series of progressively truncated embryonic proximal promoter fragments with the finding that these serial deletions were associated with the progressive loss of promoter activity. When the deleted regions were replaced by the homologous domains from the IIb promoter, there was neither further loss nor compensatory restoration of activity. The absence of developmental stage-specific transcriptional repression associated with the IIb-for-embryonic chimeric substitutions suggests the complete absence of inhibitory DNA-protein interactions in this region.
In other words, relative to the embryonic deletion constructs, insertional replacement with IIb sequences had essentially no effect on activity, such that the IIb DNA served essentially as neutral spacer sequence in the context of the C2C12 transcriptional milieu. The noninterchangeability of homologous domains from the IIb and embryonic promoters reveals the potential complexity of activating DNA-protein and protein-protein interactions in the proximal promoters.
In vivo, the adult isoforms are highly expressed in the fast twitch muscles of the rodent hind limb. As a simple test of the activity of these promoter constructs in vivo, we used an electroporation protocol to partially augment the efficiency of somatic gene transfer. While less efficient than the use of recombinant vectors that allow binding of the myocyte surface by viral capsid proteins, the plasmid electroporation approach (30) avoids the complicating use of viral DNA sequences in cis with the MyHC promoter sequences. The results of this series of experiments are shown in Fig. 10. We tested the embryonic and IIb constructs with and without the ChemE-2 subfragment at 4 and 7 days, and without ChemE-2 at 21 days with the expectations that (a) localized regeneration after electrical injury (31) would initially activate both of the endogenous genes (embryonic and IIb), (b) activation of the endogenous embryonic MyHC gene would be transient as regeneration subsided, and (c) augmentation from the ChemE-2 subfragment might be required to achieve detectable signal above background. This experiment showed that the embryonic and IIb core promoters were comparably active in vivo and that they were both up-regulated by the ChemE-2 subfragment (although to a lesser extent than seen in vitro). The comparative activity of the IIb promoter constructs at 4, 7, and 21 days probably reflects the role of trans-acting factors present in vivo but absent from the C2C12 system in vitro. The trends observed also suggest that any repression attributable to elements within the ChemE-2 subfragment is outweighed by activation from elements within the IIb core domain. At 21 days, however, the embryonic and IIb core promoters showed minimal and maximal activity, respectively, a result that contrasts sharply with the initial relative activity of these constructs in vitro but parallels that of the endogenous genes in vivo.
Thus, the core promoter constructs qualitatively recapitulate the expression patterns of the cognate genes, providing tools for use in a detailed dissection of the comparative roles of trans-activation and trans-repression during development and regeneration.
FIG. 8. The relative levels of expression observed between the IIb, IIx, and perinatal proximal promoters in the presence of the ChemE-2 are not statistically different from one another (p > 0.05), although there is a consistent upward trend when each of these is associated with the ChemE-2 (ratios shown to the right of paired bars on the graph). C2C12 cell extracts were prepared 72 h after induction of differentiation. Cell extracts were assayed for luciferase and β-galactosidase as described under "Experimental Procedures." The results are expressed as means ± S.E. Normalized results are expressed as x-fold induction over the activity of the pGL3-basic promoter (defined as 1). By one-way ANOVA, F = 9.78 and p < 0.0001.
DISCUSSION
Despite the critical importance of the sarcomeric MyHC in skeletal muscle function and the extraordinary abundance of this protein, the size and complexity of the human skeletal MyHC locus have delayed detailed investigation of the transcriptional control of its six tandemly linked genes. During the preparation of this manuscript, we became aware of a report on some recent in vitro studies using plasmid constructs based on the murine IIa, IId/x, and IIb proximal promoters (32). Our data on early developmental isoform switching complement these studies (which instead focused on possible mechanisms for fiber-type regulation of adult MyHC isoform switching). We have used newly available draft DNA sequence for the human and murine genomes as a basis for modeling important transcription factor binding sites in the cis-acting regulatory sequences.
Analysis of paralogous skeletal MyHC genes reveals striking homology in the most proximal 300 bp of the embryonic, IIa, IId/x, IIb, and perinatal promoters, centered around three predicted MEF-2 sites. Our previous analysis of the evolutionary relationships among these genes, based wholly on comparisons of coding sequence for the "molecular yardstick-like" rod domain of the protein, suggests that this MEF-2 site-rich "core" structure of the promoter dates back to a common ancestral gene that existed during early vertebrate evolution, before the human-avian species split (2). Consistent with this prediction is the result of a phylogenetic footprint analysis of the human IId/x promoter with that from the most closely related chicken MyHC gene (Fig. 2H). The high probability regions match the core consensus at all three of the MEF-2 sites identified in the IIa, IId/x, and perinatal promoters. In contrast, neither the embryonic nor the IId/x promoter contains E boxes within 300 bp of the transcriptional start site, and CANNTG consensus sequences identified in the other promoters are located at sporadic positions (IIa −52; IIb −61 and −215; perinatal −80 and −118; chicken "J02714" −266).
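Locating core E box consensus sequences (CANNTG) in an upstream region is a simple pattern search, sketched below. This is an illustration only and not the bioinformatic pipeline used in this study: the function name is hypothetical, the coordinate convention (the last base of the input is position −1, and each hit is reported at its first base) is an assumption, and the demonstration sequence is invented rather than taken from any MyHC promoter. Because the reverse complement of CANNTG is again CANNTG, scanning one strand is sufficient.

```python
import re

# Core E box consensus from the text: CANNTG, where N is any base.
# The lookahead allows overlapping matches to be reported.
E_BOX = re.compile(r"(?=(CA[ACGT]{2}TG))")

def find_e_boxes(upstream_seq):
    """Return (position, motif) pairs for each CANNTG match.

    Assumption: upstream_seq is the sense strand ending immediately 5' of
    the transcriptional start site, so its last base is position -1.
    """
    seq = upstream_seq.upper()
    n = len(seq)
    return [(m.start() - n, m.group(1)) for m in E_BOX.finditer(seq)]

# Invented 30-nt sequence for illustration (NOT MyHC promoter sequence).
demo_seq = "GGCAGCTGTTTTTTCACGTGAAAAAAAAAA"
demo_hits = find_e_boxes(demo_seq)
```

A consensus match of this kind identifies only candidate sites; as noted in the text, flanking sequence and proximity to other sites affect the predicted binding affinity.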
FIG. 9. Cis-activation exerted by the ChemE-2 subfragment is mediated by the proximal embryonic promoter. Deletions were made in the proximal portion of the embryonic promoter in order to define the possible interactions between ChemE-2 and the putative MEF-2 sites. Deleted regions in the plasmids KNK, KNJ, and KNI have been replaced with homologous portions of the IIb promoter in the plasmids KOF, KNX, and KNF, respectively. The activities obtained with KNK versus KOF, KNJ versus KNX, and KNI versus KNF or KNB are not statistically different (p > 0.05). All other pairwise comparisons are statistically significant (p < 0.05) by the Student's t test, although the p value rises to slightly above 0.05 for some of these comparisons (e.g. KNE with KOF) when the more conservative Tukey-Kramer HSD test is used. We conclude that the replacement of deleted regions of the embryonic promoter with homologous portions of the IIb promoter failed to restore lost activity. The numbers in circles represent the six potential E boxes found at −1460, −1382, −1151, −861, −785, and −733, respectively. Cell extracts were assayed for luciferase and β-galactosidase as described under "Experimental Procedures." The results are expressed as means ± S.E. Normalized results are expressed as x-fold induction over the activity of the pGL3-basic promoter (defined as 1). By one-way ANOVA, F = 13.0 and p < 0.0001.
Our experimental findings with the set of core promoter constructs revealed that reporter expression roughly paralleled the relative levels of the corresponding isomyosins in myotubes in vitro (29) and in myofibers in vivo (33). In a detailed analysis of chimeric substitutions, none of the IIb sequence motifs identified in the sequence alignments could substitute for the homologous portions of the embryonic promoter without a significant loss of activity in vitro.
This implies that the divergent sequences intervening between the consensus-matching portions of the core contribute to the process of promoter switching during development. Some of the divergent sequence falls within the three MEF-2 sites predicted in this region, raising the possibility that subtle differences in MEF-2 binding affinity could play a role in this process.
Mutation of the E box at −61 bp in a series of murine IIb promoter constructs has been shown to eliminate MyoD binding at this site in vitro and to inactivate transcription from these constructs in vivo (34). Our observation that the proximal embryonic promoter, in the complete absence of E box consensus CANNTG sites, drives much stronger marker gene expression than the IIb promoter in C2C12 cells supports the possibility that MRFs have acquired a modulatory role in the proximal promoters. Evidence from a wide range of studies has implicated tripartite interactions involving DNA, MEF-2, and additional transcription factors in the control of other muscle-specific promoters during terminal differentiation (35). To the extent that the proximal promoters control the timing of skeletal MyHC gene activation and repression, our bioinformatic analysis suggests that the underlying control mechanisms evolved around an ancestral "footprint" with three members of the MEF-2 family recruiting other transcription factors into multisubunit complexes. In the embryonic promoter, the most proximal predicted MEF-2 site uniquely overlaps with a predicted SRF site, raising the possibility that interaction or competition between these transcription factors plays a role in developmental regulation. Documented competition between SRF and YY1 at CArG boxes within the skeletal (36) and cardiac (37) α-actin promoters provides a precedent for regulation by this mechanism.
FIG. 10. Activities of the embryonic and IIb promoters following electroporation in vivo. The ChemE-2 subfragment dramatically increases the activity of both core promoters at 4 days postelectroporation. By 7 days, the effect is less dramatic because the relative activity of both core promoters is higher. At 21 days, the IIb core promoter remains high, while the embryonic promoter is down-regulated as the muscle completes its regenerative response. Plasmids (50 μg) were injected into rat tibialis anterior (TA) immediately prior to electroporation. Electroporation was performed as described under "Experimental Procedures." After 4, 7, and 21 days, the rats were euthanized and necropsied to expose the limb muscles. The TA muscles were frozen in liquid nitrogen, crushed into powder, and treated for luciferase and β-galactosidase as described under "Experimental Procedures." The results are expressed as means ± S.E. p < 0.05 by Student's t test for pairwise comparisons between the following data sets: Emb./ChemE2 day 4 versus Emb. core day 4, Emb./ChemE2 day 4 versus IIb core day 4, Emb./ChemE2 day 4 versus Emb. core day 7, and Emb./ChemE2 day 4 versus Emb. core day 21.
The sequences between −300 bp and −2 kb upstream of each of the transcriptional start sites for the five human genes listed above reveal a multifocal pattern of conserved sequence motifs and evidence for strong selection against insertions or deletions (indels). This suggests that intermediate range positional relationships among groups of transcription factor binding sites have been preserved by functional constraint. Our finding of an E box-rich region devoid of high affinity MEF-2 sites (ChemE-2) that dramatically increased the expression of the core proximal promoter suggests that MRF and MEF-2 transcription factors, respectively, recruit other factors to physically distinct regions of the promoter to set the stage for synergistic interaction across a distance of several hundred base pairs.
Despite evidence for evolutionary selection against indels between these two regions of the promoter, our deletional constructs preserve the full activity of the unaltered gene. The ability of the ChemE-2 subfragment to cis-activate the nonembryonic proximal promoters raises the possibility that this element functions in the native locus as either an enhancer or, by analogy to the β-globin locus, as a locus control region (38, 39). In either case, our in vivo studies provide preliminary evidence that this region lacks the ability to repress the nonembryonic core promoters during postembryonic stages of development. Our largest embryonic gene constructs encompassed an even more distal upstream region (ChemE-3) that attracted considerable interest because of the inordinate size of the conserved sequence domain and the local prevalence of muscle-specific transcription factor binding sites. In the C2C12 system, the transcriptional activity of the embryonic constructs was maintained despite deletion of this conserved block of sequence. Further studies will address a range of possible roles for both of these putative regulatory elements.
The skeletal myosin heavy chains are among the most abundant proteins in the bodies of vertebrates, and they must accumulate with other contractile proteins in precisely controlled stoichiometric ratios. This gene dosage effect is best documented in Drosophila, where compensatory changes in actin gene number improve muscle structure by restoring proper stoichiometry (40). We speculate that throughout vertebrate evolution, the potential advantage of MyHC isoform diversification through gene duplication was offset by the attendant mutational disruption of established patterns of developmental regulation. The fact that these six genes remain tandemly linked suggests that their proper regulation requires proximity to shared transcriptional control elements (39). A teleological advantage of centralized control, exerted across the entire locus by single copy regulatory elements, is that the individual promoters are free to compete for varying proportions of the total MyHC mRNA mix without disrupting overall protein stoichiometry. Alternatively, if all of the genes are autonomously regulated by their own upstream regulatory sequences, down-regulation of one gene must be precisely offset by up-regulation of another. Although it is premature to suggest detailed models for long range interaction between transcription factors binding the endogenous MyHC genes, two candidate control regions are identified in the current studies (ChemE-2 and ChemE-3). One of these is incorporated in the only minimal length skeletal MyHC promoter construct to show robust autonomous activity in vitro, perhaps providing a cassette powerful enough for functional expression of recombinant MyHCs. Further studies will establish whether the other promoter regions identified at this locus can autonomously drive heterologous protein expression in adult tissues at levels sufficient for assembly of the contractile apparatus, a sine qua non for isolation of the entire control region.
In summary, the foregoing studies illustrate the complementarity of readily available tools for in vitro, in vivo, and in silico analysis, as applied to the dissection of the transcriptional regulatory circuitry controlling the developmental program for the six-membered human skeletal MyHC gene family.