Pervasive Initiation and 3’ End Formation of Poxvirus Post-Replicative RNAs*

Poxviruses are large DNA viruses that replicate within the cytoplasm and encode a complete transcription system, including a multisubunit RNA polymerase, stage-specific transcription factors, capping and methylating enzymes, and a poly(A) polymerase. Expression of the more than 200 open reading frames by vaccinia virus, the prototype poxvirus, is temporally regulated: early mRNAs are synthesized immediately after infection, whereas intermediate and late mRNAs are synthesized following genome replication. The postreplicative transcripts are heterogeneous in length and overlap the entire genome, which poses obstacles for high-resolution mapping. We used tag-based methods in conjunction with high-throughput cDNA sequencing to determine the precise 5'-capped and 3'-polyadenylated ends of postreplicative RNAs. Polymerase slippage during initiation of intermediate and late RNA synthesis results in a 5'-poly(A) leader that allowed the unambiguous identification of true transcription start sites. Ninety RNA start sites were located just upstream of intermediate and late open reading frames, but many more appeared anomalous, occurring within coding and non-coding regions, indicating pervasive transcription initiation. We confirmed the presence of functional promoter sequences upstream of representative anomalous start sites and demonstrated that alternative start sites within open reading frames could generate truncated isoforms of proteins. In an analogous manner, poly(A) sequences allowed accurate mapping of the numerous 3'-ends of postreplicative RNAs, which were preceded by a pyrimidine-rich sequence in the DNA coding strand. The distribution of postreplicative promoter sequences throughout the genome provides enormous transcriptional complexity, and the large number of previously unmapped RNAs may have novel functions.

factors for expression and replication of their large genomes. A thorough understanding of poxvirus transcription will provide insights into their complex biology and may improve the design of vaccines and therapeutics based on poxvirus expression vectors.
The transcriptional program of vaccinia virus (VACV), the prototype poxvirus, is divided into early, intermediate and late stages that are regulated in a cascade fashion by sequential synthesis of specific transcription factors, which operate in conjunction with a viral multisubunit eukaryotic-like DNA-dependent RNA polymerase (8). The early transcription system, including RNA polymerase, transcription factors, capping enzyme and poly(A) polymerase, is packaged within infectious virus particles enabling synthesis of translatable mRNAs almost immediately after entry of the virus core into the cytoplasm. The de novo synthesized products of the early genes include the enzymes and factors needed for transcription of the intermediate genes on newly replicated viral DNA templates. The products of the intermediate genes include factors that enable transcription of late genes, the products of which include the early transcription factors that are packaged in assembling virions for use in the next round of infection. Because the expression of intermediate and late genes is dependent on viral DNA replication, they are referred to as postreplicative (PR) genes. Recent experiments suggest that there are 118 early and 93 PR genes with the latter divided into 53 intermediate, 38 late and 2 for which expression was not detectable by Western blotting (5,9). Early, intermediate and late stage promoters have distinctive sequences, although some have dual specificities (9-14). A common, essential feature of intermediate and late promoters is three or more consecutive A residues in the coding strand that constitute the RNA start site. Frequently, the third A overlaps the ATG of the translation initiation codon. During transcription initiation, the RNA polymerase slips resulting in a novel 5' poly(A) leader (15-19). In addition, all classes of VACV mRNAs have 5' caps and 3' poly(A) tails synthesized by viral enzymes, but RNA splicing does not occur.
Though poxvirus genomes are much smaller than those of their hosts, they nevertheless pose daunting challenges for transcriptome analysis by conventional methods. The close spacing of open reading frames (ORFs) maximizes coding capacity, but makes it difficult to map individual transcripts. This is compounded by the extensive overlapping of transcripts, especially at the PR stage, which is evidenced by length heterogeneity of RNAs (20,21), occurrence of complementary double-stranded RNAs (22,23), and RNA coverage of virtually every nucleotide of the VACV genome (5,9). The heterogeneous RNA length has been attributed mainly to imprecise termination of PR mRNAs and to a lesser extent to 3' poly(A) tails and 5' poly(A) leaders. Another factor contributing to RNA heterogeneity, suggested by our previous analysis of VACV early transcripts (14), is pervasive transcription initiation. To evaluate these possibilities, it is necessary to analyze RNA 5' and 3' ends with high precision and to distinguish 5' ends derived by processing from those representing true transcription start sites (TSSs).
For the present study, we adapted tag-based RNA-seq methods (14,24,25) to analyze the 5' and 3' ends of VACV post-replicative mRNAs. Importantly, the poly(A) leaders allowed us to distinguish true TSSs from 5' ends produced by cleavage and demonstrate pervasive initiation of VACV post-replicative RNAs. In addition, the polyadenylated 3' ends were extremely heterogeneous. These data provide new insights into poxvirus transcription, imply the existence of previously unrecognized gene products and suggest a strategy for expressing new genes from endogenous or horizontally acquired DNA sequences.

EXPERIMENTAL PROCEDURES
Cells and viruses -HeLa S3 cells were cultured in minimum essential medium with spinner modification (Quality Biological, Gaithersburg, MD) and 5% equine serum (HyClone, Logan, UT) at 37 °C with 5% CO2. The Western Reserve (WR) strain of VACV (American Type Culture Collection VR-1354) was used. The preparation, sucrose-gradient purification, and titration of virus have been described previously (26,27).
Generation of the Illumina CAGE (Cap Analysis of Gene Expression) library, sequencing and data processing were described previously (14). Briefly, total RNA was treated with bacterial alkaline phosphatase (BAP) and then with tobacco acid pyrophosphatase (TAP). BAP treatment removes the 5' phosphate of broken transcripts. TAP treatment removes the cap of intact mRNA, leaving a phosphate at the 5' end for attaching a linker. The TAP-treated RNA was ligated to an RNA oligonucleotide (5′-ACACUCUUUCCCUACACGACGCUCUUCCGAUCUGG-3′) using T4 RNA ligase (TaKaRa, Clontech Laboratories, Mountain View, CA). The Dynabeads mRNA kit (Invitrogen) was used to isolate polyadenylated RNA. Random hexamer primers with a C at their 3' ends, attached to the Illumina adaptor (5′-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCTNNNNNNC-3′), were then used with reverse transcriptase (SuperScript II; Invitrogen) to synthesize first-strand cDNA. The cDNA was amplified by PCR using primers 5′-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT-3′ and 5′-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT-3′. PCR products of 200 to 500 bp were gel purified for Illumina GA IIx sequencing. More than 2.6 and 5.6 million reads mapping to the VACV genome were generated from the 4 h and 8 h samples, respectively. Reads with at least 5 As at their 5' ends were extracted for further analysis. Reads with 5' poly(A) leaders that mapped to positions with 6 or more consecutive As in the genome were excluded from analysis.
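The two read filters described above (at least 5 As at the 5' end, and exclusion of sites where the genome itself encodes 6 or more consecutive As) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names, the 0-based top-strand coordinates, and the assumption that each read already carries a single mapped position are all hypothetical.

```python
MIN_LEADER_AS = 5    # reads need at least this many 5' As (threshold from the text)
MAX_GENOMIC_AS = 6   # genomic A runs of this length or more are excluded (from the text)


def leader_length(read: str) -> int:
    """Number of consecutive A residues at the 5' end of a read."""
    return len(read) - len(read.lstrip("A"))


def genomic_a_run(genome: str, pos: int) -> int:
    """Length of the run of As in `genome` containing 0-based index `pos`.

    Returns 0 if the base at `pos` is not an A. Top strand only (assumption).
    """
    if genome[pos] != "A":
        return 0
    start = pos
    while start > 0 and genome[start - 1] == "A":
        start -= 1
    end = pos
    while end + 1 < len(genome) and genome[end + 1] == "A":
        end += 1
    return end - start + 1


def keep_read(read: str, genome: str, mapped_pos: int) -> bool:
    """Apply both filters: long enough leader, and no long genomic A run."""
    return (leader_length(read) >= MIN_LEADER_AS
            and genomic_a_run(genome, mapped_pos) < MAX_GENOMIC_AS)


if __name__ == "__main__":
    toy_genome = "GCTTAAATGCCGAAAAAAAGT"  # toy sequence, not VACV
    print(keep_read("AAAAAAAATGCC", toy_genome, 5))   # leader of 8 As at an AAA site
    print(keep_read("AAAAAAAAGT", toy_genome, 14))    # site lies in a 7-A genomic run
```

The second filter matters because a read starting with many As at a long genomic A tract could be fully templated, so its leader would not be evidence of slippage.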
Generation of a SOLiD 3′-end library was described previously (14). Briefly, double-stranded cDNA was prepared from polyadenylated mRNA using the SuperScript double-stranded cDNA synthesis kit as suggested by the manufacturer (Invitrogen). An anchored oligo(dT)16 primer with a GsuI restriction site [5′-biotin-TEG-GAGAGAGAGACTGGAG(T)16VN-3′, where V is A, C, or G and N is any nucleotide] was used for first-strand synthesis. After steps previously described (14), the resulting DNA was amplified using primers 5′-CCACTACGCCTCCGCTTTCCTCTCTATG-3′ and 5′-CTGCCCCGGGTTCCTCATTCT-3′ for SOLiD sequencing.
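To make the degenerate VN anchor concrete, the short sketch below expands it into the individual primer sequences it represents. The helper name and its defaults are illustrative only; the 5′ biotin-TEG modification is not modeled.

```python
from itertools import product

V_BASES = "ACG"    # V: any base except T, so priming starts at the poly(A) junction
N_BASES = "ACGT"   # N: any base


def anchored_primers(core: str = "GAGAGAGAGACTGGAG", n_t: int = 16) -> list[str]:
    """Expand core + (T)n + VN into every concrete primer sequence."""
    return [f"{core}{'T' * n_t}{v}{n}" for v, n in product(V_BASES, N_BASES)]


if __name__ == "__main__":
    for primer in anchored_primers():
        print(primer)
```

Because V excludes T, the primer cannot anneal in register further inside the poly(A) tail, which is what pins the cDNA start to the first non-A nucleotide upstream of the tail.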
The sequencing data were deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under accession number SRA054540.
Consensus motif generation and search. The consensus motif of VACV promoters was generated using the Multiple Expectation maximization for Motif Elicitation (MEME) program (28).
Transfection and Western blotting. Transfections were performed using Lipofectamine 2000 (Invitrogen) according to the manufacturer's instructions. Cell lysates for Western blotting were prepared with SDS-polyacrylamide gel loading buffer and resolved by SDS-polyacrylamide gel electrophoresis. The proteins were transferred to nitrocellulose membranes, which were then blocked with 5% nonfat milk in Tris-buffered saline containing 0.05% Tween 20 for 1 h, followed by incubation with primary antibody for 1 h at room temperature. The membrane was washed three times with Tris-buffered saline with 0.05% Tween 20, and then incubated with secondary antibody conjugated with horseradish peroxidase or IRDye 800 antibody for 1 h at room temperature. The membrane was washed with Tris-buffered saline with 0.05% Tween 20, and developed using chemiluminescent substrate (Pierce, Rockford, IL) or a Li-Cor Odyssey infrared imager (Li-Cor Biosciences, Lincoln, NE).

RESULTS
Strategy for identification of TSSs -CAGE methods have been used to identify the capped 5' ends of eukaryotic and viral mRNAs. A modified version of this procedure is depicted in Fig. 1. Total RNA was first treated to remove the 5' phosphate on broken RNAs and prevent their subsequent sequencing. Next, the capped ends of intact RNAs were hydrolyzed in order to generate new 5' phosphate ends that could be ligated to a specific oligoribonucleotide linker. Additional steps involved selection of polyadenylated RNAs, reverse transcription, PCR amplification and deep sequencing. Although the oligoribonucleotide linker serves as a marker of the RNA 5' ends, not all ends represent TSSs; other ends have been attributed to RNA cleavage and recapping or technical problems in the analysis (1,29-31). To surmount this ambiguity and identify true TSSs, we took advantage of a conserved feature of intermediate and late VACV transcription, the presence of three or more consecutive A residues in the non-template DNA strand at the RNA start site (16,32,33). Mutagenesis studies have shown that the AAA sequence is an essential element of intermediate and late promoters (10,12). Importantly, the propensity of the VACV RNA polymerase to slip on the AAA sequence during initiation of transcription results in a unique 5' poly(A) leader containing additional A residues not encoded in the genome (15-19), which we have used as a signature of a genuine TSS (Fig. 1). In theory, non-templated As could be added during the reverse transcription or PCR step prior to sequencing. However, slippage by reverse transcriptase was not found in a previous study (34), and a threshold of 8 consecutive Ts is needed for slippage by Taq polymerase and even then the frequency is very low (35). Although 5' poly(A) leaders had only been demonstrated for relatively few VACV mRNAs, the ubiquity of the AAA sequence preceding intermediate and late ORFs suggested that this is a universal feature.
Genome-wide mapping of TSSs -cDNA libraries were prepared as described in Fig. 1 (5). The 76-nt reads allowed us to sequence through poly(A) leader sequences of about 50 nt or less. The reads that mapped to the VACV genome were filtered by excluding those that did not have at least 5 consecutive A residues at their 5' ends (Fig. 1). We recovered 126,317 and 503,502 reads with 5 or more As at the 5' ends that mapped to 1,714 and 2,736 sites in the VACV genome from the 4 h and 8 h samples, respectively. A small number of reads that mapped to genomic regions with 6 or more consecutive As were excluded from further analysis. The 4 and 8 h TSSs were distributed throughout the length of the genome and on both DNA strands (Fig. 2). For comparison with early RNAs, of which relatively few were thought to have 5' poly(A) leaders (14,36,37), we reanalyzed the reads obtained at 2 h after infection (14). There was a total of 6,565 reads with 5 or more As at the TSSs and they mapped to 145 sites located mostly near the right end of the top strand and the left end of the bottom strand (Fig. 2).
The 195,000 bp, linear, double-stranded VACV WR genome contains approximately 200 ORFs and is organized with predominantly early genes near the two ends and intermediate and late genes, with intermixing of some early ones, in the central region. Translated ORFs are present on both strands, although there is little overlap and a tendency for transcription to occur toward the nearer end of the genome. TSS data from the 8 h sample are shown in Fig. 3. The lengths of the lines indicate the number of reads in hundreds. Lines pointing upward represent RNAs transcribed from the top DNA strand in the rightward direction, and lines pointing downward represent RNAs transcribed from the bottom DNA strand in the leftward direction. The red lines indicate 5' ends that map within 25 nt upstream of annotated ORFs and are therefore very likely to represent their TSSs, although further upstream TSSs may also be used. The paucity of PR genes near the two ends of the genome accounts for the very few TSSs that map close to ORFs in those regions. The blue lines include 5' ends that are further upstream of ORFs, next to pseudogenes, within ORFs and in noncoding regions and are collectively referred to as anomalous TSSs. There was a wide range in the number of read counts for individual TSSs close to annotated ORFs, as well as at anomalous sites.
Since the mRNAs represent steady state levels, the variation could reflect differences in RNA synthesis or stability. With regard to the latter, studies have suggested that the half-lives of total PR RNAs could be less than 15 min (38,39).
Analysis of TSSs upstream of PR ORFs -TSSs were documented upstream of 90 of the 93 previously annotated intermediate and late PR ORFs (Table 1). TSSs were identified within 25 nt upstream of the start codons of 75 PR ORFs and for 53 of these there were no additional nucleotides between the TSS and the translation initiation codon, underscoring the frequency of the promoter motif TAAATG. A TSS was between 25 and 100 nt upstream of the initiation methionines of 15 ORFs (Table 1). In several of the latter, there was an additional apparently silent AAA sequence between the TSS and the translation initiation codon. The distances between the 5' ends of the RNAs and translation initiation codons are shown in Fig. 4A and Table 1.
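As an illustration of how occurrences of the TAAATG motif could be located on both strands of a sequence, a minimal sketch follows. It is not the analysis used in this study: the function names are hypothetical, coordinates are 0-based, and bottom-strand hits are reported as the top-strand coordinate of the leftmost base of the match.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_motif(seq: str, motif: str = "TAAATG") -> list[int]:
    """0-based start positions of `motif` in `seq`, overlapping hits included."""
    hits = []
    i = seq.find(motif)
    while i != -1:
        hits.append(i)
        i = seq.find(motif, i + 1)
    return hits


def find_motif_both_strands(seq: str, motif: str = "TAAATG") -> dict[str, list[int]]:
    """Motif positions on the top ('+') and bottom ('-') strands.

    Bottom-strand positions are converted to top-strand coordinates of the
    leftmost base covered by the match.
    """
    n, m = len(seq), len(motif)
    top = find_motif(seq, motif)
    bottom = sorted(n - i - m for i in find_motif(reverse_complement(seq), motif))
    return {"+": top, "-": bottom}


if __name__ == "__main__":
    toy = "GGTAAATGCCCATTTACC"  # toy sequence with one motif on each strand
    print(find_motif_both_strands(toy))
```

A motif hit alone does not establish a TSS; in the text, a site counts only when reads with non-templated 5' As map to it.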
The number of As in the poly(A) leader varied even for those mapping to the same site. The longest poly(A) leader contained 51 As. However, poly(A) leaders with additional As would not be mapped because of the relatively short reads obtained by Illumina sequencing. The most frequent lengths of the poly(A) leaders in the transcripts of intermediate and late ORFs were 8 and 11 As, respectively (Fig. 4B). These numbers are in the range previously found by direct analysis of poly(A) leader lengths (18). However, the anomalous transcripts described below had shorter poly(A) leaders (Fig. 4B).
To estimate the frequency of slippage during transcription initiation, we determined the number of 8 h reads with fewer than 5 As that mapped to the AAA sequences of the 90 TSSs preceding PR ORFs. We found 6,230 reads with fewer than 5 As compared to 103,722 with 5 or more As. Thus, slippage occurred at nearly 95% of initiations at these sites. Although the presence of 5' poly(A) leaders had previously been demonstrated for relatively few mRNAs, the present study indicated that this is a general feature of PR transcripts.
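The slippage estimate follows directly from the two read counts given above. The small sketch below reproduces the arithmetic; the function name is ours, and a leader of 5 or more As is taken as the operational definition of slippage, as in the text.

```python
def slippage_fraction(reads_with_leader: int, reads_without_leader: int) -> float:
    """Fraction of reads at a set of TSSs that carry a 5' poly(A) leader."""
    total = reads_with_leader + reads_without_leader
    if total == 0:
        raise ValueError("no reads supplied")
    return reads_with_leader / total


if __name__ == "__main__":
    # Counts from the 8 h sample at the 90 TSSs preceding PR ORFs
    fraction = slippage_fraction(103_722, 6_230)
    print(f"{fraction:.1%}")  # about 94%, i.e. "nearly 95%"
```

Reads whose leader has 3 or 4 As may also reflect limited slippage, so this fraction is best read as a lower bound under the 5-A cutoff.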
At 4 and 8 h after infection, we also found TSSs of RNAs with 5' poly(A) leaders within 25 nt upstream of start codons of 40 genes that are expressed prior to DNA replication (Table 1), suggesting that slippage also occurs at early stages of gene expression or that these early promoters also contain functional intermediate or late promoter elements.
Analysis of TSSs in anomalous locations -The TSSs classified as anomalous could be grouped into several categories. For example, VACVWR ORFs 145-148 (Fig. 3) (9). We also demonstrated by Western blotting that the protein encoded by ORF 147, tagged by adding a 3xFlag coding sequence at the C-terminus of the ORF in the VACV genome, was expressed (data not shown). Fig. 3 shows many examples of TSSs within ORFs. The anomalous TSSs were more commonly mapped to the sense strand than the anti-sense strand. RNAs that are initiated within ORFs have the potential to be translated into shorter isoforms of proteins. We examined ORF 148 (encoding the A25 protein), one of the longest ORFs in the VACV WR genome, to test this possibility. Plasmids containing ORF 148 with a C-terminal Flag tag, with or without its natural promoter, were constructed (Fig. 5A) (12,40,41). Thus, protein expression from a transfected plasmid in the presence of AraC is indicative of an intermediate promoter. The expression of intermediate promoters in the absence of AraC can vary. However, expression exclusively in the absence of AraC is indicative of a late promoter. The cell extracts were analyzed by Western blotting with anti-Flag antibody. Four major bands were detected from cells transfected with the plasmid containing the promoter upstream of ORF 148 in the presence and absence of AraC (Fig. 5B). The few additional bands present in the absence of AraC might have resulted from protein cleavage under conditions of viral late protein synthesis. Importantly, without the upstream promoter sequence, the three lower bands were still detected, indicating the presence of internal promoters within the long ORF (Fig. 5B). The nearest TSSs that could initiate the mRNAs for these isoforms were at bp 138,301, 137,636 and 136,986, respectively, in the bottom strand of the VACV genome. The promoters were classified as intermediate based on expression in the presence of AraC.
Some TSSs were very close to the start codons of previously unannotated small ORFs. Four examples of such ORFs that are conserved in other orthopoxviruses were further analyzed (Fig. S1). The four ORFs, with a 3xFlag tag appended to the C-terminus and approximately 50 bp of upstream sequence, were inserted into plasmids. The plasmids were transfected into VACV-infected BS-C-1 cells and the lysates were analyzed by Western blotting. However, no proteins were detected by anti-Flag antibody (not shown). Since this result could be due to failure of transcription/translation or protein instability, we made additional plasmids with the 50 bp putative promoter sequences upstream of GFP. Two sets of plasmids were made: one set had the natural upstream sequences and the other had mutations predicted to inactivate the putative promoter (Fig. 6A). The cells were infected with VACV in the absence or presence of AraC in order to differentiate between intermediate and late promoters. GFP was detected by Western blotting and much higher expression was demonstrated in the absence of AraC for at least two of the promoters, indicating that they can be classified as late (Fig. 6B). Mutation of the AAA promoter element and surrounding nucleotides (Fig. 6A) abolished GFP expression in each case (Fig. 6C), confirming that functional promoters regulated the anomalous TSSs even though protein expression of the short ORFs had not been detected.
Comparison of sequences adjacent to TSSs -We were interested in comparing the sequences upstream of the TSSs at anomalous locations with those of the annotated VACVWR ORFs. In our previous expression profiling of VACV intermediate and late genes, we characterized their consensus promoter motifs by analysis of sequences upstream of the ORFs (9). The present determination of TSSs allowed a more precise and confident search for consensus sequences. Compared to the previous results, the refined late class consensus had a more conserved TG following TAAA and a more conserved T at 10 nt upstream of TAAA, while the intermediate class had a more AT-rich sequence (with a preference for As) ~15 to 25 nt upstream of TAAA (Fig. 7). The most frequent nucleotide preceding and following AAA was found to be T for both classes. A third class of ORFs appeared to be expressed strongly at both intermediate and late times and resembled the late motif with regard to T residues (Fig. 7). The promoter motif derived from the sequences upstream of the unannotated TSSs with more than 100 reads was similar to that of a mixture of intermediate and late promoters at annotated ORFs (Fig. 7).
Analysis of PR 3' polyadenylation sites (PASs) -The 3' ends of early and PR mRNAs appear to be formed by distinct mechanisms. Biochemical studies indicated that early mRNAs terminate within 50 nt after a UUUUUNU sequence and are subsequently polyadenylated (42,43), although recent sequence analyses indicate that this does not account for all 3' polyadenylated ends (14). In contrast, the VACV PR transcription apparatus does not recognize UUUUUNU as a termination sequence and the late mRNAs are heterogeneous in size (20,21). Although a few PR mRNAs have been shown to be processed by cleavage (44-47), the situation for the majority is unknown. A previously described scheme used to isolate the RNA sequences adjacent to the 3' poly(A) tails of early mRNAs (14) was adapted for RNAs made at 4 and 8 h after infection. A biotinylated poly(dT), dinucleotide-anchored primer with a GsuI type IIs restriction endonuclease site was used for reverse transcription. The anchor comprised all possible dinucleotide pairs, ensuring that reverse transcription did not initiate further within the poly(A) tail, which would have resulted in cDNAs too long for high-throughput sequence analysis. After second strand synthesis, the cDNA was cleaved with GsuI to leave two As marking the original poly(A) site (14). After additional steps, the short DNA fragments were sequenced. A total of 103,340 and 704,615 reads were mapped to 7,221 and 16,931 sites on the VACV genome at 4 and 8 h after infection, respectively. Thus, the number of PASs was about 5 times the number of TSSs containing 5' poly(A) leaders. Data from the 8 h sample were plotted relative to an ORF map in Fig. 8. The PASs were predominantly on the same strand as the nearby TSSs and appeared to occur in clusters.
We had previously found that coding strand sequences upstream of the PASs of early RNAs were pyrimidine-rich regardless of the presence of the UUUUUNU termination motif (14). A pyrimidine-rich sequence was also found in RNAs at 4 and 8 h after infection (Fig. 9). The presence of the UUUUUNU early termination sequence was infrequent even when the 64 most abundant PASs accounting for >0.1% of the total were examined.
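The two sequence features discussed here, pyrimidine content upstream of a PAS and presence of the early termination motif (UUUUUNU in RNA, TTTTTNT in the coding strand), can be checked with a short sketch such as the one below. It is illustrative only: the function names are hypothetical, only the top strand is handled, coordinates are 0-based, and the 50-nt search window is borrowed from the distance reported for early mRNA termination rather than from the analysis behind Fig. 9.

```python
import re

# UUUUUNU written in DNA coding-strand letters
EARLY_TERMINATION = re.compile(r"T{5}[ACGT]T")


def upstream_window(genome: str, pas: int, width: int) -> str:
    """Coding-strand bases immediately upstream of the 0-based PAS index."""
    return genome[max(0, pas - width):pas]


def pyrimidine_fraction(seq: str) -> float:
    """Fraction of C and T residues in `seq` (0.0 for an empty string)."""
    if not seq:
        return 0.0
    return sum(base in "CT" for base in seq) / len(seq)


def has_early_termination_motif(genome: str, pas: int, width: int = 50) -> bool:
    """True if TTTTTNT occurs within `width` nt upstream of the PAS."""
    return EARLY_TERMINATION.search(upstream_window(genome, pas, width)) is not None


if __name__ == "__main__":
    toy = "ACGT" * 5 + "TTTTTATGACCTCTTCCTCT" + "GGGG"  # toy sequence, not VACV
    pas = 40
    print(pyrimidine_fraction(upstream_window(toy, pas, 10)))
    print(has_early_termination_motif(toy, pas))
```

For bottom-strand PASs, the same functions could be applied to the reverse complement of the genome with the coordinate transformed accordingly.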

DISCUSSION
The presence of a 5' poly(A) leader is a feature of VACV PR transcripts that results from slippage of the RNA polymerase at the conserved AAA sequence of PR promoters. We took advantage of this trait to distinguish true TSSs from processed 5' ends. Using this criterion, TSSs were mapped just upstream of 90 of the 93 previously annotated intermediate and late ORFs. Approximately 95% of the transcripts from these sites had poly(A) leaders of at least 5 nucleotides. The most frequent lengths of the 5' poly(A) leaders of intermediate and late mRNAs were 8 and 11 As, respectively. As an A of the ATG translation initiation codon frequently overlaps the TAAAT promoter sequence, slippage provides a leader that could enhance translation (48,49). Poly(A) leaders occur in some yeast mRNAs, and the protein synthesis rate increases with the length of the poly(A) leader until about 12 residues and then decreases dramatically, possibly because of association of the poly(A) binding protein (50). The VACV poly(A) leaders formed at the anomalous TSSs were generally shorter, suggesting that they may not be optimal for translation.
Unexpectedly, the number of TSSs with poly(A) leaders greatly exceeded the number of annotated VACV ORFs and many mapped within or far upstream of coding regions or adjacent to pseudogenes. Indeed, about 25% of the genomic AAA sequences are used as TSSs. The anomalous TSSs occurred more often on the sense than the anti-sense strand. This could result from a tendency of the RNA polymerase to remain associated with the DNA after transcription termination and then reinitiate at a downstream site. This scenario could also account for the clustering of ORFs on the same strand.
The 3' ends of VACV RNAs were identified by sequencing non-templated A residues, which were added by the virus-encoded poly(A) polymerase (51,52). The VACV poly(A) polymerase has little sequence specificity except for multiple U residues 30 to 40 nt upstream of the 3' end (53). Polyadenylation of the 3' ends of VACV early mRNAs has been well studied and often occurs following termination, which is mediated by VACV-encoded enzymes working in conjunction with the RNA polymerase at 25 to 50 nt after a UUUUUNU sequence (42,54,55). Additional polyadenylated 3' ends occur immediately after a pyrimidine-rich sequence (14). Much less is known about 3' end formation of VACV PR mRNAs, which are extremely heterogeneous in length (20,21). The UUUUUNU termination sequence of early mRNAs is not recognized in PR transcripts and the present study is the first to provide a global analysis of their 3' ends. Our finding of a very large number of distinct polyadenylated 3' ends was consistent with the heterogeneity of PR RNAs and suggested the absence of a highly specific recognition motif. However, we did note the consistent occurrence of a pyrimidine-rich sequence in the coding strand immediately upstream of post-replicative PASs that was similar to that of early PASs that were not preceded by the UUUUUNU signal. The possible role of the VACV-encoded NPH-II in early mRNA termination was suggested by the generation of abnormally long early RNAs by NPH-II-defective virions (56). NPH-II has DNA- and RNA-dependent triphosphate phosphohydrolase and RNA helicase activities (57-59). Furthermore, NPH-II can efficiently unwind a DNA-RNA hybrid containing a pyrimidine-rich DNA tracking strand (60), which as shown here precedes PR PASs. Thus, NPH-II may be involved in termination of PR and early transcripts.
Several poxvirus PR mRNAs are processed at specific sites by cleavage (45-47,61,62). The VACV-encoded H5 protein or a protein associated with it has been implicated in processing of the F17R (VACWR056) transcript (46). We found many polyadenylated 3' ends (indicated with an asterisk in Fig. 8) that overlapped the described cleavage site downstream of the 056 ORF rather than a single predominant end (47).