Mutual Information Analysis Reveals Coevolving Residues in Tat That Compensate for Two Distinct Functions in HIV-1 Gene Expression

Background: It is not well understood how HIV-1 Tat performs complex functions despite its considerable sequence diversity.
Results: Individually nonconserved but functionally coevolving sites in Tat were identified.
Conclusion: In contrast to most coevolving sites, these sites are critical for two distinct functions regulating viral gene expression.
Significance: Balancing the two mechanisms is a strategy to maintain Tat function and may impact viral latency.

Viral genomes are continually subjected to mutations, and functionally deleterious ones can be rescued by reversion or additional mutations that restore fitness. The error-prone nature of HIV-1 replication has resulted in highly diverse viral sequences, and it is not clear how viral proteins such as Tat, which plays a critical role in viral gene expression and replication, retain their complex functions. Although several important amino acid positions in Tat are conserved, we hypothesized that it may also harbor functionally important residues that may not be individually conserved yet appear as correlated pairs, whose analysis could yield new mechanistic insights into Tat function and evolution. To identify such sites, we combined mutual information analysis and experimentation to identify coevolving positions and found that residues 35 and 39 are strongly correlated. Mutation of either residue of this pair into amino acids that appear in numerous viral isolates yields a defective virus; however, simultaneous introduction of both mutations into the heterologous Tat sequence restores gene expression close to wild-type Tat. Furthermore, in contrast to most coevolving protein residues that contribute to the same function, structural modeling and biochemical studies showed that these two residues contribute to two mechanistically distinct steps in gene expression: binding P-TEFb and promoting P-TEFb phosphorylation of the C-terminal domain in RNAPII. Moreover, Tat variants that mimic HIV-1 subtypes B or C at sites 35 and 39 have evolved orthogonal strengths of P-TEFb binding versus RNAPII phosphorylation, suggesting that subtypes have evolved alternate transcriptional strategies to achieve similar gene expression levels.

Genomes are continuously subjected to mutations that can in many cases undermine the structure and function of their encoded proteins. These processes may be especially important in rapidly evolving genomes, such as those of RNA viruses, which feature high rates of mutation and recombination during replication (1,2). For example, the retrovirus human immunodeficiency virus-1 (HIV-1) exhibits enormous sequence diversity, both within individual patients and among numerous subtypes and recombinants circulating throughout the world, which in many cases reduces viral fitness but can also promote its ability to adapt to different selective pressures applied by the host immune system and anti-retroviral drugs (2-6). In many cases, purifying selection purges mutations that reduce overall fitness; however, deleterious mutations at one residue may also be compensated for by mutations at other sites to maintain protein structure and function (7). Such compensating positions within a protein may conceal significant evolutionary and functional information, yet are not readily apparent from an analysis of protein sequence. We use an approach whereby discovering coevolving sites within the HIV-1 protein Tat (transactivator of transcription) allows us to elucidate the underlying functional mechanism that constrains these sites to certain residue pairs for optimal function of this important viral protein.
Tat, which displays complex interactions with several cellular and viral factors that are critical to activating gene expression from the viral promoter, is an interesting substrate for analysis of the effects of protein evolution on complex, multifaceted protein functions. Briefly, once HIV-1 infects a host cell and integrates into its genome, transcription factor binding sites within the viral promoter recruit host cellular factors and mediate a low, basal level of gene expression in which transcriptional elongation is inefficient and yields primarily abortive transcripts (8). However, a small number of full-length viral transcripts are produced and spliced to yield an mRNA species that encodes Tat. The few Tat molecules that are formed from this basal transcription bind to the positive transcription elongation factor b (P-TEFb), which consists of the cellular proteins cyclin T1 (CycT1) and cyclin-dependent kinase 9 (Cdk9) (9). The resulting Tat·P-TEFb complex then binds to an RNA hairpin, the transactivation response element, present at the 5′ end of all viral transcripts and is subsequently transferred to the pre-initiation complex, wherein Cdk9 phosphorylates the C-terminal domain (CTD) of RNA polymerase II (RNAPII) and thereby greatly increases RNAPII processivity (10,11). In a recently proposed, alternate mechanism, P-TEFb and Tat may be recruited to the viral promoter in an inactive form, and the newly synthesized transactivation response element may then bind to Tat and P-TEFb and displace the inhibitory 7SK snRNP to activate P-TEFb, which phosphorylates RNAPII (12). In either case, the increased RNAPII processivity greatly elevates viral gene expression and initiates a cascade of HIV-1 replication.
In addition to its interactions with RNAPII through P-TEFb, the multifunctional Tat interacts with numerous other host factors, and a balance among these various interactions and post-translational modifications is likely necessary for effective overall function (13-23). However, it is unclear how Tat, despite its considerable sequence diversity among HIV-1 isolates globally, is able to mediate these critical processes of coopting a series of host cellular mechanisms to orchestrate viral replication. Sequence alignment of a protein can enable the identification of single amino acids that are conserved and hence potentially important for specific functions, and a number of such sites have been identified within Tat (e.g. supplemental Fig. S1) (13,16). However, we hypothesized that correlated and coordinated amino acid changes may have played a role in diversifying the sequence of Tat while preserving its interactions with many host proteins and thus its overall function (13-23). Furthermore, sequence alignments can readily miss situations where neither of the two given residues is individually conserved, but instead where correlated pairs that make important contributions to protein structure and/or function appear together. Thus, bioinformatic and statistical approaches may help identify such sites and thereby gain greater molecular insights into a protein critical for HIV pathogenicity.
To address these hypotheses, we applied a statistical measure termed Mutual Information (MI) (24), one of several methods that can be used to identify correlated position pairs from multiple sequence alignments of proteins (25-27). Previously, such statistical measures have been applied to HIV-1 proteins, including the V3 loop of the env gene and the gag gene (28-30); however, these elegant computational analyses were not accompanied by experimental investigation. Similarly, such analysis has also been applied to other biological systems (31). Because the background noise in multiple sequence alignments makes it difficult for most statistical measures to predict correlated residues accurately, accompanying experimental validation of these sites is critical to identify structurally or functionally constrained residue pairs.
Here, we present a combined computational and experimental approach to identify and investigate coevolving residues in Tat. Sites 35 and 39 emerged in this analysis, and the functional importance of these coevolving residues was verified experimentally by introducing single point mutations in Tat. Whereas the single mutants proved nonfunctional, adding the second mutation restored viral gene expression. Surprisingly, despite their structural proximity, positions 35 and 39 appeared to be important for two distinct, underlying mechanisms, Tat binding to P-TEFb and Tat-mediated activation of P-TEFb to enable it to phosphorylate the CTD of RNAPII, and a combination of these two functions constrains the identities of these residues to certain pairs of amino acids. Extending this analysis indicates that the Tat proteins of HIV-1 subtypes B and C appear to have evolved compensatory strengths for different steps of Tat-mediated transactivation to achieve similar overall viral gene expression levels.

EXPERIMENTAL PROCEDURES
Cell Culture-Jurkat cells, used for infections, mRNA extraction, and ChIP assays, were cultured in RPMI 1640 (Mediatech). HEK 293T cells, used for viral packaging, were cultured in Iscove's DMEM (Mediatech). HeLa cells, used for co-immunoprecipitation experiments, and the HL3T1 cell line, used for the luciferase assay, were cultured in DMEM. All cell media were supplemented with 10% fetal bovine serum (FBS) and 100 units/ml of penicillin-streptomycin. All cells were grown at 37°C and 5% CO2.
Mutual Information Analysis-Matlab codes for calculation of raw and background MI scores to estimate the corrected MI scores will be made available upon request.
Statistical Analyses-All statistical significances were computed using one-way analysis of variance followed by the Tukey-Kramer multiple comparison method to compare different pairs. Details of plasmids, flow cytometry and cell sorting, transfections, mRNA extraction and RT-qPCR, co-immunoprecipitation and Western blots, chromatin immunoprecipitation, nuclease sensitivity assay, luciferase assay, and structure modeling can be found in supplemental "Experimental Procedures."
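As an illustration of the statistical procedure named above, the following minimal Python sketch computes a one-way analysis of variance F statistic and the Tukey-Kramer studentized range statistic q for each pair of groups. It is not the authors' code. The function names `one_way_anova` and `tukey_kramer_q` and the example numbers are hypothetical, and converting q into a p value requires the studentized range distribution, which is left to a statistics package.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within, ms_within) for a dict of name -> values."""
    k = len(groups)
    n_total = sum(len(v) for v in groups.values())
    grand_mean = sum(sum(v) for v in groups.values()) / n_total
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
    # Variation of individual values around their own group mean.
    ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
    df_between, df_within = k - 1, n_total - k
    ms_within = ss_within / df_within
    f_stat = (ss_between / df_between) / ms_within
    return f_stat, df_between, df_within, ms_within


def tukey_kramer_q(groups, ms_within):
    """Return the Tukey-Kramer q statistic for every pair of groups."""
    q = {}
    for (a, va), (b, vb) in combinations(groups.items(), 2):
        # The Kramer modification allows unequal group sizes.
        se = sqrt(ms_within / 2 * (1 / len(va) + 1 / len(vb)))
        q[(a, b)] = abs(mean(va) - mean(vb)) / se
    return q


# Made-up numbers for illustration only; these are not data from this study.
example = {"WT": [10.0, 12.0, 11.0], "Q35L": [2.0, 3.0, 1.0], "DM": [9.0, 10.0, 11.0]}
f_stat, df_b, df_w, ms_w = one_way_anova(example)
q_stats = tukey_kramer_q(example, ms_w)
```

Each q value would then be compared with the critical value of the studentized range distribution for the number of groups and the within-group degrees of freedom.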

RESULTS
Mutual Information Analysis Identifies Sites 35 and 39 in Tat as Coevolving-To identify correlated sites within Tat that are potentially important for maintaining Tat structure or function, we calculated MI between position pairs for 917 prealigned Tat sequences, from 9 viral subtypes and 14 recombinant forms, from the Los Alamos Sequence Database (hiv.lanl.gov). To encode a large amount of information within a relatively small genome, HIV-1 uses overlapping reading frames and alternative splicing. Because Tat shares overlapping reading frames with another viral protein, Rev, beyond amino acid 47, analysis was restricted to the first 46 amino acids of Tat within its activation domain (amino acids 1-48) to ensure that the sequence conservation and structural constraints in Rev did not introduce false positives in the analysis.
Within a multiple sequence alignment, MI predicts the likelihood of observing an amino acid at a particular site X in an alignment, given the identity of the amino acid at another site Y, and is computed by

$$\mathrm{MI}(X,Y) = \sum_{x=1}^{21} \sum_{y=1}^{21} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$

where x, y ∈ {1, 2, ..., 21} each correspond to one of the 20 amino acids or an alignment gap. p(x) and p(y) correspond to the probability of observing amino acid (or gap) x or y at sites X and Y, respectively, and p(x,y) is the corresponding joint probability of observing amino acids (or gaps) x and y at sites X and Y. Higher MI values indicate stronger correlation between two sites.
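The MI definition above can be sketched in a few lines of Python. This is a minimal illustration and not the Matlab code used in the study. The function names, the toy four-sequence alignment, and the choice of log base 2 are assumptions made for the example; probabilities are estimated as simple observed frequencies.

```python
from collections import Counter
from math import log


def column(alignment, pos):
    """Return the residues at 1-based alignment position `pos`."""
    return [seq[pos - 1] for seq in alignment]


def mutual_information(col_x, col_y, base=2.0):
    """MI between two alignment columns of equal length.

    Residues (or the gap symbol '-') are treated as discrete symbols, and
    p(x), p(y), and p(x,y) are estimated from their observed frequencies.
    """
    if len(col_x) != len(col_y):
        raise ValueError("columns must contain the same number of sequences")
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    # Only observed pairs contribute; unobserved pairs have p(x,y) = 0.
    for (x, y), c in count_xy.items():
        p_xy = c / n
        mi += p_xy * log(p_xy / ((count_x[x] / n) * (count_y[y] / n)), base)
    return mi


# Hypothetical toy alignment: columns 1 and 2 covary, column 3 is independent.
toy = ["QIA", "QIC", "LQA", "LQC"]
mi_covarying = mutual_information(column(toy, 1), column(toy, 2))
mi_independent = mutual_information(column(toy, 1), column(toy, 3))
```

In this toy case the covarying pair yields 1 bit of MI and the independent pair yields 0, which is the contrast the measure is designed to expose.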
Using a method described by Dunn et al. (33) to minimize background noise, three position pairs, (35,39), (31,35), and (31,39), were identified with MI scores higher than a threshold score, computed using a method described by Weigt et al. (Fig. 1, A and B, and supplemental Fig. S2). See supplemental "Results" for details (33,34). For positions 35 and 39, frequencies of the different amino acids, based on the Tat sequence database, show that a Leu at position 35 constrains position 39 primarily to a Gln. Similarly, a Gln at position 35 results in a majority of Tat sequences having an Ile, Leu, or Thr but not a Gln at position 39 (Fig. 2A). Based on the MI analysis, positions 31, 35, and 39 were chosen for experimental testing. In addition, a relatively conserved site that is coevolving with another site may in general have a low MI score due to certain mathematical constraints (see supplemental "Results" for details) (24). We therefore also experimentally tested several sites whose MI did not reach threshold. Furthermore, we tested several nonconserved sites within background MI as controls to validate the low predicted functional relationship between such sites. Experimentally tested positions are shown by solid black dots (Fig. 1B).
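One way to picture the background correction is the average product correction, in which the background MI expected for a pair of positions is estimated from the mean MI of each position with all others and is then subtracted from the raw score. The sketch below assumes that this is the form of the correction applied here; the exact procedure and the threshold calculation are described only in the supplemental material. The function name and the toy matrix are hypothetical.

```python
def average_product_correction(mi):
    """Subtract an average-product background from a symmetric raw MI matrix.

    `mi` is a list of lists with mi[i][j] == mi[j][i]; the diagonal is ignored.
    Background for pair (i, j) = mean MI of i * mean MI of j / overall mean MI.
    """
    n = len(mi)
    if n < 3:
        raise ValueError("need at least three positions to estimate a background")
    # Mean MI of each position with every other position.
    pos_mean = [sum(mi[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]
    # Mean MI over all distinct position pairs.
    n_pairs = n * (n - 1) / 2
    overall = sum(mi[i][j] for i in range(n) for j in range(i + 1, n)) / n_pairs
    corrected = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                background = pos_mean[i] * pos_mean[j] / overall if overall > 0 else 0.0
                corrected[i][j] = mi[i][j] - background
    return corrected


# Hypothetical raw MI values: one strongly covarying pair over a flat background.
raw = [
    [0.0, 0.5, 0.1, 0.1],
    [0.5, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]
corrected = average_product_correction(raw)
```

Pairs whose corrected score exceeds a chosen threshold would then be taken forward as candidate coevolving sites; the threshold itself is not reproduced here.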
Gene Expression Analysis of Coevolving Sites 35 and 39-If two sites functionally coevolve, then mutating either individually may impair biological activity, whereas mutating both together may rescue function. To test this hypothesis, amino acids in Tat from subtype B virus, broadly used in HIV-1 studies and referred to here as wild-type (WT) Tat (supplemental Fig. S1), were replaced with residues of other naturally occurring viral variants or subtypes (Fig. 2A). To test how sites with high MI scores predict Tat function, we studied the gene expression properties of different Tat mutants using a lentiviral vector model of HIV-1, in which the HIV-1 LTR drives expression of green fluorescent protein (GFP) and Tat, separated by an internal ribosome entry sequence (IRES) (LTR-GFP-IRES-Tat or LGIT) (Fig. 2B) (35). GFP expression from single integrations of LGIT in Jurkat cells was used to quantify LTR gene expression, and as previously observed the Tat positive feedback loop results in a bifurcated cell population with either very low or high levels of GFP expression, referred to as the Off and On populations, respectively (supplemental Fig. S3A) (35). Two metrics were used to quantify gene expression (36). First, the "Mean On Peak" indicates the average GFP level of cells above the background threshold of fluorescence (the On gate), a measure of the level of "closed loop" transactivation attained by a particular Tat mutant. Second, the "Percentage Infected but Off" is the fraction of infected cells that are in the Off population, but that can be stimulated to express Tat and GFP via the addition of TNF-α (a strong activator of the NF-κB pathway, which directly activates the LTR) and trichostatin A (an inhibitor of histone deacetylases that also activates HIV gene expression) (36). This metric is a measure of the inability of a particular Tat mutant to activate gene expression from the viral LTR.
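The two metrics can be illustrated with a short Python sketch operating on per-cell GFP intensities. This is a hedged reading of the definitions above, not the analysis pipeline used in the study: it assumes that the stimulated sample reveals all infected cells, that the same On gate is applied to both samples, and that the function names, gate value, and example intensities are hypothetical.

```python
from statistics import mean


def fraction_on(gfp, on_gate):
    """Fraction of cells whose GFP intensity lies above the On gate."""
    if not gfp:
        raise ValueError("empty sample")
    return sum(1 for value in gfp if value > on_gate) / len(gfp)


def mean_on_peak(gfp, on_gate):
    """Average GFP intensity of the cells above the On gate."""
    on_cells = [value for value in gfp if value > on_gate]
    if not on_cells:
        raise ValueError("no cells above the On gate")
    return mean(on_cells)


def percentage_infected_but_off(gfp_unstimulated, gfp_stimulated, on_gate):
    """Percentage of infected cells that are Off without stimulation.

    Assumption: cells above the gate after stimulation are taken as the
    infected population, so (infected - On) / infected is the Off share.
    """
    on = fraction_on(gfp_unstimulated, on_gate)
    infected = fraction_on(gfp_stimulated, on_gate)
    if infected == 0:
        raise ValueError("no infected cells detected after stimulation")
    return 100.0 * max(infected - on, 0.0) / infected


# Made-up intensities for illustration only; not data from this study.
unstimulated = [1, 2, 1, 50, 60, 1, 2, 1, 1, 1]
stimulated = [1, 40, 1, 50, 60, 45, 2, 1, 1, 1]
peak = mean_on_peak(unstimulated, on_gate=10)
off_pct = percentage_infected_but_off(unstimulated, stimulated, on_gate=10)
```

A Tat variant that transactivates poorly would show a low Mean On Peak together with a high Percentage Infected but Off.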
When introduced into WT Tat, single point mutations Q35L or I39Q, chosen from the matrix in Fig. 2A, yielded nonexpressing virus (Fig. 2C and supplemental Fig. S3A). The percentage of infected but off cells for these single mutants was ~70%, nearly three times higher than WT Tat, again indicating that these Tat mutants fail to activate gene expression from the viral LTR (Fig. 2D). Strikingly, however, the introduction of both mutations into the same Tat sequence (henceforth referred to as the double-mutant, DM) rescued Mean On Peak levels close to that of WT Tat, with a similar fraction of silenced cells (Fig. 2, C and D, and supplemental Fig. S3A). Similarly, transfection of the WT, Q35L, I39Q, and DM Tat into a HeLa cell line containing an HIV-1 LTR luciferase reporter showed analogous trends (supplemental Fig. S3B). From the 917 Tat sequences used in the MI analysis, 124 sequences shared the same Gln35-Ile39 residue pair as WT Tat, and 262 sequences shared the same Leu35-Gln39 residue pair as DM Tat, suggesting that both residue pairs are found in naturally occurring Tat sequences. Furthermore, to show that coevolution between sites 35 and 39 extends to another Tat subtype, we made single point mutants L35Q and Q39I of subtype C Tat. As previously observed for single mutations of Tat B, these mutants resulted in dramatic loss of gene expression and a 3-fold increase in Percentage Infected but Off cells (supplemental Fig. S4, C and D). However, gene expression was restored in the Tat C double mutant, suggesting that sites 35 and 39 alone compensate for one another (supplemental Fig. S4, C and D). Thus, evolutionarily only certain pairs of amino acids at sites 35 and 39 but not their intermediates are tolerated.
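Residue-pair tallies of the kind quoted above (for example, the number of sequences carrying Gln35-Ile39 versus Leu35-Gln39) can be obtained by counting pairs at two alignment columns. The following Python sketch shows the idea on a hypothetical toy alignment in which columns 1 and 2 stand in for Tat positions 35 and 39; the function names and sequences are illustrative assumptions only.

```python
from collections import Counter


def pair_counts(alignment, pos_a, pos_b):
    """Count residue pairs observed at two 1-based alignment positions."""
    return Counter((seq[pos_a - 1], seq[pos_b - 1]) for seq in alignment)


def conditional_frequencies(alignment, given_pos, given_res, target_pos):
    """Frequency of each residue at `target_pos` among sequences that carry
    `given_res` at `given_pos`. Returns an empty dict if none do."""
    subset = [seq[target_pos - 1] for seq in alignment if seq[given_pos - 1] == given_res]
    counts = Counter(subset)
    return {res: c / len(subset) for res, c in counts.items()} if subset else {}


# Hypothetical two-column alignment; real input would be the full Tat alignment.
toy = ["QI", "QI", "QL", "LQ", "LQ", "LQ", "QT"]
counts = pair_counts(toy, 1, 2)
given_leu = conditional_frequencies(toy, 1, "L", 2)
given_gln = conditional_frequencies(toy, 1, "Q", 2)
```

Reading the conditional frequencies row by row gives the same kind of matrix as Fig. 2A, in which the residue at one site restricts the residues seen at the other.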
In analysis of site 31 in Tat B, the mutation C31S also yielded a slight reduction in gene expression and a statistical increase in the Percentage of Infected but Off cells as compared with WT Tat (p < 0.01); however, coevolution between site 31 and sites 35 or 39 proved difficult to investigate, as mutation at either of the latter two exerted a dominant loss of gene expression. However, the interaction between sites 31, 35, and 39 can be observed statistically. A Gln at site 39 constrains sites 31 and 35 primarily to a Ser and Leu, respectively. Similarly, a Leu, Ile, or Thr at site 39 primarily restricts sites 31 and 35 to a Cys and Gln, respectively (supplemental Fig. S5). These interactions between sites 31, 35, and 39 are discussed from a structural perspective in additional detail in supplemental "Results." We next replaced Gln35 and Ile39 in WT Tat with other amino acids based on residues that either appear or do not appear at sites 35 or 39 in the Los Alamos Sequence Database. As anticipated, replacing Gln35 with similar polar residues that are not found (Q35N, Q35E, and Q35K) or rarely found (Q35T) in the database, and are thus not predicted to coevolve with residues at site 39, resulted in nonfunctional Tat proteins (supplemental Fig. S4, A and B). Similarly, replacing Ile39 in WT Tat with nonpolar residues (I39F), or polar residues that occupy similar side chain volumes (I39K) but are not found in the database, gave rise to nonexpressing viruses (supplemental Fig. S4, A and B). In contrast, naturally occurring Tat sequences with a Gln at site 35 have residues such as Leu and Val in addition to Ile at site 39 (Fig. 2A; Val is not shown in this matrix). I39L and I39V Tat mutants yielded similar Mean On Peak and Percentage Infected but Off levels as WT Tat (supplemental Fig. S4, A and B), validating the predictions from MI analysis. These mutations are discussed in greater detail from a structural and energetic perspective in supplemental "Results."
Compared with sites above the threshold MI value, mutation of positions 7, 12, 17, 19, 24, 29, 32, 40, or 42, which were predicted to be within the background region from the MI analysis, did not show any statistical difference in the level of gene expression or the Percentage of Infected but Off cells compared with WT Tat (p > 0.01, Fig. 2, C and D). These gene expression studies thus strongly support the in silico prediction that sites 35 and 39 strongly coevolve and are critical to ensure the primary function of Tat as a transactivator of the HIV-1 promoter (Fig. 2, C and D).
Finally, WT and DM Tat have different pairs of amino acids at sites 35 and 39 yet induce similar levels of gene expression when present at high levels (Fig. 2C); however, we also wanted to explore their gene activation behavior at lower concentrations. Under these conditions, previous studies have shown that the Tat positive feedback loop is subject to stochastic fluctuations in Tat, one of several factors that may play a role in viral reactivation from latency (35,36). Because DM Tat has amino acids Leu and Gln at sites 35 and 39, respectively, residues shared by a majority of subtype C Tats at these positions, we included subtype C Tat in these studies to determine the contribution of sites 35 and 39 to this behavior. As described previously, cells were infected with LGIT variants at low multiplicity of infection, and GFP+ cells were sorted by fluorescence activated cell sorting (FACS) 7 days post-infection after stimulation with TNF-α (36). The sorted cells were allowed to relax for 9 days, and GFP− cells infected with the LGIT vector, but not expressing GFP, were sorted (supplemental Fig. S6A). These silent proviruses were then monitored for GFP expression over the course of 12 days (supplemental Fig. S6B).

Functional Characterization of Coevolving Sites in HIV-1 Tat
MARCH 9, 2012 • VOLUME 287 • NUMBER 11

As anticipated, the functionally inactive Q35L and I39Q Tat variants showed very low levels of gene activation. However, the DM Tat partially restored gene expression to WT Tat levels, and very closely tracked the activation rate of Tat C (supplemental Fig. S6B), a result that suggests that residue pairs at sites 35 and 39 may be important determinants in setting the gene activation levels for different Tat variants.
Coevolving Sites 35 and 39 Impact Both P-TEFb Binding and Phosphorylation at the CTD of RNAPII-To identify potential molecular mechanisms that restore gene expression for the DM Tat, we reasoned that the compromised transactivation of either Tat single mutant may be due to disruption in its binding to one of numerous cellular factors necessary for efficient transactivation. For example, the activation domain of Tat (amino acids 1-48) has previously been shown to interact with P-TEFb, which mediates the critical phosphorylation of the CTD of RNAPII and thus the production of full-length viral transcripts (9,37). HeLa cells were transfected with plasmids to express FLAG-tagged Tat under the control of the human ubiquitin promoter (Ubiquitin-mCherry-IRES-Tat or UbChIT), and immunoprecipitates of Tat were probed for Cdk9 and CycT1 (Fig. 3, A and B) (18). WT Tat bound P-TEFb; however, the Q35L Tat mutant failed to efficiently bind either Cdk9 or CycT1, suggesting that site 35 is critical for binding P-TEFb (Fig. 3, A and B), and the loss of this binding possibly underlies the defective gene expression for this mutant (Fig. 2, C and D). Similarly, other factors that have recently been shown to interact with the Tat·P-TEFb complex and aid in transcriptional activation, such as ENL, AF9, AFF4, and ELL2, failed to bind to the Q35L Tat mutant (supplemental Fig. S7) (38). Interestingly, the DM Tat partially restores binding with Cdk9 and CycT1, likely the mechanism by which the I39Q mutation rescues the loss of function for the Q35L Tat mutant (Fig. 3, A and B).
To gain further insights into the loss of P-TEFb binding, we performed in silico modeling based on a recently solved structure of Tat·P-TEFb (37). These results indicated that Asn180 of CycT1 is positioned between and can form hydrogen bonds with a Gln at either Tat site 35 or 39. The Q35L mutation results in the loss of this hydrogen bonding, whereas the compensating mutation I39Q in the DM Tat enables Gln39 to replace this hydrogen bond with Asn180 in CycT1 (Fig. 3C). Although a previous study predicted that other naturally occurring mutations (except Tyr) could readily be accommodated at site 35 and maintain the structure of the protein complex (37), our analysis and accompanying experimental data suggest that the loss of hydrogen bonding may well be responsible for the drastic loss of function observed in the Q35L Tat mutant.
In contrast to the Q35L Tat mutant, however, the I39Q Tat mutant is able to bind CycT1 at levels close to the DM Tat (Fig. 3B), but lower than WT Tat, suggesting that its inability to activate gene expression (Fig. 2C) arises from reasons other than P-TEFb binding. To probe other transcriptional steps at which I39Q Tat may fail, we quantified viral transcripts. Cells infected with LGIT vectors containing one of the four Tat variants were stimulated with TNF-α 7 days post-infection, and infected, GFP+ cells were isolated by FACS. The sorted cells were allowed to relax for 9 days (Fig. 4A), total cellular RNA was extracted, and the levels of viral transcripts were quantified using RT-qPCR (36). The I39Q and DM Tat both had similar percentages of elongated transcripts (Fig. 4B); however, the I39Q Tat had much lower levels of total transcripts compared with the DM Tat (Fig. 4C), suggesting that it fails to induce transcription at the same efficiency as the DM Tat.
We explored the possibility that loss of gene expression for I39Q Tat arises due to its inability to interact with an upstream transcription factor such as Sp1 or a chromatin remodeling complex such as SWI/SNF (21,39). However, co-immunoprecipitation showed no differences in Sp1 binding between I39Q and the DM Tat (Fig. 5A). A change in interaction with SWI/SNF could alter disruption of the nucleosome (Nuc-1) situated at the transcription start site; however, nuclease sensitivity assays showed that both I39Q and the DM Tat had similar effects on Nuc-1 (Fig. 5B).
At the heart of viral gene expression is the apparent ability of Tat to affect RNAPII phosphorylation. Sequential phosphorylation of serines at positions 5 (Ser5) and 2 (Ser2) within an evolutionarily conserved but unstructured domain in mammalian RNAPII, consisting of 52 repeats of the heptapeptide Y1S2P3T4S5P6S7 at its CTD, is critical for mRNA synthesis and processing (40). Normally, RNAPII recruited to the promoter of a gene is phosphorylated at Ser5 by Cdk7 within the transcription factor complex TFIIH (41). Shortly after transcription initiation, the polymerase briefly stalls ~30-40 bp downstream of the transcription start site to allow for pre-mRNA processing steps such as capping (42,43). Phosphorylation at Ser2 by the P-TEFb complex then promotes transcriptional elongation. In HIV-1 gene expression, however, Tat directly recruits and enables P-TEFb to phosphorylate both Ser5 and Ser2, and thereby greatly enhances transcriptional elongation (11,12,44).
Consistent with these results, the recently solved crystal structure of the Tat·P-TEFb complex shows that Tat binding induces P-TEFb conformational changes (37). To analyze the potential structural effects of mutations at sites 35 and 39, we performed additional in silico modeling based on this structure of Tat·P-TEFb, either in complex with or without an ATP analog molecule (ATP+ or ATP−). Interestingly, both the ATP+ and ATP− structures of I39Q Tat are slightly energetically stabilized compared with WT Tat. In contrast, for the DM Tat the ATP+ structure was destabilized by 3.42 kcal/mol*, and the ATP− structure was stabilized by 2.38 kcal/mol* compared with WT Tat (Table 1). Based on the energetics of the ATP+ structures, these modeling results suggest that compared with I39Q Tat, P-TEFb associated with the DM Tat may have a higher propensity to transfer the phosphate group from ATP to a substrate and transit to the more stable ATP− state. Moreover, based on the collective evidence from literature, viral transcript data, and in silico modeling results, we hypothesized that the I39Q Tat mutant, unlike the DM, may fail to efficiently induce P-TEFb-mediated phosphorylation of the CTD of RNAPII (Fig. 4C and Table 1) (37,44).
To explore this potential phosphorylation defect for I39Q Tat during transcriptional initiation and early elongation involved in efficient escape of RNAPII from the promoter, we performed chromatin immunoprecipitation (ChIP) with qPCR analysis to quantify the levels of total and Ser5 and Ser2 phosphorylated RNAPII associated with the HIV-1 promoter in the presence of different Tat variants. Interestingly, even though similar levels of total RNAPII are recruited to the viral promoter (Fig. 6C), the level of Ser-5-P-CTD of RNAPII close to the transcription start site for the I39Q Tat mutant was dramatically lower than for the DM Tat, and slightly lower than for WT Tat (Fig. 6A). Similarly, the level of Ser-2-P-CTD of RNAPII for I39Q Tat during early elongation was significantly (p < 0.05) lower than both WT and DM Tat (Fig. 6B).
Thus, it appears that the combination of weak Ser-5-P-CTD of RNAPII, strong P-TEFb binding affinity, and high Ser-2-P-CTD of RNAPII, for the WT Tat or the combination of high Ser-5-P-CTD of RNAPII, moderate P-TEFb binding affinity, and high Ser-2-P-CTD of RNAPII for the DM Tat, mediates efficient escape of RNAPII from the HIV-1 promoter and activates gene expression for these two variants (Figs. 3B and 6, A and B). Thus, it is possible that WT and DM Tat achieve similar levels of gene expression through orthogonal combinations of P-TEFb binding and Ser-5-P-CTD of RNAPII (Fig. 6D).
In contrast, although the I39Q Tat displays moderate P-TEFb binding affinity, the extremely low levels of Ser-5-P-CTD and Ser-2-P-CTD of RNAPII likely impair its ability to activate gene expression (Figs. 3B and 6, A and B). Therefore, it appears that a Gln at site 35 (as seen for the WT and I39Q Tat) reduces the ability of Tat to induce P-TEFb-mediated Ser-5-P-CTD of RNAPII, but the presence of a Leu at site 35 (as in the Q35L and DM Tat) dramatically increases this function (Fig. 6A). However, Q35L Tat fails to bind P-TEFb and promote efficient escape of RNAPII from the promoter, as seen from the significantly lower levels (p < 0.05) of Ser-2-P-CTD of RNAPII for this mutant compared with WT and DM Tat. Thus, although the I39Q and DM Tat both have similar, but lower, P-TEFb binding affinity than WT Tat, the high Ser-5-P-CTD and Ser-2-P-CTD of RNAPII observed with the DM but not I39Q Tat apparently rescue gene expression.

DISCUSSION
For such a small protein, Tat shows a surprising diversity of function mediated by interaction with numerous cellular partners. It is post-translationally modified at specific sites by several cellular factors that impact transactivation. In addition to acetylation by PCAF and p300, there is evidence for methylation of Tat at Arg52 and Arg53 by the arginine methyltransferase PRMT6 (15) and methylation at Lys51 by the lysine methyltransferase Set7/9 (KMT7). Similarly, other lysine methyltransferases have been shown to interact with Tat (16,17). Tat is also phosphorylated by Cdk2 (Ser16, Ser46) and PKR (Ser62, Thr64, and Ser68) (18,19). Further evidence of the versatility of Tat can be seen in its interaction with other cellular proteins such as SKIP/SNW1 and SWI/SNF (20,21,23). Conserved and functionally important individual sites involved in these interactions can often be identified from multiple sequence alignments. However, the identification of mutually dependent coevolving sites, which can readily be missed by simple site conservation, is enabled through the use of statistical measures such as MI.
To date, the experimental discovery of correlated sites within the HIV-1 proteome, such as in Tat, reverse transcriptase, nucleocapsid, and Rev, has involved creation of libraries of viral proteins or long term culture of HIV-1 strains with single, site-directed mutations to reveal potential "suppressor mutations" (45-48). These approaches can sometimes yield either reversion of the introduced mutation or suppressor sites that do not naturally or specifically coevolve but act in a global, independent manner to increase fitness. By comparison, statistical analysis of viral sequence databases can identify positions whose evolution is correlated in a natural or clinical setting, as well as reveal correlations between new, unanticipated amino acid pairs. Here, we have harnessed MI to demonstrate functionally important correlations between pairs of sites in Tat.
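As an illustration of the statistical measure referred to above, the following minimal sketch computes the MI (in bits) between two columns of a sequence alignment from their empirical residue frequencies. It is not the analysis pipeline used in this study: the function name `mutual_information` and the toy columns are hypothetical, and the sketch assumes gap-free columns and applies no correction for finite sample size or phylogenetic relatedness, which a real analysis would likely need.

```python
from collections import Counter
from math import log2


def mutual_information(col_a, col_b):
    """Return the mutual information (bits) between two alignment columns.

    col_a and col_b are equal-length sequences in which element i is the
    residue found at each site in sequence i of the alignment.
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must come from the same set of sequences")
    n = len(col_a)
    if n == 0:
        raise ValueError("columns must not be empty")

    # Empirical single-site and joint residue counts.
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))

    # MI = sum over observed pairs of p(a,b) * log2(p(a,b) / (p(a) * p(b))).
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy columns (not real sequence data): the first two sites
# vary together, whereas the third varies independently of the first.
site_35 = "QQQQLLLL"
site_39 = "IIIIQQQQ"
site_other = "AAGGAAGG"

print(mutual_information(site_35, site_39))     # perfectly coupled pair
print(mutual_information(site_35, site_other))  # independent pair
```

In this toy case neither of the first two columns is conserved on its own, yet the pair carries the maximum MI possible for two equally frequent residues, whereas the independent pair carries none. This is the sense in which MI can flag sites that conservation alone would miss.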
In computationally guided experiments using a model lentiviral system mimicking the positive feedback loop in HIV-1, we found that single point mutations Q35L and I39Q yielded Tat variants that failed to activate gene expression from the viral LTR, with a majority of the proviruses existing in a silenced state that was activated only upon stimulation with pharmacological agents. However, introduction of both mutations Q35L and I39Q into the same Tat protein restored gene expression (Fig. 2, C and D). Thus, the Gln35-Ile39 and Leu35-Gln39 residue pairs both result in efficient gene expression from the viral promoter, for two Tat subtypes (Fig. 2C and supplemental Fig. S4C), confirming that sites 35 and 39 are coevolving. Furthermore, co-immunoprecipitation and ChIP studies revealed distinct, complementary mechanisms that constrain amino acid residues at these two sites: effective P-TEFb binding and alteration of P-TEFb substrate specificity to include the phosphorylation of Ser5 and Ser2 residues on the RNAPII CTD. Specifically, we show that the Q35L single mutant fails to bind P-TEFb, whereas the DM partially rescues P-TEFb binding (Fig. 3). In contrast, the I39Q Tat binds P-TEFb at levels close to the DM Tat, yet still suffers from very low gene expression (Figs. 2C and 3B) potentially due to its inability to induce P-TEFb to phosphorylate the CTD of RNAPII (Fig. 6, A and B). It is plausible that the inability of the I39Q Tat to induce efficient phosphorylation of the CTD of RNAPII involves loss of interaction with additional host factors that remain to be discovered. At any rate, unlike most coevolving or suppressor mutations that help restore a single biological function (49) 39 residue pair. Based on the P-TEFb binding assay and levels of Ser-5-P-CTD of RNAPII, it appears that different subtypes could potentially have evolved alternate modes or "solutions" to inducing gene expression from the viral LTR. 
Subtype B Tats induce a low level of Ser-5-P-CTD in RNAPII (Fig. 6A), but the strong binding to P-TEFb could at least in part compensate for this deficit (Fig. 3B) and eventually produce efficient elongation, as can be seen from the high levels of Ser-2-P-CTD of RNAPII, a marker for P-TEFb-induced elongation (supplemental Fig. S8). In contrast, the DM Tat, which mimics the majority of subtype C Tats at sites 35 and 39, induces very high levels of Ser-5-P (Fig. 6A) coupled with relatively weaker P-TEFb binding (Fig. 3B) that in combination could ultimately drive comparable levels of Tat-mediated gene expression as measured by GFP expression (Fig. 2C), viral transcript analysis, and Ser-2-P-CTD of RNAPII (supplemental Fig. S8). Thus, the diversification of HIV-1 into different subtypes has apparently resulted in the evolution of compensatory mechanisms to trade off substrate binding and catalytic activity in inducing Tat-mediated gene expression from the viral LTR, such that the overall activity may be determined by the combination or "sum" of contributions from individual positions or functions (Fig. 6D). This novel finding has some parallels with other biological systems. For instance, it has been shown previously that autophosphorylation mutants of the epidermal growth factor (EGF) receptor stimulate similar levels of MAP kinase activation, gene expression, and mitogenesis as the WT EGF receptor through different compensatory mechanisms (50).
FIGURE 6. I39Q Tat fails to efficiently induce phosphorylation of the CTD of RNAPII, and the subtypes of Tat have potentially evolved alternate modes of inducing viral gene expression. A-C, ChIP for Ser-5-P-CTD of RNAPII, Ser-2-P-CTD of RNAPII, and total RNAPII close to the transcription start site. Although similar levels of RNAPII are recruited to the viral promoter for all Tat variants, the I39Q Tat apparently fails to induce P-TEFb to efficiently phosphorylate the CTD of RNAPII, unlike the DM and WT Tat. Controls were performed without antibody. All qPCR measurements are in triplicate, and error bars represent S.D. * denotes statistically significant differences (p < 0.05) between the indicated pairs of Tat variants. D, plot of Ser-5-P-CTD of RNAPII versus CycT1 binding, and Ser-2-P-CTD of RNAPII versus CycT1 binding, for different Tat variants. The black and red symbols correspond to the levels of Ser-5-P-CTD of RNAPII and Ser-2-P-CTD of RNAPII for different Tat variants, respectively. The blue oval encompasses the Q35L and I39Q Tat variants that fail to activate gene expression, due either to an inability to bind P-TEFb (Q35L) or to a failure to induce P-TEFb to efficiently phosphorylate the CTD of RNAPII (I39Q). The green oval shows that the WT and DM Tat activate gene expression, although potentially through different mechanisms. Markers within the green oval show that subtype B Tat (WT Tat) displays high P-TEFb binding affinity and low Ser-5-P-CTD of RNAPII, whereas the DM Tat, mimicking most subtype C Tats at sites 35 and 39, shows moderate P-TEFb binding and high Ser-5-P-CTD of RNAPII, with both Tat variants displaying comparable levels of Ser-2-P-CTD of RNAPII.
We have previously found that gene expression from the LTR is a stochastic process, with bursts of mRNA production separated by long intervals, a feature that could play an important role in the establishment of viral latency (35,36,51). Changes to Tat that distinctly affect transcriptional initiation or elongation could differentially impact the frequency and size of mRNA bursts. Gene expression data at low Tat levels indicate that Tat variants from different subtypes, with potentially alternate mechanisms for inducing gene expression, could impact probabilistic gene expression events (supplemental Fig. S6). Future work may explore whether these differences in Tat result in different propensities for viral latency. Beyond HIV-1, such an integrated computational and experimental approach could readily be extended to other pathogens to gain deeper insights into their function and evolution, and could potentially aid in the rational development of novel therapeutic strategies.