Structural Basis for Sequence-specific DNA Recognition by an Arabidopsis WRKY Transcription Factor*

Background: The WRKY transcription factors recognize the W-box DNA element in target genes.
Results: We determined the NMR solution structure of the WRKY DNA-binding domain of Arabidopsis WRKY4 in complex with W-box DNA.
Conclusion: Apolar contacts by residues in the conserved WRKYGQK motif with thymine methyl groups are important in recognition of the W-box sequence.
Significance: This is the first structure of the WRKY-DNA complex.

The WRKY family transcription factors regulate plant-specific reactions that are mostly related to biotic and abiotic stresses. They share the WRKY domain, which recognizes a DNA element (TTGAC(C/T)) termed the W-box, in target genes. Here, we determined the solution structure of the C-terminal WRKY domain of Arabidopsis WRKY4 in complex with the W-box DNA by NMR. A four-stranded β-sheet enters the major groove of DNA in an atypical mode termed the β-wedge, where the sheet is nearly perpendicular to the DNA helical axis. Residues in the conserved WRKYGQK motif contact DNA bases mainly through extensive apolar contacts with thymine methyl groups. The importance of these contacts was verified by substituting the relevant T bases with U and by surface plasmon resonance analyses of DNA binding.

The WRKY transcription factor proteins have been identified from a wide range of higher plants (1–4) and compose one of the largest families of plant-specific transcription factors (5,6). Most WRKY proteins are involved in responses to biotic or abiotic stresses, such as pathogenic infection, injury, heat, drought, and high salinity (3,7–12). The WRKY proteins control transcription of the target genes by binding to the promoter regions that contain a DNA element called the W-box with the core sequence TTGACY (where Y is C or T) (3,4,7).
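The W-box core definition above (TTGACY, with Y being C or T) can be expressed as a simple pattern search. The following is a minimal sketch, not a method used in this study; the function names are hypothetical, and the test sequence is the 16-mer described under "Experimental Procedures."

```python
import re

# W-box core as defined in the text: TTGACY, where Y is C or T.
# The lookahead lets overlapping matches be reported as well.
W_BOX = re.compile(r"(?=(TTGAC[CT]))")

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_w_boxes(seq: str):
    """List (strand, start, match) for each W-box core on either strand.

    `start` is the 0-based position in the strand as read 5' to 3'.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for m in W_BOX.finditer(s):
            hits.append((strand, m.start(1), m.group(1)))
    return hits


# The 16-mer used for the complex contains a single core on the first strand.
print(find_w_boxes("CGCCTTTGACCAGCGC"))
```

With 0-based indexing, the reported start of 5 corresponds to T6 in the base numbering used in this paper.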
An ~60-amino acid DNA-binding domain called the WRKY domain is shared by the WRKY family, and it contains an invariant sequence, WRKYGQK, and a zinc-binding motif (5). The WRKY proteins that possess two WRKY domains are classified into group I, whereas those that possess a single WRKY domain are classified into group II or III, mainly according to the amino acid sequences of the zinc-binding motif (5). For the group I WRKY proteins, the C-terminal WRKY domain (not the N-terminal domain) is responsible for the recognition of the W-box sequence (1,4,7,13).
We previously determined the solution structure of the C-terminal WRKY domain of the Arabidopsis WRKY4 protein (AtWRKY4-C), which comprises a four-stranded β-sheet (14). Furthermore, a crystal structure of the equivalent domain of the Arabidopsis WRKY1 protein has been reported, and it is very similar to our AtWRKY4-C structure except that it contains an additional β-strand at the N terminus of the domain (15). From the results of NMR titration analysis, we proposed a structural model for the complex between AtWRKY4-C and DNA in which the strand containing the conserved WRKYGQK sequence enters the major groove of DNA (14). Mutational analyses have indicated that this sequence is important for DNA binding (13,15,16). However, it was previously unknown how the WRKY domains specifically recognize the W-box sequence.
In this study, the three-dimensional structure of the complex of AtWRKY4-C with DNA containing the W-box sequence was determined by NMR spectroscopy. DNA bases were recognized mainly through apolar contacts involving the methyl groups of the T bases. The importance of the apolar contacts was verified by substituting T bases with U and by binding analyses using surface plasmon resonance (SPR).

EXPERIMENTAL PROCEDURES
Sample Preparation-The ¹³C/¹⁵N-labeled, ¹⁵N-labeled, and unlabeled AtWRKY4-C (Val-399–Ala-469) proteins were produced by cell-free protein synthesis with optimization for zinc-binding proteins (17,18) and were partially purified by immobilized metal affinity chromatography using an automated system as described previously (19). The eluted protein was cleaved with tobacco etch virus protease to remove the His tag and subsequently exchanged into 20 mM Tris-HCl buffer (pH 8.0) containing 300 mM NaCl, 5 mM imidazole, 1 mM iminodiacetate, and 50 μM ZnCl₂ using a HiPrep 26/10 desalting column (GE Healthcare). The protein-containing fraction was applied to a HisTrap column (GE Healthcare), and its flow-through fraction was pooled. [¹⁵N]Thy-labeled and unlabeled 16-mer double-stranded DNAs (5′-CGCCTTTGACCAGCGC-3′/5′-GCGCTGGTCAAAGGCG-3′, containing the W-box core sequence TTGACC) were chemically synthesized (Tsukuba Oligo Service), and [¹⁵N]thymidine phosphoramidite (CIL International, Andover, MA) was used to label the five T bases in the sequence. The concentration of the protein was determined at A₂₈₀ with an extinction coefficient estimated from the amino acid sequence (20), whereas that of the double-stranded DNA was determined at A₂₆₀ using an extinction coefficient that was calculated after digestion of the strands with phosphodiesterase I (Worthington). For the NMR measurements, 0.4–1.0 mM 1:1 protein-DNA complex was dissolved in 20 mM potassium phosphate buffer (pH 6.0) containing 200 mM KCl, 20 μM ZnCl₂, 1 mM deuterated dithiothreitol (Isotec Inc.), 0.05 mM sodium 2,2-dimethyl-2-silapentane-5-sulfonate, and 5% D₂O unless stated otherwise. For the measurement of residual dipolar couplings (RDCs), 12 mg/ml Pf1 phage (ASLA Biotech, Riga, Latvia) (21) was added. Further descriptions regarding selection of the buffer system are provided under supplemental "Materials."
NMR Analyses and Structure Determination-NMR spectra were recorded on a Bruker DMX-750 (750.13 MHz for ¹H and 76.02 MHz for ¹⁵N) or DMX-500 (500.13 MHz for ¹H, 125.76 MHz for ¹³C, and 50.68 MHz for ¹⁵N) spectrometer at 298 or 303 K. The protein backbone and side chain resonance assignments were partly obtained from the previous DNA titration experiments (14) and completed here by a series of triple-resonance experiments (22). The resonances of DNA were assigned using the typical base-sugar connectivities for DNA strands appearing in NOESY spectra at a mixing time of 100 ms (supplemental Fig. S1) (23). Chemical shifts are referenced to internal sodium 2,2-dimethyl-2-silapentane-5-sulfonate directly (¹H) or indirectly (¹³C and ¹⁵N) as recommended in the Biological Magnetic Resonance Data Bank. The completeness of the assignments in the region used for the structure determination was 93.1% for non-labile and backbone amide protons of the protein and 97.4% for non-labile protons of DNA excluding H4′, H5′, and H5″ atoms. A heteronuclear single-quantum correlation (HSQC) spectrum of a sample dissolved in 99.96% D₂O (Isotec Inc.) after lyophilization was recorded at 283 K to identify the hydrogen bonds in the protein.
For evaluation of ¹H-¹⁵N RDCs, the in-phase/anti-phase (IPAP) HSQC (24) spectra were recorded, where the IPAP subspectra were obtained in an interleaved manner. The two subspectra were processed by addition or subtraction to yield the other two subspectra containing upfield or downfield components of the cross-peaks. The RDCs for the protein backbone amides, arginine side chain Nε-Hε groups, and DNA T base imides were obtained in separately measured spectra, where the parameters were optimized so as to maximize the intensities of the respective peaks. The structure was calculated using the CNS program (25) as described under supplemental "Materials."
Structure Analyses-Intermolecular contacts were analyzed by in-house Fortran programs and Insight II (Accelrys, San Diego, CA). A hydrogen bond was defined by a hydrogen donor and acceptor distance that was <3.5 Å and a donor-hydrogen-acceptor angle that was >110°. An electrostatic attraction was defined by a distance between a side chain nitrogen of Arg/Lys and a phosphate oxygen that was <5.0 Å. An apolar contact representing both hydrophobic and van der Waals forces was defined by a C-C distance that was <5.0 Å.
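The three contact criteria above translate directly into geometric tests on atomic coordinates. The sketch below is an illustration in Python, not the in-house Fortran code used in this study; the function names and the coordinate tuples are hypothetical, and only the cutoffs (3.5 Å, 110°, 5.0 Å) come from the text.

```python
import math

# Cutoffs taken from the definitions in the text.
HBOND_MAX_DIST = 3.5      # donor-acceptor distance, in angstroms
HBOND_MIN_ANGLE = 110.0   # donor-hydrogen-acceptor angle, in degrees
ELECTROSTATIC_MAX = 5.0   # Arg/Lys side chain N to phosphate O, in angstroms
APOLAR_MAX = 5.0          # carbon-carbon distance, in angstroms


def angle_deg(a, b, c):
    """Angle a-b-c (vertex at b) in degrees for 3D points."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    dot = sum(p * q for p, q in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def is_hydrogen_bond(donor, hydrogen, acceptor):
    """Donor-acceptor distance < 3.5 A and D-H-A angle > 110 degrees."""
    return (math.dist(donor, acceptor) < HBOND_MAX_DIST
            and angle_deg(donor, hydrogen, acceptor) > HBOND_MIN_ANGLE)


def is_electrostatic(sidechain_n, phosphate_o):
    """Arg/Lys side chain nitrogen within 5.0 A of a phosphate oxygen."""
    return math.dist(sidechain_n, phosphate_o) < ELECTROSTATIC_MAX


def is_apolar(carbon_a, carbon_b):
    """Two carbon atoms closer than 5.0 A."""
    return math.dist(carbon_a, carbon_b) < APOLAR_MAX
```

Applying such tests to every member of a structure ensemble would give the fraction of structures in which each contact is present, which is how per-contact frequencies could be tabulated.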

RESULTS AND DISCUSSION
Structure Determination-To determine the structure, we used a WRKY domain and a 16-mer double-stranded DNA with the same sequences as those used for previous protein structure determination and DNA titration analysis, for which a 1:1 binding stoichiometry was revealed by SPR (14). Assignments of the protein backbone resonances in the complex had been essentially completed by the previous DNA titration analysis and three-dimensional NMR analyses. Therefore, in this study, resonance assignments of the protein side chains and DNA chains were performed. In the spectra of the complex, intermolecular NOEs were observed, and these directly defined the geometry of the molecular interface (supplemental Fig. S1).
In addition to the distance- and dihedral angle-based restraints, we employed RDCs (26,27) to restrain the relative angles between the bond vectors and the overall molecular alignment axes. For this purpose, ¹⁵N-labeled T bases were introduced into the DNA, and IPAP-HSQC spectra (24) were measured for the complex with uniformly ¹⁵N-labeled protein with or without the partial alignment induced by filamentous phage (Fig. 1) (21). The complex structure was calculated in a sequential manner as described under supplemental "Materials." The structure thus obtained satisfied the experimental restraints, had near-ideal stereochemical properties, and showed good convergence (Fig. 2a, Table 1, and supplemental Fig. S2). The z axis of the alignment tensor generated by the phage was found to be approximately parallel to the DNA helical axis, whereas the x axis, which is the longer of the two rhombic axes, was nearly parallel to the vector from the center of the DNA to that of the protein (Fig. 2a). Therefore, the RDCs appeared to restrict the individual ¹H-¹⁵N vectors relative to the axes that were essentially defined by the DNA helical structure and the protein-DNA interaction.
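The way an RDC is extracted from spectra measured with and without partial alignment can be sketched as follows. This is a generic illustration rather than the processing actually used here: it assumes that the splitting between the two IPAP components equals J in the isotropic sample and J + D in the aligned sample, and the function names and numerical values are hypothetical.

```python
def splitting_hz(downfield_hz: float, upfield_hz: float) -> float:
    """Signed splitting between the two doublet components, in Hz."""
    return downfield_hz - upfield_hz


def residual_dipolar_coupling(aligned_splitting_hz: float,
                              isotropic_splitting_hz: float) -> float:
    """RDC as (J + D) from the aligned sample minus J from the isotropic one.

    The sign convention is a choice: the one-bond 1H-15N scalar coupling is
    negative, so the couplings passed in should be signed consistently.
    """
    return aligned_splitting_hz - isotropic_splitting_hz


# Hypothetical values for a single amide, for illustration only.
d_nh = residual_dipolar_coupling(aligned_splitting_hz=-88.5,
                                 isotropic_splitting_hz=-93.0)
print(d_nh)
```

Each such value constrains the orientation of one ¹H-¹⁵N bond vector relative to the alignment tensor axes described above.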
Structure of the Complex-The structure of the protein moiety of the complex consists of a four-stranded β-sheet (β1, Trp-414–Val-422; β2, Tyr-427–Thr-436; β3, Cys-439–Arg-447; and β4, Val-455–Glu-460), which is similar to that of the protein not bound to the DNA (14), with a backbone root mean square deviation of 1.9 Å (Fig. 2, b and c). The DNA is in the B-form with a slight bend toward the protein. The β-sheet plane is almost perpendicular to the DNA helical axis but is slightly tilted to fit the rim of the sheet into the major groove. The rim strand, i.e. the β1-strand that contains the invariant WRKYGQK sequence, composes the major molecular interface. The revealed binding mode, which is largely consistent with the previous model that was based on the DNA titration analysis (14), is called a β-wedge in this study, as described below.
It has been suggested that Gly-418 of the WRKYGQK sequence was irregularly inserted into the typical antiparallel β-sheet, and this induced a kink in this strand (14). The present structure revealed that the kink created the convex curvature of the β-sheet rim and thereby enabled the close contact of this strand with DNA bases. In addition, the formation of the complex significantly altered the position of this strand relative to the others (Fig. 2c). This appeared to influence the structure of the loop connecting the β1- and β2-strands and that connecting the β3- and β4-strands as well and thereby slightly altered the length of the strands.
Contacts in the Molecular Interface-Eight bases in 7 consecutive bp are contacted by Arg-415, Lys-416, Tyr-417, Gly-418, Gln-419, and Lys-420 of the β1-strand, i.e. all of the residues in the invariant WRKYGQK sequence except Trp-414, through apolar and hydrogen-bonding interactions (Fig. 3a). All of the contacts, including those with the sugar phosphate backbone, cover the range of 8 bp (Fig. 3a). The details are described below.
The side chain carbon atoms of Arg-415 form extensive apolar contacts with the T5 base carbons, mainly of the methyl groups, and the backbone sugar carbons of C4. At the same time, it is possible for the guanidyl group of Arg-415 to form electrostatic interactions with the A7′ and C8′ phosphates.
The side chain and backbone carbon atoms of Lys-416 form apolar contacts with the T6 and T7 methyl groups and T5 sugar atoms. At the same time, the backbone amide and side chain amino groups of Lys-416 form hydrogen bonds with the phosphates of T5 and T6, respectively (Fig. 3b). The hydrogen bonding by the backbone amide is consistent with the previous observation that binding to the DNA induced a very large downfield shift (~1.8 ppm) of the amide proton resonance (14). In addition, the hydrogen bond between the Lys-416 amino and T6 phosphate groups is strengthened by electrostatic attraction. This amino group also forms an electrostatic interaction with the T7 phosphate.
The aromatic rings of Tyr-417 and Tyr-431 and the backbone of Gly-418 surround the T9′ methyl group and form extensive apolar contacts (Fig. 3, a and c); this is consistent with the observed intermolecular NOEs and the upfield-shifted resonances of the T9′ base protons (supplemental Fig. S1). Moreover, the aromatic carbons of Tyr-417 contact the T6 base, T7 base (mainly the methyl groups), C8′ base, and T9′ sugar carbons by apolar interactions, and the hydroxyl group of the same residue simultaneously contacts the T9′ phosphate by hydrogen-bonding interactions. Gly-418 also forms extensive apolar contacts with the T7, G8, and G8′ base carbons, which is enabled by the deep entrance of this residue into the DNA groove.
The side chain amide nitrogen and side chain/backbone carbons of Gln-419 form a hydrogen bond with the T7 phosphate and apolar contacts with the T7 base, respectively. Lys-420 contacts the G10′ base, G11′ base, and G11′ sugar carbons by apolar interactions and simultaneously forms hydrogen bonds with the N7/O6 atoms of G10′ and/or the phosphate oxygens of G11′. In addition, it contacts the phosphates of G10′ and/or G11′ by electrostatic attraction.
In addition to the above residues, Arg-413, Lys-423, Arg-429, Lys-433, and Arg-442 form hydrogen bonds and/or electrostatic contacts with phosphate groups (Fig. 3a). Arg-429 and Lys-433 are located on the same strand, i.e. the β2-strand, but protrude to opposite sides of the β-sheet. Arg-442 is located on the β3-strand, considerably distant from the major contacting strand, i.e. the β1-strand. These contacts appear to be enabled after the close fitting of the β-sheet to the DNA groove and therefore to contribute significantly to the fixing of the binding geometry.
Verification of Importance of Apolar Contacts with T Bases-As described above, the recognition of the W-box sequence by AtWRKY4-C was achieved mainly by extensive apolar contacts with the methyl groups of the T bases (Fig. 3). We verified the importance of these contacts by substitution of the T bases with U (namely, elimination of the methyl groups) and evaluation of binding affinities by SPR (Fig. 4 and Table 2). The equilibrium SPR response value appeared to reach the maximum of that expected when the proteins bound to all of the immobilized DNAs at a 1:1 stoichiometry (Fig. 4a, left panel, arrow) or slightly more for the 16-mer W-box DNA without substitution. A binding constant of 1.9 × 10⁷ M⁻¹ was obtained by fitting the data to the simple 1:1 binding model (Fig. 4b and Table 2). In contrast, binding to the DNA in which T9′ was substituted with U appeared too weak to reach a plateau level within the protein concentration range of the present experiment (Fig. 4a, right panel), for which an ~25-fold smaller binding constant was obtained by data fitting (Fig. 4b and Table 2). These results verified the importance of the extensive apolar contacts involving the T9′ methyl group (Fig. 3c). NMR revealed that even in the relevant weak complex, the framework of protein-DNA interaction was probably conserved (supplemental Fig. S1).
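The simple 1:1 binding model can be illustrated with a short calculation of the expected equilibrium response. The sketch below assumes the standard Langmuir form, R_eq = R_max·K_A·C / (1 + K_A·C); the protein concentrations and the normalized R_max are hypothetical, the T9′U constant is derived from the approximate 25-fold factor rather than taken from Table 2, and no fitting is performed.

```python
def equilibrium_response(conc_m: float, k_a: float, r_max: float = 1.0) -> float:
    """Equilibrium response for 1:1 binding: R_max * K_A * C / (1 + K_A * C)."""
    x = k_a * conc_m
    return r_max * x / (1.0 + x)


K_A_WBOX = 1.9e7             # M^-1, binding constant reported in the text
K_A_T9U = K_A_WBOX / 25.0    # M^-1, from the ~25-fold reduction (approximate)

# Hypothetical protein concentrations (M), chosen only for illustration.
for conc in (1e-8, 1e-7, 1e-6):
    print(conc,
          round(equilibrium_response(conc, K_A_WBOX), 3),
          round(equilibrium_response(conc, K_A_T9U), 3))
```

Under this model, half-saturation occurs at C = 1/K_A, which is about 53 nM for the unsubstituted DNA and roughly 25 times higher for the T9′U DNA; this is consistent with the latter not reaching a plateau in the same concentration range.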
In addition, the substitution of T6 or T7 significantly decreased the affinity, by 11- or 2.3-fold, respectively (Table 2). T6 and T7, as well as T9′, are the conserved T bases in the W-box core sequence (TTGACY) and are each contacted by two amino acids or more of AtWRKY4-C (Fig. 3a). The other T bases in the DNA used, i.e. T5 and T12′, which are outside of the W-box sequence, showed no sensitivity to the substitution (Fig. 4b and Table 2). T12′ is not contacted by the domain, whereas T5 is contacted by only a single residue in AtWRKY4-C (Fig. 3a).
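The fold changes in affinity can be converted to approximate free energy differences with ΔΔG = RT ln(fold). This conversion is not part of the original analysis; the sketch below assumes a temperature of 298 K, which is not stated for the SPR measurements in this section, and the function name is hypothetical.

```python
import math

R_GAS = 8.314462618   # gas constant, J mol^-1 K^-1
TEMP_K = 298.0        # assumed temperature; not given in the text for SPR


def ddg_kj_per_mol(fold_weaker: float, temp_k: float = TEMP_K) -> float:
    """Loss in binding free energy (kJ/mol) for a fold decrease in K_A."""
    return R_GAS * temp_k * math.log(fold_weaker) / 1000.0


# Fold decreases reported in the text for T-to-U substitutions.
for base, fold in (("T9'", 25.0), ("T6", 11.0), ("T7", 2.3)):
    print(base, round(ddg_kj_per_mol(fold), 2), "kJ/mol")
```

Under the assumed temperature, these correspond to roughly 8, 6, and 2 kJ/mol for the T9′, T6, and T7 methyl groups, respectively.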
Implications for Preference for DNA Bases-At the position upstream of the W-box core sequence (TTGACY), i.e. the base equivalent to T5 of the present DNA, a G base is most preferred (appreciably better than A or T), as shown by a systematic study on the promoter sequences of the target genes (16). A simple model based on the present complex structure showed that Arg-415 could form a hydrogen-bonding contact with the G base at this position (supplemental Fig. S3). This contact is essentially in the manner typical of the recognition of the G base by Arg (28) and may explain the preference for DNA bases.
This preference is significant for AtWRKY6 and AtWRKY11, which belong to groups IIb and IId, respectively, but not for AtWRKY26, AtWRKY38, and AtWRKY43, which belong to groups I, III, and IIc, respectively, as revealed by experiments using base substitutions (16). It should be noted that, for AtWRKY26 and AtWRKY43, Arg is conserved at a position that is equivalent to Arg-413 of AtWRKY4, but the equivalent residue is Gln or Ser for AtWRKY6 and AtWRKY11. Therefore, we suggest the possibility that the hydrogen-bonding contact between Arg-413 and a phosphate (Fig. 3a), which is reinforced by electrostatic attractions, ensures the stability of the complex without the contact between Arg-415 and the G base. In contrast, Gln or Ser at this position is not capable of gaining electrostatic attraction but is capable of forming hydrogen bonds with phosphate, as simply modeled based on the present complex structure (data not shown). A basic residue, i.e. Arg or Lys, is conserved at this position in all group I and IIc WRKY proteins, and Gln and Ser are conserved in group IIb and IId proteins, respectively (5). Therefore, the above explanation is valid for all proteins belonging to these WRKY groups. However, AtWRKY38 possesses Leu at this position, and it may not form a hydrogen bond or gain electrostatic attractions. This protein, as well as other group III WRKY proteins, possesses basic residues in several different sites compared with the proteins of other WRKY groups and therefore may possess a slightly different binding framework, which consequently may not require the Arg-G contact. As discussed above, the differences in amino acids (particularly those capable of contacting DNA) among the WRKY groups may lead to a preference for bases that flank the TTGACY core motif and thereby enable selective and concerted control of many target genes of the WRKY family of transcription factors.
… Fig. 1a), which were drawn looking from the major groove side of DNA. DNA bases, the sugar backbone, and phosphates are shown by squares, black lines, and yellow circles, respectively. Bases contacted by the protein are highlighted in red. Cyan, yellow, and green lines represent hydrogen bonds, contacts by electrostatic attraction, and apolar contacts, respectively, as defined under "Experimental Procedures." Solid and dashed lines indicate that these contacts were identified in ≥75% and ≥40% of the structures, respectively. b, hydrogen bonds formed by backbone and side chain nitrogen atoms of Lys-416 are indicated by dashed cyan lines. The average and minimal (in parentheses) distances between donor nitrogen and acceptor oxygen atoms in the selected structures are shown. c, apolar contacts with the methyl group of T9′ (dashed green lines), with the average and minimal (in parentheses) C-C distances indicated. The figures in b and c were produced using the Insight II molecular display program.
Implications for Requirement of Amino Acids-The involvement of the WRKYGQK sequence in DNA binding was investigated by mutational experiments (13,15,16). Among the residues, Trp, Tyr, and two Lys residues were clearly indicated as indispensable in DNA binding (13,15). The Trp residue, i.e. Trp-414 of AtWRKY4 (residue numbers in AtWRKY4 are used for the equivalent residues in this section), forms the structural core of the domain (14), so the mutation to Ala (13) may disrupt the correct structure that is necessary for DNA binding. Lys-416, Tyr-417, and Lys-420 directly contact DNA bases and/or the sugar phosphate backbone (Fig. 3), and mutations of Lys-416 to Ala, Tyr-417 to Ala or Arg, and Lys-420 to Ala abolished DNA-binding activity (13,15). It should be noted that mutations of Tyr-417 to Phe (15,16) and Lys-416 to Arg (15) only partially impaired the activity, which is consistent with the present complex structure in that the Tyr-to-Phe mutation should maintain the apolar contacts between the aromatic ring and the T9′ methyl group and in that the Lys-to-Arg mutation maintains hydrogen-bonding/electrostatic interactions with the phosphate. Furthermore, the importance of Gly-418 was clear, as large decreases in affinity were observed upon mutation to Ala or Phe (13,15). The addition of a side chain probably prevents the deep entrance of Gly-418 into the DNA groove.
In contrast, mutations of Arg-415 or Gln-419 did not affect complex formation (13,15), although contacts involving these residues were observed in the present complex structure (Fig. 3a). These results indicate that the contacts by the above residues are not indispensable for forming the complex, at least under the conditions of the relevant electrophoretic experiments. In addition, in this study, substitution of the T5 base with U did not impair the affinity (Fig. 4 and Table 2), indicating that the apolar contacts between Arg-415 and the T5 methyl group are not indispensable. However, we hypothesize that, under cellular conditions, which would differ considerably from the experimental ones, these contacts, including the presumed contact between Arg-415 and the G base (supplemental Fig. S3), may contribute to ensuring the binding and the preference for DNA bases.
MARCH 2, 2012 • VOLUME 287 • NUMBER 10
In addition to the WRKYGQK residues, Arg-429 and Lys-433 are important in the binding (15), which is consistent with the present structure (Fig. 3a). In particular, it was demonstrated that Arg and Lys were interchangeable at position 433 (15), and both could form hydrogen bonds and electrostatic attractions simultaneously, as observed in the present structure.
DNA-binding Domains Containing β-Sheets as the Molecular Interface-The majority of the DNA-binding domains utilize α-helices to contact DNA, whereas a relative minority of them utilize β-sheets. Many of the latter, such as MetJ-Arc repressors (29,30), the AtERF1 ERF domain (Fig. 5a) (31), Tn916 integrase (32), and the THAP zinc finger (Fig. 5b) (33,34), have two- or three-stranded β-sheets that fit into the major groove of DNA. For these, the β-sheet plane is approximately parallel to the DNA helical axis around the molecular interface. The contacting amino acid side chains are located on one side of the β-sheet plane that is supported by α-helices on the other side.
In contrast, the WRKY domain presented here and the GCM domain (Fig. 5c) (35) utilize four- or five-stranded β-sheets, which are too large to fit into the DNA groove in the manner described above. Instead, the β-sheets enter the groove with planes that are approximately perpendicular to the DNA helical axis. The proteins utilize the rim of the β-sheet, so the side chains that are located on both sides of the plane are involved in the contacts. An α-helix supporting the β-sheet was not identified for the WRKY domain (14,15), whereas a short α-helix appeared to support the relevant β-sheet of the GCM domain, although it was located apart from the molecular interface. We refer to this atypical binding mode as a β-wedge because the β-sheet appears to sharply cut into the DNA groove.
On the basis of this shared binding mode together with the similarity in the arrangement of the zinc-binding residues, it has been proposed that the WRKY and GCM domains are evolutionarily related, although the latter is shared only by animals (14,36,37). It should be pointed out that the β1-strand (the rim strand) of the WRKY domain is kinked in the middle by the insertion of Gly-418, whereas that of the GCM domain is short enough to be included in the major groove without a kink.
For the NAC domain, which composes another major family of plant-specific transcription factors, the β-sheet structure possesses a basic rim strand with a kink that was induced by the insertion of a Gly residue (38), which is similar to the WRKY domain. Therefore, NAC may adopt the β-wedge mode of DNA binding and may be evolutionarily related to the WRKY and GCM transcription factors.