Subdivision of the Helix-Turn-Helix GntR Family of Bacterial Regulators in the FadR, HutC, MocR, and YtrA Subfamilies*

Haydon and Guest (Haydon, D. J., and Guest, J. R. (1991) FEMS Microbiol. Lett. 63, 291–295) first described the helix-turn-helix GntR family of bacterial regulators. They presented its members as transcription factors sharing a similar N-terminal DNA-binding (D-b) domain but observed near-maximal divergence in the C-terminal effector-binding and oligomerization (E-b/O) domain. To elucidate this C-terminal heterogeneity, structural, phylogenetic, and functional analyses were performed on a family that now comprises about 270 members. Our comparative study, which focused first on the C-terminal E-b/O domains and then on the DNA-binding domains and palindromic operator sequences, has classified the GntR members into four subfamilies that we called FadR, HutC, MocR, and YtrA. Within each of these subfamilies, a degree of similarity of about 55% was observed throughout the entire sequence. Structure/function associations were highlighted, although they were not absolutely stringent. The consensus sequences deduced for the DNA-binding domain were slightly different for each subfamily, suggesting that fusion between the D-b and E-b/O domains has occurred separately, with each subfamily having its own D-b domain ancestor. Moreover, the compilation of the known or predicted palindromic cis-acting elements has highlighted different operator sequences according to our subfamily subdivision. The observed C-terminal E-b/O domain heterogeneity was therefore reflected in the DNA-binding domain and in the cis-acting elements, suggesting the existence of a tight link between the three regions involved in the regulatory process.

dimers, 2-fold symmetric DNA sequences in which each monomer recognizes a half-site. This group is now considered as a reference for understanding the general rules that govern protein-DNA interactions (9,10) and has also become a favorite target for evolutionary studies (8,11).
Among HTH transcriptional regulators, families have been identified through sequence comparisons and phylogenetic, structural, and functional analyses focused on DNA-binding domains and almost exclusively on the HTH structure, which is the only active motif that shows strong similarities among all members of the group (1, 4, 6-8, 11). These comparative studies have led to the determination of a specific HTH consensus pattern or signature for each family, providing the basis for a simple method of classification and detection of new members (12).
The lack of significant similarity among regions involved in effector binding or oligomerization systematically excludes these domains when family signatures are established, although they have important roles in the regulatory process. In fact, it is often the oligomerization of regulatory subunits and/or the conformational changes caused by the binding or removal of the inducing/repressing molecule that allow the correct positioning of the HTH motif and the subsequent DNA binding ability of the whole regulatory protein. The link between the two regions is therefore more intimate than it first appears from an amino acid comparison alone and may also be reflected in the DNA operator sequences, the third structural element involved in gene regulation.
To argue for the existence of a link between the regions involved in the regulatory process, we analyzed the HTH GntR family of bacterial regulators. As determined thus far, the family comprises about 270 members distributed among highly diverse bacterial groups and regulating a wide variety of biological processes. This family was first described by Haydon and Guest in 1991 (1) and was named after GntR, the repressor of the gluconate operon in Bacillus subtilis (13,14). Our interest in the properties of these bacterial regulators arises from the identification by our laboratory of the xlnR gene (15), whose chromosomal disruption in Streptomyces lividans relieves various extracellular enzymatic systems from glucose repression.
The first purpose of this report is to present, 10 years after the first comparative study, an update of the GntR family description. Moreover, we decided to analyze the full-length sequence of the proteins through amino acid comparisons, secondary structure predictions, phylogenetic tree construction, and functional analysis in order to find hidden specific characteristics among the regions that are generally not considered. Analyses extended to the regions outside the DNA-binding domain could lead to a more precise family signature and should define the subfamilies.

EXPERIMENTAL PROCEDURES
Selection of GntR-like Members-Members of the GntR family were identified from the SWISS-PROT/TrEMBL/GenBank™ sequence data bases (last update, June 2001) by a keyword search on the ExPASy Molecular Biology server and the NCBI server.² All sequences proposed by the data bases as belonging to the GntR family were used as query sequences for a BLAST search to verify the homology of their N-terminal DNA-binding domain to those of other GntR-like regulators. Proteins incorrectly classified as GntR-like by the sequence data bases, e.g. the Irr protein from Bradyrhizobium japonicum (16), were rejected from our comparative study, as were sequence fragments. We finally collected and analyzed about 270 members. For ease and usefulness of presentation, the best studied regulators (13-15), the most representative members, or proteins yielding data of specific interest were selected for publication. The 56 proteins discussed and presented in this paper are listed in Table I.
Secondary Structure Predictions-To identify homologous C-terminal sequences within the HTH GntR family, we started our comparative study from the level of the secondary structures, in which conservation is known to be less eroded during evolution. Secondary structure predictions result from the compilation of the PSI-pred, Predict Protein, Sspro, and Jpred automated prediction programs on the PredictProtein server.³ To assess the validity of our consensus prediction approach, we compared the theoretical model that we obtained for FadR (fatty acid-responsive regulator in Escherichia coli) with its experimentally resolved tertiary structure (52,53). The method proved to have an accuracy of >90% for FadR, with most of the inaccuracies occurring at the ...
Multiple Alignments and Phylogenetic Tree Construction-Multiple alignments were developed with the MULTIALIN (54) and CLUSTALW (55)⁴ programs, included in the ExPASy multiple alignment tool, followed by manual improvement by eye according to the predicted secondary structures. The advantage of these alignments resides in the integration of the structural reality of the proteins. Distances between aligned proteins were computed with the PROTDIST program using maximum likelihood estimates based on the Dayhoff PAM matrix (56). The FITCH program estimated phylogenies from the distance matrix data using the Fitch-Margoliash algorithm (57), and phylogenetic trees were drawn using the TREEVIEW program (58). The PROTDIST and FITCH programs are included in the PHYLIP package developed by Felsenstein (59).
² Found on the Web at www.expasy.ch and www.ncbi.nlm.nih.gov, respectively.
³ Found on the Web at dodo.cpmc.columbia.edu.
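The consensus prediction step described under Secondary Structure Predictions can be illustrated with a minimal Python sketch. The text does not state how the outputs of the individual programs were combined, so the per-residue majority vote used here is an assumption, as are the function names, the three-state H/E/C alphabet, and the rule that ties fall back to coil. The toy strings are invented and are not FadR predictions.

```python
from collections import Counter


def consensus_secondary_structure(predictions):
    """Per-residue majority vote over equal-length prediction strings.

    Each string uses H (helix), E (strand), or C (coil).
    A tie between the two most frequent states is reported as coil.
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    consensus = []
    for column in zip(*predictions):
        counts = Counter(column).most_common()
        top_state, top_count = counts[0]
        if len(counts) > 1 and counts[1][1] == top_count:
            top_state = "C"  # assumption: unresolved ties become coil
        consensus.append(top_state)
    return "".join(consensus)


def per_residue_accuracy(predicted, observed):
    """Fraction of positions where predicted and observed states agree."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Toy outputs of four hypothetical predictors for a 12-residue stretch.
toy_predictions = [
    "CHHHHHCCEEEC",
    "CHHHHCCCEEEC",
    "CCHHHHCCEEEC",
    "CHHHHHCEEEEC",
]
toy_consensus = consensus_secondary_structure(toy_predictions)
```

Comparing such a consensus string with the states derived from a resolved structure, as done in the text for FadR, then reduces to a call to `per_residue_accuracy`.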

RESULTS
As mentioned by Haydon and Guest (1), members of the GntR family of bacterial regulators share similar N-terminal DNA-binding domains, but high heterogeneity has been observed among the various C-terminal effector-binding and oligomerization domains. In order to elucidate the C-terminal dissimilarity, the N- and C-terminal domains were characterized separately.
The C-terminal Effector-binding and/or Oligomerization Domain-The construction of a phylogenetic tree deduced from the full-length multiple alignment of GntR-like members revealed that the C-terminal heterogeneity was limited to four E-b/O types: Fig. 1 shows four major and distinct clusters of branches. Likewise, two-dimensional structural predictions revealed four major types of E-b/O structural domain topologies (Fig. 2, a-d), with discrete variants in each subfamily and very few proteins (7%) escaping from this subdivision. The presence of four major types of C-terminal topologies suggests at least four different E-b/O domain donor-ancestors for the fusion to a common type of DNA-binding domain. The high similarity level (55%) calculated within each subfamily suggests that, once the fusion between the two domains had occurred, the proteins of a subfamily arose by duplication events.
The first GntR subfamily, which we called FadR, is the most represented one, as it groups 40% of GntR-like regulators. In this subfamily, the proteins consist of an all-helical C-terminal domain (Fig. 2a) with seven or six α-helices for the FadR and VanR subgroups, respectively. VanR-like regulators certainly derive from FadR-like proteins, as they only diverge by the loss of the first α-helix (α4). The average C-terminal length of the FadR and VanR subgroups is, respectively, about 170 and 150 amino acids. The crystal structure of the C-terminal domain of FadR (Protein Data Bank code 1EX2) has been determined (52,53) and, according to our comparative study, its three-dimensional data could be used as a scaffold to guide studies of the entire subfamily. Most of the FadR-like proteins are involved in the regulation of oxidized substrates related to amino acid metabolism or at the crossroads of various metabolic pathways such as aspartate (AnsR), pyruvate (PdhR), glycolate (GlcC), galactonate (DgoR), lactate (LldR), malonate (MatR), or gluconate (GntR).
In the second proposed subfamily, the C-terminal domain contains both α-helical and β-sheet structures arranged as shown in Fig. 2b. The subfamily is named HutC and comprises 31% of GntR-like regulators, including the cluster of proteins involved in conjugative plasmid transfer in various Streptomyces species (i.e. KorSA, KorA, and TraR proteins). The average length of the C-terminal domain is about 170 amino acids and, so far, no three-dimensional structural data on it are available. In this subfamily, the conservation of the structural elements has been altered at several positions (see for instance, β3, α7, and β6 in Fig. 2b). The observed altered E-b/O topology could be the result of structural accommodation in response to the highly diverse biological processes regulated by HutC-like members.
In the third subfamily, called MocR, the E-b/O domain is immediately distinguishable from others because of its exceptional average length of about 350 amino acids and its homology to the class I aminotransferase proteins (61) (see Fig. 2d). These proteins catalyze the reversible transfer of an amino group from the amino acid substrate to an acceptor α-keto acid. They require pyridoxal 5′-phosphate (PLP) as a cofactor to catalyze this reaction. Transamination reactions are of central importance in amino acid metabolism and in links to carbohydrate and fat metabolism. This class of aminotransferases acts as dimers in a head-to-tail configuration (62). Each subunit binds one molecule of PLP through an aldimine linkage with the ε-amino group of the conserved lysine residue in the PLP attachment site. The observed modular association with an aminotransferase-like C-terminal domain suggests that similar dimerization should occur in MocR-like proteins and that PLP is required as a cofactor for their regulatory activity. The most relevant evidence comes from PdxR in Streptomyces venezuelae, which is involved directly in the regulation of pyridoxal phosphate synthesis (47).
The fourth subfamily possesses a reduced C-terminal domain with only two α-helices (Fig. 2c). The subfamily, which we called YtrA, is the least represented, with only 6% of GntR-like regulators, most of which form part of operons involved in ATP-binding cassette (ABC) transport systems. As emerges from the alignment of YtrA-like proteins (Fig. 2c), the weaker identity observed between members suggests that the C-terminal domain has undergone some molecular recombinations or that the origins of the E-b/O domain could be multiple. The average length of the putative E-b/O domain is about 50 amino acids, and according to Yoshida et al. (49), this length should be too small to accommodate effector binding. Dimerization should remain possible, as numerous GntR-like palindromic operator sequences have been observed in the corresponding upstream regions (see "Operator Sites Analysis" below). The presence of many positively or negatively charged as well as hydrophobic and aromatic residues at the end of the domain suggests that dimer formation should occur through classical salt bridges and side-chain-side-chain hydrophobic interactions.
FIG. 1 (legend fragment). ... Table I. GntR-like regulators were classified in four subfamilies according to the four clusters of branches that emerged from the constructed tree, reflecting the observed C-terminal structural topology.
The DNA-binding Domain-As shown in Fig. 3, structural predictions revealed that the DNA-binding (D-b) domain topology of the whole GntR family is rather well conserved and all of the secondary structure elements are in similar relative positions. It consists of three α-helices and two (sometimes three) β-sheets arranged as follows: α1 α2 α3 β1 β2. According to FadR structural data, we can consider that the N-terminal DNA-binding domain of all GntR-like members contains a small β-sheet core and three α-helices, the HTH motif being formed by helices α2 and α3.
The average amino acid identity obtained for the DNA-binding domain of the entire GntR family is about 25%. This level is relatively low compared, for instance, with that of the LacI/GalR HTH family (45%). Thus, evidence of a common DNA-binding domain ancestor for the whole GntR family comes from the conserved structural topology rather than from amino acid conservation. When subfamilies are analyzed separately, the levels of identity and similarity rise to 40 and 60%, respectively. Therefore, the C-terminal structural subdivision is reflected in the DNA-binding domain and in the HTH motif itself. In fact, significantly different HTH consensus sequences have been obtained for each subfamily (Fig. 3) except between MocR and YtrA, where the differences are very weak.
FIG. 2 (legend fragment). ... Table I. Consensus sequences result from the multiple alignment of all GntR-like members and not only those listed in Table I. The high and low consensus levels were fixed arbitrarily at 80 and 40% of identity and are represented, respectively, by capital and lowercase letters. The similarity level was fixed at 80%. Symbols for conserved amino acid properties are as follows: !, conserved hydrophobic residues (ILVAMFYW); @, aromatic residues (FYW); −, negatively charged residues (ED); +, positively charged residues (RKH); E, small residues (GSATPN). Marker symbols indicate, in panel a, residues implicated in effector binding and dimerization of the FadR protein (52,53). Also in panel a, the underlined residue indicates mutations that affect gluconate binding ability in GntR (60). In panel d, the underlined residue in the consensus corresponds to the lysine that establishes the covalent link with pyridoxal phosphate in aminotransferases. Spaces in consensus sequences denote insertions within the alignment.
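The consensus rules stated in the figure legend above (capital letters at 80% identity, lowercase letters at 40% identity, property symbols at 80% similarity) can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure of the alignment software used in the study: the function names, the choice to count gaps in the denominator, the "." filler for unconserved columns, and the "~" symbol for small residues are all hypothetical, and the toy alignment is invented.

```python
from collections import Counter

# Property classes follow the legend; "~" for small residues is a
# placeholder symbol chosen for this sketch only.
PROPERTY_CLASSES = [
    ("@", set("FYW")),       # aromatic, checked first (subset of hydrophobic)
    ("!", set("ILVAMFYW")),  # hydrophobic
    ("-", set("ED")),        # negatively charged
    ("+", set("RKH")),       # positively charged
    ("~", set("GSATPN")),    # small
]


def column_consensus(column, high=0.80, low=0.40, similarity=0.80):
    """Return the consensus character for one alignment column."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return " "  # gap-only column, shown as a space
    total = len(column)  # assumption: gaps count in the denominator
    residue, count = Counter(residues).most_common(1)[0]
    if count / total >= high:
        return residue.upper()
    if count / total >= low:
        return residue.lower()
    for symbol, members in PROPERTY_CLASSES:
        if sum(r in members for r in residues) / total >= similarity:
            return symbol
    return "."


def alignment_consensus(aligned):
    """Build the consensus string of equal-length aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must have the same length")
    return "".join(column_consensus(col) for col in zip(*aligned))


# Invented five-sequence alignment, five columns wide.
toy_alignment = ["RLAID", "RLSVE", "RIGLQ", "RLTMK", "KVPFR"]
toy_result = alignment_consensus(toy_alignment)
```

In the toy alignment the first column is 80% arginine (capital letter), the second is 60% leucine (lowercase), the third contains only small residues, the fourth only hydrophobic residues, and the fifth shows no conservation.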
The fusion between the D-b domain and the E-b/O domain should have occurred separately for the FadR, HutC, and MocR/YtrA subfamilies, and none of the four subfamilies has emerged from one of the three others by internal molecular rearrangements. The high level of similarity observed between the D-b domains of the MocR and YtrA subfamilies also appears in the phylogenetic tree obtained from the full-length multiple alignment (Fig. 1). In fact, the two clusters arise from a common branch, highlighting a conserved amino acid composition in their N-terminal region. One of these two subfamilies could have emerged from the other through C-terminal domain replacement.
Only a few "anomalies" have been found in the two-dimensional N-terminal structural consensus (α1 α2 α3 β1 β2). The most frequent anomalies were the lack of the first α-helix (α1) (NtaR from Chelatobacter heintzii and EmoR from the EDTA-degrading bacterium BNC1) or the presence of an additional helix upstream of α1 (e.g. WhiH from Streptomyces aureofaciens or PdxR from S. venezuelae). We have also noticed that among YtrA regulators, a third, additional β-sheet is frequently predicted before α1.
Operator Sites Analysis-Although there is no precise "recognition code" involving a one-to-one correspondence between amino acids and bases (9), it is logical to suppose that highly conserved DNA-binding motifs may bind similar operator sequences. The known or putative inverted repeat operator sites recognized by some GntR-like proteins are compiled in Table II according to our previous C-terminal classification. Looking at the entire family, we observed that almost all bound sites are organized around a constant palindromic 5′-(N)yGT(N)xAC(N)y-3′ sequence. The most important divergence among the various operator sites resides in the number (y) and the nature (N) of the nucleotides that surround the above consensus sequence. Therefore, as observed by Weickert and Adhya (6) for the LacI/GalR family, the center of the palindrome seems to be highly conserved, whereas the peripheral regions diverge. The similar structural environment that resides at the center of the operator is generally considered the molecule-attracting region for these regulators, whereas the peripheral zones perform the operator discrimination role.
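The palindromic organization described above can be checked on a candidate site with a short Python sketch. It is a minimal illustration and not a method from this study: the function names, the symmetry score (fraction of positions identical to the reverse complement), the default maximum spacer length, and the test site are all assumptions made for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def symmetry_score(site):
    """Fraction of positions identical to the reverse complement.

    A perfect inverted repeat (palindrome) scores 1.0.
    """
    site = site.upper()
    rc = reverse_complement(site)
    return sum(a == b for a, b in zip(site, rc)) / len(site)


def core_matches(site, max_spacer=6):
    """List (start, spacer) pairs for every GT(N)xAC core with x <= max_spacer."""
    site = site.upper()
    hits = []
    for start in range(len(site) - 3):
        if site[start:start + 2] != "GT":
            continue
        for x in range(max_spacer + 1):
            end = start + 2 + x
            if site[end:end + 2] == "AC":
                hits.append((start, site[start + 2:end]))
    return hits


# Invented 10-bp site with a GT..AC core and an A/T spacer.
toy_site = "TTGTATACAA"
toy_score = symmetry_score(toy_site)
toy_cores = core_matches(toy_site)
```

The spacer string returned by `core_matches` gives both the number (x) and the nature of the nucleotides between the conserved GT and AC base pairs.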
The other relevant divergence between operators resides between the 5′-GT and 3′-AC conserved base pairs. In fact, although these positions are occupied almost exclusively by A and T residues, their number (x) and arrangement seem to differ from one subfamily to another. In the FadR and HutC subfamilies we deduced as the consensus 5′-t.GTa.tAC.a-3′ and 5′-GT.ta.AC-3′, respectively. Moreover, the distance between the half-sites is known to be of maximal importance for a correct operator site presentation on the DNA surface according to the flexibility of the linker between the DNA-binding and the E-b/O domains (72-75). This distance varies only slightly among the FadR and HutC subfamilies, although it fluctuates widely among the YtrA-like regulators. In this last subfamily, the conserved 5′-GT and 3′-AC residues are sometimes found far from the center of the palindrome. This larger variation among YtrA operators could be attributed to the low complexity of their C-terminal domains, which, added to weaker amino acid conservation, results in a mode of dimer formation specific for each member of the subfamily.
FIG. 3 (legend fragment). ... Table I. Consensus sequences result from the multiple alignment of all GntR-like members and not only those listed in Table I. The high and low consensus levels were fixed arbitrarily at 80 and 40% of identity and are represented, respectively, by capital and lowercase letters. The similarity level was fixed at 80%. Symbols for conserved amino acid properties are as follows: !, conserved hydrophobic residues (ILVAMFYW); @, aromatic residues (FYW); −, negatively charged residues (ED); +, positively charged residues (RKH); E, small residues (GSATPN). Marker symbols indicate, in FadR, residues implicated in DNA binding and dimerization (52,53). The mutation of the underlined residues affects the DNA binding ability of AphS (17), FadR (63), and GntR (64). Spaces in the consensus sequences denote insertions within the alignment.
So far, no cis-acting elements have been determined experimentally for the regulators of the MocR subfamily studied to date (PtsJ, PdxR, and MocR), preventing us from determining homologous putative sequences in their promoter regions. This subfamily presents another problem: most of these proteins are of unknown function, and therefore most of the regions upstream of the regulated genes are not available. A comparative study of the upstream regions of MocR-like genes did not reveal any palindromic sequence common to the whole subfamily, and very few MocR-like proteins presented a weakly similar putative GntR-like operator. These results suggest either that there is another type of cis-acting element specific to the MocR-like regulators or that autoregulation is not widespread among them. Crystallographic studies of the class I aminotransferases should provide useful information on the topology of the cis-acting elements typical of the MocR subfamily. In fact, as highlighted for the tyrosine aminotransferase (TyrB; Swiss-Prot accession no. P04693, Protein Data Bank code 3TAT) from E. coli (61), these proteins present a head-to-tail type of dimerization. As shown in Fig. 4, the head-to-tail configuration is not adapted to inverted repeats but is more appropriate for binding direct repeats that are sufficiently spaced to form DNA looping. Therefore, the lack of typical GntR-like operator sequences in the promoter regions of MocR-like regulators could be attributed to the way in which these proteins form dimers.
The deduced consensus operator sequences presented in Table II can be used as rapid operator site prediction tools. We tried to detect some of these in the Streptomyces coelicolor genome to highlight genes whose expression could be regulated by a member of the HTH GntR family. We chose the S. coelicolor genome for our investigation because of the exceptionally large number of GntR-like members sequenced in this strain. A rapid and non-exhaustive search using the DNA motif program⁵ revealed about 20 promoter regions that possess a putative GntR-like palindromic sequence. Because the C-terminal heterogeneity is reflected in the operator sequences, the number of candidate regulators for binding a specific GntR-like operator site is now reduced, as the investigation can be restricted to the members of a single subfamily.
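The use of a subfamily consensus as an operator prediction tool can be illustrated with a minimal Python sketch. It does not reproduce the DNA motif program used in the study. The scoring rule is an assumption made for illustration: capital letters of the consensus must match exactly, lowercase letters may mismatch up to a chosen limit, and "." matches any base. The function name, the mismatch limit, and the test sequences are hypothetical; the two consensus strings are those deduced in the text for the FadR and HutC subfamilies.

```python
FADR_CONSENSUS = "t.GTa.tAC.a"  # 5'-t.GTa.tAC.a-3' from the text
HUTC_CONSENSUS = "GT.ta.AC"     # 5'-GT.ta.AC-3' from the text


def scan_for_operator(sequence, consensus, max_soft_mismatches=1):
    """Slide the consensus along one strand and report candidate sites.

    Returns a list of (start, window, soft_mismatches) tuples.
    Both consensus strings above are their own reverse complement,
    so scanning a single strand is sufficient for them.
    """
    sequence = sequence.upper()
    width = len(consensus)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        soft = 0
        strict_ok = True
        for base, symbol in zip(window, consensus):
            if symbol == ".":
                continue  # any base accepted
            if symbol.isupper():
                if base != symbol:
                    strict_ok = False  # highly conserved position violated
                    break
            elif base != symbol.upper():
                soft += 1  # weakly conserved position, tolerated up to a limit
        if strict_ok and soft <= max_soft_mismatches:
            hits.append((start, window, soft))
    return hits


# Invented promoter fragments, each containing one planted site.
fadr_hits = scan_for_operator("CCCCTGGTACTACGAGGGG", FADR_CONSENSUS)
hutc_hits = scan_for_operator("AAGTCTAGACTT", HUTC_CONSENSUS)
```

Restricting the scan to the consensus of one subfamily mirrors the point made in the text: a hit obtained with the FadR pattern points to FadR-like regulators as the first candidates to investigate.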
However, we must also mention that a few GntR-like regulators recognize operator sites that do not fit into the consensus sequences presented in Table II. This is the case for TraR (44,76), AphS and BphS (18,19), and FucR (51), which bind boxes with no clearly defined symmetrical properties. Thus, although the consensus sequences presented in Table II should be regarded as interesting tools, for instance, in making sequencing projects maximally useful, they certainly should not be considered as unerring references, and some GntR regulators may not fit the general properties highlighted in this study.
⁵ Found on the Web at sanger.ac.uk/Projects/Scoelicolor/.

TABLE II
Comparison of known and predicted palindromic operator sites of GntR-like bacterial regulators
For function, bacterial strain, and accession numbers related to the protein abbreviations, see Table I. p, k, and c: putative, known, and consensus sequences, respectively. 1: GlcC from Pseudomonas aeruginosa. 2: half-site of a direct repeat. Mismatched bases are not highlighted and are shown in lowercase letters. TreR 01 means operator number one of the TreR protein.

DISCUSSION
The structural, phylogenetic, and functional analysis of about 270 members of the bacterial HTH GntR family led us to limit the C-terminal E-b/O domain heterogeneity to four major subfamilies that we called FadR, HutC, MocR, and YtrA. The presence of a few proteins escaping from this subdivision suggests that other subfamilies may be identified soon. Among members presenting a C-terminal domain that diverges from the four subfamilies defined above, the most interesting case comes from AraR in B. subtilis. The protein presents a GntR-like DNA-binding domain and a C-terminal domain typical of the HTH LacI/GalR family. AraR is a hybrid protein that is able to bind operator sites (AaACTTGT/A/T/ACAAGTaT) (50) that present the typical GntR signature, and its C-terminal domain binds a carbohydrate effector molecule (L-arabinose), as do most of the members of the LacI/GalR family. Recently, some proteins presenting this mosaic modular association have been sequenced (e.g. RliB from Lactococcus lactis ssp. lactis, Swiss-Prot accession no. Q9CFH6; SPY1602 from Streptococcus pyogenes, Swiss-Prot accession no. Q99YP7; CAC1340 from Clostridium acetobutylicum, Swiss-Prot accession no. Q97JE6), confirming within a short time the emergence of new subfamilies.
The fact that C-terminal E-b/O heterogeneity seems to be reflected in the DNA-binding domain and in operator sequences suggests the existence of a tight link between the three regions involved in the regulatory process. This is not really surprising: in vivo, in the evolutionary process, once a gene and its upstream region present a successful functional combination of the three regions involved in gene regulation, it seems legitimate that descendants emerging through gene duplication would present a relative conservation throughout the duplicated sequence. Conservation between the three regions could also be explained from a structural and functional point of view. Dimerization certainly imposes steric constraints on the D-b domain, reducing its mobility with respect to the rest of the protein. According to the studies carried out on AraC (72,74,75) (XylS/AraC HTH family) and LexA (73), both from E. coli, such a restricted mobility is thought to be due to interactions between the D-b and E-b/O domains and/or to interactions of part of the linker region with one of the two structural domains. These interactions might explain why a regulatory protein is limited, for instance, in its ability to accommodate a wide variation in distances between half-sites of palindromic operator sequences or to form DNA looping when cis-acting elements are separated by a nonintegral number of helix turns. Work on LexA shows that the DNA binding ability of a specific domain can be enhanced or diminished by fusing the D-b domain with some alternative dimerization domains (73). These results obtained in vitro could explain why in vivo, within a family that presents a conserved DNA-binding domain, we observed different operator consensus sequences according to the E-b/O heterogeneity.
Finally, we have also delimited how far the information available for a single protein can serve as the theoretical and experimental framework for the other members of the family. According to our comparative study, the structural data for the FadR protein (52,53) should be regarded as a reference for the whole GntR family concerning the DNA-binding domain but must be limited to the FadR subfamily concerning the E-b/O domain. Moreover, because of the daily increasing amount of genome sequences listed, it seems essential to update and extend the early comparative studies carried out on other families to make sequencing projects maximally useful.