A global view of structure–function relationships in the tautomerase superfamily

The tautomerase superfamily (TSF) consists of more than 11,000 nonredundant sequences present throughout the biosphere. Characterized members have attracted much attention because of the unusual and key catalytic role of an N-terminal proline. These few characterized members catalyze a diverse range of chemical reactions, but the full scale of their chemical capabilities and biological functions remains unknown. To gain new insight into TSF structure–function relationships, we performed a global analysis of similarities across the entire superfamily and computed a sequence similarity network to guide classification into distinct subgroups. Our results indicate that TSF members are found in all domains of life, with most being present in bacteria. The eukaryotic members of the cis-3-chloroacrylic acid dehalogenase subgroup are limited to fungal species, whereas the macrophage migration inhibitory factor subgroup has wide eukaryotic representation (including mammals). Unexpectedly, we found that 346 TSF sequences lack Pro-1, of which 85% are present in the malonate semialdehyde decarboxylase subgroup. The computed network also enabled the identification of similarity paths, namely sequences that link functionally diverse subgroups and exhibit transitional structural features that may help explain reaction divergence. A structure-guided comparison of these linker proteins identified conserved transitions between them, and kinetic analysis paralleled these observations. Phylogenetic reconstruction of the linker set was consistent with these findings. Our results also suggest that contemporary TSF members may have evolved from a short 4-oxalocrotonate tautomerase–like ancestor followed by gene duplication and fusion. Our new linker-guided strategy can be used to enrich the discovery of sequence/structure/function transitions in other enzyme superfamilies.
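As a rough illustration of the network idea summarized above, the following minimal Python sketch builds a sequence similarity network by keeping only pairwise hits at or below an E-value threshold, then lists the edges that join two subgroups (candidate "linker" pairs). The function names, the toy node labels, the toy E-values, and the 1e-11 cutoff used here are all illustrative assumptions, not the authors' pipeline or data.

```python
from collections import defaultdict


def build_ssn(pairwise_evalues, threshold):
    """Return adjacency sets, keeping only pairs with E-value <= threshold."""
    graph = defaultdict(set)
    for (a, b), evalue in pairwise_evalues.items():
        if a != b and evalue <= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def linker_edges(graph, subgroup_of, group_x, group_y):
    """List edges that join a node in group_x to a node in group_y."""
    edges = set()
    for node, neighbours in graph.items():
        if subgroup_of.get(node) != group_x:
            continue
        for other in neighbours:
            if subgroup_of.get(other) == group_y:
                edges.add((node, other))
    return sorted(edges)


# Hypothetical toy data: five nodes, two subgroups, made-up E-values.
toy_evalues = {
    ("a1", "a2"): 1e-30,
    ("a2", "L1"): 1e-20,
    ("L1", "L2"): 1e-12,   # the only cross-subgroup pair below the cutoff
    ("L2", "c1"): 1e-25,
    ("a1", "c1"): 1e-3,    # too weak, so no edge is drawn
}
toy_subgroups = {
    "a1": "4-OT", "a2": "4-OT", "L1": "4-OT",
    "L2": "cis-CaaD", "c1": "cis-CaaD",
}

ssn = build_ssn(toy_evalues, threshold=1e-11)
links = linker_edges(ssn, toy_subgroups, "4-OT", "cis-CaaD")
print(links)
```

Tightening the threshold removes weak edges and eventually separates the subgroups; the pairs that survive longest across a subgroup boundary are the natural candidates for closer structural comparison.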


Table of contents
which then make up the active homohexamer (right). B. Schematic representation of the same three structures. C. The four distinct patterns observed for the experimentally characterized members of the TSF. 1. Homohexamer, which includes the 'short' tautomerases such as the founder 4-OT. 2. Heterohexamer, which includes the hh4-OT and CaaD. Each heterohexamer is composed of three α,β dimers, where the α- and β-subunits fold into a β-α-β-motif, but are not identical in sequence. 3. Trimer, which includes 'long' members of the cis-CaaD-, MIF-, MSAD- and CHMI-subgroups. Note that the monomers in each of these trimers are identical in sequence, but the two β-α-β-motifs within each monomer are not. 4. Homodimer, which includes YdcE, a 4-OT homolog from E. coli.

Figure S2. Mapping of the Level 1 cis-CaaD HMM to the Level 1 cis-CaaD subgroup. Details and color for this SSN are as in Figure 2 except that yellow designates each representative node that contains at least one sequence for which its cognate HMM hits this subgroup at an HMM hit score of 1e-20. Nodes that retain the original cyan color of the cis-CaaD subgroup were missed by the HMM trained on this subgroup. The large labeled nodes indicate representative nodes that link the 4-OT and cis-CaaD subgroups as described in the section "'Linkers' between cis-CaaD and 4-OT subgroup identify structural transitions between them." The representative node of founder 4-OT was colored dark blue for consistency with Figure 5, and was not matched by any subgroup HMM other than that generated for the 4-OT level 2 subgroup 1 (see Figure S7).

for this SSN are as in Figure 2 except that yellow designates each representative node that contains at least one sequence for which its cognate HMM hits this subgroup or other subgroups at an HMM hit score of 1e-14. Nodes that retain the original magenta color of the MSAD subgroup were missed by the HMM trained on this subgroup.
The large labeled nodes indicate representative nodes that link the 4-OT and cis-CaaD subgroups as described in the section "'Linkers' between cis-CaaD and 4-OT subgroup identify structural transitions between them." The representative node of founder 4-OT was colored dark blue for consistency with Figure 5, and was not matched by any subgroup HMM other than that generated for the 4-OT level 2 subgroup 1 (see Figure S7).

for this SSN are as in Figure 2 except that yellow designates each representative node that contains at least one sequence for which its cognate HMM hits this subgroup at an HMM hit score of 1e-10. Nodes that retain the original dark blue color of the CHMI subgroup (only the representative founder 4-OT node) were missed by the HMM trained on this subgroup. The large labeled nodes indicate representative nodes that link the 4-OT and cis-CaaD subgroups as described in the section "'Linkers' between cis-CaaD and 4-OT subgroup identify structural transitions between them." The representative node of founder 4-OT was colored dark blue for consistency with Figure 5, and was not matched by any subgroup HMM other than that generated for the 4-OT level 2 subgroup 1 (see Figure S7).

Figure 2 except that yellow designates each representative node that contains at least one sequence for which its cognate HMM hits this subgroup or other subgroups at an HMM hit score of 1e-11. Nodes that retain the original green color of the MIF subgroup were missed by the HMM trained on this subgroup. The large labeled nodes indicate representative nodes that link the 4-OT and cis-CaaD subgroups as described in the section "'Linkers' between cis-CaaD and 4-OT subgroup identify structural transitions between them." The representative node of founder 4-OT was colored dark blue for consistency with Figure 5, and was not matched by any subgroup HMM other than that generated for the 4-OT level 2 subgroup 1 (see Figure S7).

Figure S6.
90% sequence identity per node network of Level 2 subgroups of the Level 1 4-OT subgroup. 2,580 network nodes represent 4,530 4-OT-like proteins colored by level-2 subgroups. The threshold for drawing edges between representative nodes is 1e-18. Gray nodes designate nodes that have not been assigned to a level-2 subgroup. HMMs have been generated for subgroup 1 (red), subgroup 2 (green), subgroup 3 (magenta) and subgroup 4 (cyan). The founder 4-OT and the linker Fused-4-OT proteins belong to the Level 2 subgroup 1, as indicated by the labeled large nodes.

Figure S7. HMM mapping of the Level 2 subgroup 1 to the Level 1 4-OT subgroup. Details and color as in Figure S6 except that yellow designates each representative node that contains at least one sequence for which its cognate HMM hits this subgroup at an HMM hit score of 1e-24. Nodes that retain the original red color of Level 2 subgroup 1 were missed by the HMM trained on this subgroup. Yellow nodes that are not part of this subgroup are also matched by this HMM at the HMM hit score.

Figure S8. HMM mapping of the Level 2 subgroup 2 to the Level 1 4-OT subgroup. Details and color as in Figure S6 except that yellow designates each representative node that contains at least one sequence for which its cognate HMM hits this subgroup with an HMM hit score of 1e-23. Nodes that retain the original green color of Level 2 subgroup 2 were missed by the HMM trained on this subgroup. Yellow nodes that are not part of this subgroup are also matched by this HMM at the HMM hit score.

Figure S9. HMM mapping of the Level 2 subgroup 3 to the Level 1 4-OT subgroup. Details and color as in Figure S6 except that yellow designates each representative node that contains at least one sequence for which its cognate HMM hits this subgroup with an HMM hit score of 1e-25. Nodes that retain the original magenta color of Level 2 subgroup 3 were missed by the HMM trained on this subgroup.
Yellow nodes that are not part of this subgroup are also matched by this HMM at the HMM hit score.

Figure S10. HMM mapping of the Level 2 subgroup 4 to the Level 1 4-OT subgroup. Details and color as in Figure S6 except that yellow designates each representative node that contains at least one sequence for which its cognate HMM hits this subgroup with an HMM hit score of 1e-22. Nodes that retain the original cyan color of Level 2 subgroup 4 were missed by the HMM trained on this subgroup. Yellow nodes that are not part of this subgroup are also matched by this HMM at the HMM hit score.

in sequence lengths per subgroup is apparent. Sequences between 58-84 residues in length, which fold into a single β-α-β-unit, largely belong to the 4-OT-subgroup. Sequences between 110-150 residues in length are found in the other subgroups and fold into two fused β-α-β-units.
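The two length populations described above can be expressed as a simple first-pass binning step. The sketch below is only an illustration: the function name and the labels "short", "long", and "unassigned" are assumptions made here, and the example lengths are invented. Length alone does not assign a subgroup, since the text says only that short sequences "largely" belong to the 4-OT subgroup.

```python
from collections import Counter


def classify_by_length(n_residues):
    """Bin a sequence by length using the ranges quoted in the caption."""
    if 58 <= n_residues <= 84:
        return "short"       # single beta-alpha-beta unit, largely 4-OT subgroup
    if 110 <= n_residues <= 150:
        return "long"        # two fused beta-alpha-beta units, other subgroups
    return "unassigned"      # outside both quoted ranges


# Hypothetical sequence lengths, for illustration only.
example_lengths = [62, 70, 84, 95, 117, 129, 150, 160]
tally = Counter(classify_by_length(n) for n in example_lengths)
print(tally)
```

Sequences that fall in the "unassigned" bin, or that sit in one bin while clustering with the other population, are the ones that would need inspection by alignment rather than by length.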

There is "crosstalk" between these two populations, most notably the Fused 4-OTs. These

The threshold for drawing edges is 1e-11. The positions of the linker nodes are highlighted as shown in Figure 5, along with the positions of two additional linker sequences, N1 and N2.
These latter nodes refer to the second pair of representative nodes besides Linker 1 and Linker 2 that link the 4-OT and cis-CaaD subgroups as shown in Figure 5. In this SSN, 48 edges link 31 nodes from the founder 4-OT subgroup to 17 nodes in the cis-CaaD subgroup.

Figure S14. MSA of sequences used to calculate the phylogenetic tree.

As it was surprising that multiple sequences lack an N-terminal proline and are represented globally across the superfamily, it was important to validate this result in detail. Initial bioinformatics identified 1,801 non-eukaryotic sequences without an N-terminal proline.

(Eukaryotic sequences were not examined due to the complex nature of their splicing maturation.) Each nucleotide sequence was examined manually and most were removed from the set due to misannotation of the position of the initiating Met or other technical issues (see below). Ultimately, 346 sequences were validated as missing an N-terminal proline.
An example of a properly annotated sequence missing Pro-1 is shown in A).

A. Top. Multiple sequence alignment of five TSF homologs for which Met-1 is not followed by a proline, which is a deviation from the signature feature of TSF members. However, these sequences clearly align very well.

E2PB85  MISVYGLKQTLADRRALIADVIFDCMQMSLGVPKQRHALRFDLLDAENFYPPINRSQDFI
W0R8J4  MISVYGLKHTLAPRRALIAEIIFSCLEVNLGIPKQRHALRFELLDDENFYLPVNRSDNFL
I3DIP6  MITVYGLKKSLAPYRKQIADAIFHCLHIGLGIPPRKHTLRFVGLEKEDFYLPINRTERFI
I3DR40  MITVFGLKSKLAPRREQLAEVIYNSLYLGLDIPKGKHAIRFLCLEKEDFYYPLDRSDDYT
A4N5E7  MITVFGLKSKLSPRREQLAEVIYNSLHLGLDIPKGKHAIRFLCLEKEDFYYPFDRSDDYT

Bottom. For all five sequences, their annotated start codons are associated with a Shine-Dalgarno-type sequence, indicating that the annotated start codon is likely the correct one. Indeed, no alternative start codons can be found nearby in which Met would be followed by Pro and where this alternative start codon is also associated with a Shine-Dalgarno-type sequence.

E2PB85  GGTGGAATGGTTATTAGGAGAAAAAAATGATTAGTGTATATGGATTAAAACAAACCTTAG
W0R8J4  AGTTCAATGGCTATTGGAGGCAAAAAATGATTAGTGTTTACGGATTAAAACACACACTTG
I3DIP6  TGGCTATTAGCTCGTAAGGAGACATTATGATTACCGTTTATGGTTTAAAAAAATCACTTG
I3DR40  TTTAACCGCACTTTAAAGGAGAAAAAATGATCACCGTATTCGGACTTAAATCCAAACTCG
A4N5E7  TTTAACCGCACTTTAAAGGAGAAAAAATGATTACTGTATTCGGACTTAAATCCAAACTTT
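The manual check described in panel A can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the procedure the authors used: it takes the five nucleotide windows and the first twelve residues of each protein from panel A, finds the ATG whose in-frame translation reproduces the annotated N-terminus, and looks for a GGAG core upstream of it. Treating "GGAG" within 20 nt as a Shine-Dalgarno-type signal is a simplification chosen here, and all function names are hypothetical.

```python
# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO))


def translate(dna):
    """Translate complete codons from position 0 until a stop codon."""
    residues = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        residues.append(aa)
    return "".join(residues)


def find_annotated_start(dna, nterm, min_match=5):
    """Index of the ATG whose translation matches the annotated N-terminus."""
    pos = dna.find("ATG")
    while pos != -1:
        peptide = translate(dna[pos:])
        if len(peptide) >= min_match and nterm.startswith(peptide[:len(nterm)]):
            return pos
        pos = dna.find("ATG", pos + 1)
    return None


def sd_spacer(dna, start, motif="GGAG", window=20):
    """Nucleotides between the nearest upstream motif and the start codon."""
    upstream = dna[max(0, start - window):start]
    hit = upstream.rfind(motif)
    return None if hit == -1 else len(upstream) - (hit + len(motif))


# Accession, nucleotide window, and first 12 residues, copied from panel A.
RECORDS = [
    ("E2PB85", "GGTGGAATGGTTATTAGGAGAAAAAAATGATTAGTGTATATGGATTAAAACAAACCTTAG", "MISVYGLKQTLA"),
    ("W0R8J4", "AGTTCAATGGCTATTGGAGGCAAAAAATGATTAGTGTTTACGGATTAAAACACACACTTG", "MISVYGLKHTLA"),
    ("I3DIP6", "TGGCTATTAGCTCGTAAGGAGACATTATGATTACCGTTTATGGTTTAAAAAAATCACTTG", "MITVYGLKKSLA"),
    ("I3DR40", "TTTAACCGCACTTTAAAGGAGAAAAAATGATCACCGTATTCGGACTTAAATCCAAACTCG", "MITVFGLKSKLA"),
    ("A4N5E7", "TTTAACCGCACTTTAAAGGAGAAAAAATGATTACTGTATTCGGACTTAAATCCAAACTTT", "MITVFGLKSKLS"),
]

report = {}
for accession, dna, nterm in RECORDS:
    start = find_annotated_start(dna, nterm)
    report[accession] = {
        "start": start,
        "lacks_pro_after_met": nterm[1] != "P",
        "sd_spacer": None if start is None else sd_spacer(dna, start),
    }

for accession, result in report.items():
    print(accession, result)
```

A real screen would also need to consider alternative start codons such as GTG and TTG and a longer upstream region, which this sketch does not attempt.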
Examples of two misannotated sequences missing an N-terminal Pro are shown in B), and the same examples in which the misannotation has been corrected by identification of the true Met-1 site are shown in C). That Met-1 is followed by a proline is additional evidence that the start codons are now correctly annotated.

File S1. PDB codes of structures used in the structure similarity network provided in Figure 8.

Notes: Structures used in Figure 8 and in the MSA for Figure 5 differ for 4-OT and cis-CaaD.
For 4-OT, PDB 1BJP is used in the MSA for Figure 5 and PDB 4OTA is used in Figure 8. For cis-CaaD, PDB 2FLZ is used in the MSA for Figure 5 and PDBs 3EJ3 and 3MB2 are used in Figure 8. Two chains each from cis-CaaD structures 3EJ3 and 3MB2 were used in Figure 8 as these proteins are organized as part of physiological heterohexamers in which the B and C chains are different.