Oxidative opening of the aromatic ring: Tracing the natural history of a large superfamily of dioxygenase domains and their relatives

A diverse collection of enzymes comprising the protocatechuate dioxygenases (PCADs) has been characterized in several extradiol aromatic compound degradation pathways. Structural studies have shown a relationship between PCADs and the more broadly-distributed, functionally enigmatic Memo domain linked to several human diseases. To better understand the evolution of this PCAD–Memo protein superfamily, we explored their structural and functional determinants to establish a unified evolutionary framework, identifying 15 clearly-delineable families, including a previously-underappreciated diversity in five Memo clade families. We place the superfamily's origin within the greater radiation of the nucleoside phosphorylase/hydrolase-peptide/amidohydrolase fold prior to the last universal common ancestor of all extant organisms. In addition to identifying active-site residues across the superfamily, we describe three distinct, structurally-variable regions emanating from the core scaffold often housing conserved residues specific to individual families. These were predicted to contribute to the active-site pocket, potentially in substrate specificity and allosteric regulation. We also identified several previously-undescribed conserved genome contexts, providing insight into potentially novel substrates in PCAD clade families. We extend known conserved contextual associations for the Memo clade beyond previously-described associations with the AMMECR1 domain and a radical S-adenosylmethionine family domain. These observations point to two distinct yet potentially overlapping contexts wherein the elusive molecular function of the Memo domain could be finally resolved, thereby linking it to nucleotide base and aliphatic isoprenoid modification. In total, this report throws light on the functions of large swaths of the experimentally-uncharacterized PCAD–Memo families.

The aromatic ring is an exceptionally stable molecular arrangement that is integral to a wide range of biomolecules. Its unique properties are central to the functions of essential biopolymers, like proteins (side chains of specific amino acids) and nucleic acids (the bases), as well as low-molecular-weight metabolites and signaling messengers. The stability of aromatic rings relative to other organic molecules also poses paradoxical challenges to the cell: on the one hand, catalyzing reactions that allow their utilization, appropriate modification, and proper incorporation into macromolecular building blocks; on the other hand, preventing the accumulation of versions that can be toxic. As part of this cellular management of compounds with aromatic rings, diverse molecular mechanisms have evolved for their opening and catabolism.
Enzymes involved in the breakage of aromatic rings act via oxidative ring activation, resulting in cleavage at an intradiol bond (class I) or an extradiol bond (class II), or in some cases independently of a diol (class III) (Fig. 1A) (1,2). Extradiol bond-cleaving enzyme superfamilies catalyze ring breakage assisted by a coordinated metal ion and display at least three unrelated protein folds: 1) the vicinal oxygen chelate dioxygenase superfamily of the glyoxalase fold (type I extradiol dioxygenases); 2) the protocatechuate dioxygenase (PCAD) superfamily of the phosphorylase/peptidyl hydrolase fold (type II); and 3) the cupin family dioxygenases of the double-stranded β-helix fold (type III). Each of these extradiol dioxygenase "types" has been implicated in the catabolism of a wide range of aromatic ring-containing substrates in diverse pathways (1,3-5).
The type II extradiol dioxygenases of the PCAD superfamily share a common fold with catalytically unrelated clades of enzymes, namely the nucleosidase/nucleoside phosphorylases (PNP) and a diverse assemblage of peptidyl/amidohydrolases (6). The PCAD domains further show close structural and sequence affinities with the Memo family (7). The Memo protein was first identified as a mediator of ErbB2-induced cell motility in breast cancer cell lines (8). Subsequent studies have largely coalesced on the view that Memo acts as a general regulator of cell motility-related pathways with proposed involvement in actin reorganization (9), microtubule capture (10,11), vascular development (12), and tumor migration (13). Although initial research pointed to a primary functional role in nonenzymatic phosphopeptide binding (7), subsequent research established metal ion-binding for Memo and thus pointed to a potential enzymatic role (13). The limited data collected on Memo enzyme activity to date do not implicate Memo in aromatic ring cleavage reactions (7,13), instead implicating it in the reduction of molecular oxygen and generation of reactive oxygen species (13,14). However, independently of these experimental data, comparative genome analyses identified conserved gene-neighborhoods that are again consistent with an enzymatic role in the modification of nucleic acid bases or lipids (15-18).
Although the reaction mechanisms, substrates, and family diversity of the PCAD domains have been studied to varying degrees in the past (19-21), an understanding of their total evolutionary history is generally lacking. To reconcile these findings with the role of the poorly-understood Memo and to better understand the internal relationships and the provenance of the PCAD-Memo superfamily, we initiated a comprehensive comparative genomic analysis. In our analysis, we sought to specifically address certain lacunae in the current literature, including 1) phyletic distributions of the PCAD clades based on complete genome sequences; 2) superfamily-wide substrate specialization along with prediction of novel pathways; 3) the extent of the dispersal of aromatic degradation pathways across organisms; and 4) the origin and inter-family evolutionary relationships of the PCAD enzymes and their higher-order relationships with the Memo domain.
Through these analyses, we elucidated the evolutionary history of the unified superfamily, observing that the PCAD dioxygenases likely descended from the more ancient Memo-like clade, which in turn descends via a single circular permutation event of the protein fold from a PNP-peptidyl/amidohydrolase-like prototype with which they share a similarly located active-site pocket. This history establishes that the shift to a dedicated aromatic ring opening function likely happened only after the PCAD clade had diverged from the Memo-like clade. These observations open novel avenues of investigation into the precise molecular function of Memo, which remains poorly-understood despite increasing experimental evidence in the last decade linking it to various human diseases. Additionally, this analysis presents the first comprehensive comparative genomic account of the aromatic ring-opening dioxygenases of the PCAD superfamily, reporting a range of known and newly-predicted substrates as well as structural features influencing substrate recognition and likely processing efficiency. Finally, we report the discovery of a previously-unrecognized family in the PCAD superfamily present across the Actinobacteria, which, although likely an enzymatically inactive version of the superfamily domain, appears to function in the synthesis of a metabolite derived from a modified nucleotide.

Assembly of the sequence and structure complement of the PCAD-Memo superfamily and its classification
To collect members of the PCAD-Memo superfamily, we initiated iterative sensitive sequence-based similarity searches using known members of the PCAD superfamily as seeds (see under "Experimental procedures"). Searches were run until convergence; sequences retrieved with borderline statistical significance (e-values between 0.01 and 0.001) were used as seeds in new searches to confirm membership in the superfamily. As an example, a search initiated with a sequence from the AMMECR1 fusion family (accession number: WP_006301058.1) returns a 2-aminophenol 1,6-dioxygenase (APD) family member from Cupriavidus (accession number: AEI74584.1; e-value: 2e-26; iteration: 2), a Rv2728c-like family member from Mycobacterium sp. 141 (accession number: WP_019969303.1; e-value: 9e-22; iteration: 2), a classical Memo family member from Arabidopsis thaliana (accession number: NP_565590.1; e-value 5e-04; iteration: 2), and a YgiD family member from Magnaporthe oryzae (accession number: ELQ40968.1; e-value: 3e-04; iteration: 3).
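The triage step described above can be illustrated with a minimal sketch. It assumes only the borderline e-value window stated in the text (0.001 to 0.01); the function name `triage_hits`, the three category labels, and the treatment of the window boundaries as inclusive are illustrative assumptions and do not represent the actual search pipeline. The accessions and e-values are those reported in the example search.

```python
from typing import Dict, List

# Borderline window taken from the text; boundary handling is an assumption.
BORDERLINE_LOW = 0.001
BORDERLINE_HIGH = 0.01


def triage_hits(hits: Dict[str, float],
                low: float = BORDERLINE_LOW,
                high: float = BORDERLINE_HIGH) -> Dict[str, List[str]]:
    """Sort hits into confident, borderline (to be re-seeded), and rejected."""
    result: Dict[str, List[str]] = {"confident": [], "borderline": [], "rejected": []}
    for accession, evalue in hits.items():
        if evalue < low:
            result["confident"].append(accession)
        elif evalue <= high:
            # Borderline hits would seed a new search to confirm membership.
            result["borderline"].append(accession)
        else:
            result["rejected"].append(accession)
    return result


# Hits reported for the search seeded with WP_006301058.1.
reported_hits = {
    "AEI74584.1": 2e-26,      # APD family, Cupriavidus
    "WP_019969303.1": 9e-22,  # Rv2728c-like family, Mycobacterium sp. 141
    "NP_565590.1": 5e-04,     # classical Memo family, Arabidopsis thaliana
    "ELQ40968.1": 3e-04,      # YgiD family, Magnaporthe oryzae
}
triaged = triage_hits(reported_hits)
```

Under this window, all four reported hits fall below 0.001 and are therefore treated as confident rather than borderline.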
Similarly, we initiated structural similarity searches using known crystal structures as seeds (see under "Experimental procedures"). These searches retrieved increasingly distant members of the PCAD dioxygenase-Memo superfamily relative to the starting structure, followed by other structures belonging to the phosphorylase/peptidyl hydrolase fold, namely the purine nucleoside phosphorylases (PNP superfamily). For example, a structure-based homology search initiated with a crystal structure from the classical LigB family (PDB code 1B4U_B) first returns other PCAD structures (examples: 3VSH_D, z-score: 24.5; 2PW6_A, z-score: 19.2), then Memo family structures (example: 3BD0_A, z-score 15.3), and then PNP and peptidyl/amidohydrolase structures (example: 1TCV_A, z-score: 5.8). A search initiated with a Memo family member (PDB code 3BCZ_A) first detects other Memo structures and then detects various PCAD structures (examples: 5HEE_A, z-score: 19.0; 1B4U_B, z-score 15.3; 2PW6_A, z-score: 14.5), before finally detecting PNP and peptidyl/amidohydrolase structures (example: 1TCV_A, z-score 6.5).
The collected PCAD complement of 6619 sequences was subject to clustering analyses to gain insight into the distinct families composing the superfamily (see "Experimental procedures") (22). In total, clustering by BLASTCLUST identified 15 clearly-delineable families in the PCAD-Memo superfamily, which, in accordance with the structure search results, revealed a fundamental divide between the PCAD and Memo clades of the superfamily. Five families were found to belong to the Memo clade and nine clearly fell into the PCAD clade; the single remaining family fell into the PCAD clade in the sequence similarity network (Fig. 1, B and C) but did not strongly associate with either clade in the BLASTCLUST analysis. The relationships among these families and the sparseness of experimentally-characterized versions are shown in a sequence similarity network depicted in Fig. 1.
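The idea of grouping sequences into families by pairwise similarity can be sketched as single-linkage clustering over scored sequence pairs. This is an illustration of the general principle only, not a reimplementation of BLASTCLUST; the sequence identifiers, scores, and threshold below are hypothetical toy values, and `single_linkage_clusters` is a name introduced here for the example.

```python
from typing import Dict, Iterable, List, Tuple


def single_linkage_clusters(ids: Iterable[str],
                            scored_pairs: Iterable[Tuple[str, str, float]],
                            threshold: float) -> List[List[str]]:
    """Group ids so that any pair scoring >= threshold ends up in one cluster."""
    parent: Dict[str, str] = {i: i for i in ids}

    def find(x: str) -> str:
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters: Dict[str, List[str]] = {}
    for i in parent:
        clusters.setdefault(find(i), []).append(i)
    return sorted((sorted(members) for members in clusters.values()),
                  key=lambda members: members[0])


# Hypothetical toy input: five sequences with pairwise similarity scores.
toy_ids = ["seqA", "seqB", "seqC", "seqD", "seqE"]
toy_pairs = [
    ("seqA", "seqB", 0.9),
    ("seqB", "seqC", 0.7),
    ("seqC", "seqD", 0.2),  # below threshold: keeps the two groups apart
    ("seqD", "seqE", 0.8),
]
toy_clusters = single_linkage_clusters(toy_ids, toy_pairs, threshold=0.5)
```

Because linkage is transitive, seqA and seqC are placed in the same group through seqB even though they are never compared directly. This is one reason a family that bridges two clades weakly can associate differently in a clustering run than in a similarity network.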

Structure and sequence features of the PCAD-Memo superfamily
To provide context for a study of the natural history of the PCAD-Memo superfamily, we analyzed the above-collected complement of solved PCAD-Memo structures and retrieved sequences to define the essential core of the domain and to catalog the complete range of structure and sequence variations. Table 1 presents the summary of the conserved and variable features identified across the superfamily, and these are further discussed below.

Structural core of the superfamily
The core fold of the PCAD-Memo superfamily is an α/β three-layered sandwich with eight conserved strands forming a central β-sheet. There are seven conserved helices that sandwich the central sheet with two in one layer and five in the opposite layer (Fig. 2A). Comparison of the solved structures alongside multiple sequence alignments constructed for the families currently lacking any structure revealed three loop regions that consistently exhibit structural variation, and we hereafter refer to them as variable regions 1-3 (VR1-3; see below, Fig. 2A, and Table 1). Other regions occasionally displaying variability across individual families are the "dimerization" loop between strand 2 (S2) and strand 3 (S3), and helix 6 (H6), which can sometimes "break" into two distinct elements (referred to as H6a and H6b in Table 1).
Previous research on the PCAD-Memo superfamily had reported extensive sequence diversity between and within individual families (23). Our analysis suggests that despite this diversity, several key sequence positions are conserved across the superfamily (see alignment columns shaded in black with white lettering, denoted by an asterisk in Fig. 3, Table 1, and Fig. S1). In total, eight positions are well-conserved across all enzymatically-active families in the superfamily; mapping of these positions onto available structures indicates that these are closely associated with the conserved active-site pocket (Fig. 2 and Table 1). Three of these positions (Fig. 2A and Fig. 3), namely the histidines after S1 and S2 and the acidic residue or cysteine between H6 and H7, are residues coordinating the active-site metal ion (7, 24-26). The histidine downstream of S6 is farther away from the coordinated metal but close to the substrate, suggesting that it might contribute to aromatic substrate selectivity via π-π interactions (Figs. 2A and 3). Furthermore, the aromatic residue at the N terminus of H3 forms π-π interactions with both the above histidines coordinating the metal ion and may help maintain a delocalized π-cloud system for effective chelation (7, 24-26). The remaining conserved residues either help position the above residues or maintain the active-site conformation through hydrogen bonding. These observations also indicate that the PCAD-Memo superfamily falls entirely in the range of sequence conservation in the face of structural variation that has been described for several other families of enzymatic protein domains (27). The substrate is positioned within the active site via coordination interactions between the diol oxygens or a phenol oxygen and amino group nitrogen and the metal. This strongly suggests that the PCAD family is likely to be specific for substrates that contain an aromatic diol. The substrate is shielded from the solvent by elements of VR1 and sometimes VR3 or an additional subunit (e.g. LigA; see below) such that only the extradiol bond, which is cleaved, is exposed for attack by dioxygen. Beyond the above eight residues closely associated with the active site, certain other well-conserved charged or polar positions were identified. Notable among these is a nearly absolutely-conserved aspartate residue in H5 (see column shaded in gray in Fig. 3, denoted by a caret symbol). While not extending into the active site itself, it appears to provide a stabilizing backbone contact that positions the loop between S6 and H5 as a "wall" delimiting one side of the active-site pocket (Fig. 3).

Table 1. Capital letters in column headers denote residues conserved across the superfamily, provided in linear order of appearance relative to core structural elements.

Variable regions and their relation to PCAD-Memo function and regulation
The above-mentioned three structurally variable regions in the superfamily (Fig. 2) occur between S1 and H1 (VR1), S3 and H2 (VR2), and S6 and H5 (VR3) (Fig. 2A and Table 1). VR1 tends to be the shortest and often takes the form of a small helical segment. The most minimal version of this represents a unifying feature across families of the Memo clade (Table 1 and Fig. 2B). More complex versions occur in clades like the LigZ-like family and take the form of longer predicted α-helical inserts. VR2 tends to be structurally more diverse than VR1 (Table 1). It might range from a complete lack of an insert (the Rv2728c-like family), through relatively simple forms consisting of predicted extended loops (e.g. the PqqD-like fusion and LigZ families), to large inserts composed of multiple delineable secondary structure elements in several families. For instance, in the Memo family VR2 contains elements stacking with core elements (7); in the APD family it contains a β-α-β unit, and in the HpcB/HpaD-like family-1 it is predicted to contain a four-stranded β-meander; several other families are predicted to form various α-β structures (Table 1). Solved structures for the APD and PF0612 (Table 1) provide an indication of the likely spatial positioning of the more elaborate VR2 inserts: in both cases, the VR2 inserts form a partial fourth layer to the core three-layered sandwich, stacking outside of the "forward" helical layer (Fig. 2B). The final variable region, VR3, typically forms or is predicted to form elongated loops in the proximity of the active-site pocket, likely acting as a cap, which at least in some families could participate directly in structural occlusion of the pocket. In some families, VR3 is predicted to house more structured elements, such as an insert with two α-helices in a subset of the classical LigB family (see below, Table 1, and Figs. 2 and 3).
These variable regions and their family-specific diversity (Table 1) are of note primarily for the following. 1) Their spatial positioning relative to the active-site pocket reveals that they contribute noticeably to the variability of the walls of the pocket (Fig. 2B). 2) VR1 and VR2 often contain family-specific conserved residues (Table 1), as noted previously for certain families in the PCAD-Memo superfamily (23). Mapping of the family-specific residues onto solved structures indicates that these generally fall into one of three categories: 1) residues that perform structural roles relating to formation of the active-site pocket (e.g. the conservation of prolines in VR1); 2) residues that contribute directly or indirectly to the active site (e.g. conserved aspartate residue Asp-73 in solved structure PDB code 3VSI of the APD family (Table 1, supporting data)); or 3) residues that are not directly involved in the active site but appear suitably positioned for allosteric regulation, as described previously in some families (e.g. conserved polar residue, often a glutamate, in the classical LigB family) (23). Residues from the final category often contribute to clearly-definable ancillary pockets in proximity to the active-site pocket. Using the experimentally-defined allosteric activating pocket for the LigB family (23), we found that four of the five families with solved crystal structures contain comparably-situated pockets. This suggests that allosteric regulation through these ancillary pockets could be a broader functional principle of the PCAD-Memo superfamily (Fig. 2B).

Multimerization
Prior experiments have shown that some members of the classical LigB family function as homodimers (24,28). A range of other configurations have also been described: the gallate dioxygenase GalA, a member of the classical LigB family, functions as a homotrimer (29); the APD family forms a heterodimer between active and inactive paralogs (30); and the YgiD-like family has thus far been characterized as a functional monomer in bacteria (31) and plants (32). The significance of potential multimerization in the Memo clade remains unexplored. This variety in multimerization, including potentially differing modes even within the same family, could point to a role in fine-tuning substrate specificity and/or potentially influencing allosteric regulatory mechanisms in PCAD-Memo proteins.
The currently available structural data for the PCAD-Memo superfamily suggest a tendency for dimerization to occur end-to-end at the S4 edge of the core β-sheet, with the sheet in one monomer inverted in polarity relative to the opposing monomer sheet. The core β-sheets do not stack directly with each other; instead, the edge of one core sheet stacks with the S2-S3 loop (the "dimerization" loop noted above) and vice versa (Fig. 2A), comparable with what is seen in the classical LigB family. In the solved structure of the PF0612, this loop adopts a small β-strand conformation, which stacks alongside an extended S4 strand from the opposite monomer. This strand-stacking results in the formation of a barrel-like structure at the dimer interface. A third and considerably more elaborate dimerization mode is observed in the APD family, wherein the dimerization loop is captured by a "brace" of the opposing monomer.
Secondary structure prediction and sequence alignment of individual families without solved structures suggest that the dimerization modes are likely to more closely approximate PF0612 and classical LigB dimerization rather than some of the more aberrant modes listed above. Further experimental determination of the dimerization-pattern-space for the superfamily has the potential to throw light on any possible relationship between multimerization and substrate specificity/recognition, regulation of enzymatic activity, and the possible role of catalytic inactivation of one of the subunits in the heterodimeric forms.

Classification of the PCAD-Memo superfamily
Reconstruction of the evolutionary history of protein domains allows the identification of shared and derived sequence and structure characters contributing to the functional diversification of families. This assists with the prediction of function in poorly-understood families within a superfamily. To this end, the families of the PCAD-Memo superfamily identified through collection and clustering were analyzed to establish a comprehensive natural history for the superfamily; these results are summarized in Table 1 and the supporting data. Additionally, internal relationships within individual families, where relevant, were established using conventional phylogenetic analyses (see "Experimental procedures"), and phyletic profiles were constructed based on these and clustering results. Furthermore, contextual information in the form of domain architectures and conserved gene neighborhoods was cataloged across the superfamily (see "Experimental procedures"). At this point, higher-order relationships between individual families were determined through comparison of shared structural features and other synapomorphies (shared derived characters). These relationships were then placed within the framework of their identified phyletic patterns to infer a relative temporal evolutionary scenario for the diversification of the PCAD-Memo superfamily. This is depicted as a temporal diagram (Fig. 4).
Our analyses confirm a fundamental divide in the superfamily between families of the "Memo-like" clade and those of the "PCAD-like" clade (Fig. 4). The Memo family itself, given its status as the only lineage conserved across all three Superkingdoms of Life, is the probable founder of the superfamily that diverged from the PNP-peptidyl/amidohydrolase clades, which share the same fold, sometime prior to the emergence of the Last Universal Common Ancestor (LUCA) (see below for further discussion). Barring the YgiD-like family, the other PCAD clade families are predominantly found in bacteria with only occasional occurrences outside, which can be attributed to secondary horizontal transfers ( Fig. 4 and supporting data). Hence, we infer that the PCAD clade likely branched from the Memo family in bacteria after the time of the bacterial/archaeal split. We discuss further details of the classification below along with the conserved aspects of genome contextual information. Such information has previously been observed to be an effective means of predicting function of poorly-understood protein domain families (33,34). However, as has been previously noted (20,35), substrate promiscuity in the superfamily often complicates functional assignment/assessment at the family level. Particularly in the PCAD clade, functional partners are not always directly linked to each other on the genome (20,21), precluding functional assignment by using conserved gene neighborhood analysis alone. However, several examples of previously-unrecognized associations point toward potential novel roles for several members of the superfamily.

PCAD clade
Conserved gene neighborhoods of interest for this clade are presented in Fig. 5, with known and predicted aromatic ring-opening reactions catalyzed by them provided in Fig. 6 for reference.

Classical LigB family
This family is extensively represented in α-, β-, and γ-proteobacteria and the actinobacteria, with only scattered representation in other prokaryotes (supporting data). Several of the currently biochemically characterized PCAD clade enzymes are members of this family (20,21,28,35-38). These studies demonstrate their remarkable substrate diversity and promiscuity, sometimes through directly assaying catalytic efficiencies for enzymes against substrate panels, and suggest that the family is central to the degradation network of protocatechuate (PCA) and other phenolic compounds, particularly in the catabolism of lignins. The key to understanding the functional diversification of this family is a broad structural division that we identify for the first time in this study: the family can be divided into two subfamilies based on the presence or absence of a gene coding for LigA in the genome context. Versions lacking the LigA association, without exception, contain a large insert, which is unprecedented across the superfamily at the VR3 site (Figs. 2A, 3, and 5A, Table 1, and supporting data). Hereafter, we distinguish these subfamilies as the LigA-associating and insert-containing subfamilies.
The LigA protein is an all α-helical domain that forms a higher-order phylogenetic assemblage with several other helical domains, previously thought to have no relatives, including the Frankia-peptide domain (39). LigA itself is only observed in contexts with the classical LigB family. Previously solved structures have captured the LigA-LigB complex, which shows the LigA domain forming a helical cap-like structure over the LigB active site. In other enzyme superfamilies, such as the haloacid dehalogenase (HAD) superfamily of phosphoesterases (40), the cap modules help exclude water from the active site and "seal the reaction chamber" during catalysis, and LigA might play a comparable role for LigB. LigA can be fused directly to the PCAD domain of LigB at either the N or C terminus or can be encoded by the gene immediately upstream or downstream of the LigB-encoding gene in a conserved gene neighborhood (Fig. 5B and supporting data).
The VR3 insert of the insert-containing subfamily (Fig. 5A) is predicted to form at least two α-helices and occupy a spatially equivalent location to the LigA domain directly above the PCAD active-site pocket. The mutual exclusivity of the insert and LigA along with the predicted similarity in their positioning strongly suggests that the insert acts as a functional equivalent to the LigA domain, potentially occluding the active site in the course of the reaction. Given the respective phyletic distributions of these two subfamilies, it appears likely that the insert-containing clade emerged from a LigA-associating precursor through acquisition of the insert followed by loss of LigA (supporting data).
We also observed a clear divide in the conserved gene-neighborhoods between the two subfamilies. Certain contexts for the LigA-associating subfamily have been reviewed previously (21) and are consistently observed as associating with genes encoding enzymes involved in PCA degradation, i.e. the 4,5-cleavage of PCA leading to oxaloacetate and pyruvate production (Fig. 6A). In contrast, the predominant genome contexts for the insert-containing subfamily point to a role in the degradation of trans-cinnamate (derived from benzoate) with all enzymes required to convert trans-cinnamate to acetyl-CoA being present in the neighborhood (see below for further discussion on substrate promiscuity in the classical LigB family, Figs. 5, C-J, and 6, and supporting data). This subfamily includes the well-studied MhpB and HppB enzymes (Fig. 6B) (38,41-43), the former of which has been experimentally implicated in promiscuity by catalytic efficiency studies (38).

LigZ-like family
The LigZ-like family contains two previously-studied members, the titular LigZ and DesZ, both of which have been studied in distinct branches of the lignin degradation pathways (28, 44-46). Consistent with functions in lignin degradation, the LigZ family shares several sequence and structural features with the classical LigB family (Fig. 3 and Table 1). However, the LigZ family is distinguished from the other families of the superfamily by the presence of a predicted, likely α-helical insert with a highly-conserved tryptophan in VR1 (Table 1 and supporting data).

HpcB/HpaD-like families
Two distinct families in the PCAD clade have been experimentally characterized as 3,4-dihydroxyphenylacetate 2,3-dioxygenases, which catalyze the degradation of 3,4-dihydroxyphenylacetate (also known as homoprotocatechuate) to 2-hydroxy-5-carboxymethylmuconate semialdehyde (Fig. 6D). The two families share several structural and sequence features and accordingly group together in clustering analysis, arguing for a higher-order relationship between the two families, including a predicted β-strand-rich region containing at least one shared sequence motif in VR2 (Table 1, Figs. 1 and 4, and supporting data). Most members of both families are part of tyrosine/tyramine/dopamine degradation pathways (48-51) and are almost always encoded as part of the 3,4-dihydroxyphenylacetic acid catabolon (Fig. 5N), which includes enzymes needed for processing of tyramine and related substrates to succinate (49). Consistent with previous reports, we observed that this regulon is found in several discrete units on the genome, and the composition of these units can vary between different species (supporting data). In a subset of firmicutes, the HpcB/HpaD-like family members, previously characterized in Paenibacillus sp. JJ-1b as a protocatechuate 2,3-dioxygenase (2,3-PCD), appear in a distinct 4-hydroxybenzoate degradation pathway context (Figs. 5O and 6A) (52,53). This acquisition of a novel substrate again points to a certain degree of substrate promiscuity as noted for the above families, also supported by previous turnover number studies on 2,3-PCD (see further discussion below) (53).

APD family
This well-studied family further illustrates the versatility of the PCAD superfamily in utilizing diverse aromatic substrates; it contributes to the degradation of both 2-aminophenol (Fig. 6E, 1,6-ApDO) and 2-amino-5-chlorophenol (Fig. 6F, CnbCab) (30, 54-57). Although only sporadically observed in prokaryotes, the family is found across a diverse phyletic range, including several proteobacteria, actinobacteria, and on rare occasions in euryarchaeota. The family underwent an early duplication, spawning a closely-related yet inactive version that is tightly-coupled to the active version on the genome. The two then form a functional heterodimer with an unusual mode of dimerization (see above, Fig. 3, and supporting data). Phyletic pattern analysis suggests the gene pair has been disseminated via HGT, typically along with other downstream members of the nitrobenzene degradation pathway including 2-aminomuconic semialdehyde dehydrogenase and 2-aminomuconate deaminase (supporting data).

p-Cumate-degrading family
This family is restricted to a subset of actinobacteria and the deltaproteobacterium Candidatus Entotheonella. Although biochemically uncharacterized, its genomic contexts suggest that it is part of a pathway that catabolizes p-cumate to pyruvate (Fig. 5P and supporting data). A related pathway has been experimentally characterized in other organisms including Bacillus and Pseudomonas. In these organisms, the pathway proceeds through the unrelated glyoxalase fold extradiol dioxygenase (58), which appears to be the functional analog of this family, suggesting potential analogous gene displacement between these families in the p-cumate utilization pathway (Fig. 6M).

YgiD-like family
This family is prototyped by the Escherichia coli YgiD protein and is the sole PCAD family to have widespread membership outside of the bacteria, with representatives found in eukaryotes such as fungi, plants, amoebozoans, kinetoplastids, certain chromalveolates, and a few basal eukaryotes (Fig. 4 and supporting data). In addition, a sporadic presence is observed in Archaea, which similarly crosses diverse lineages, with at least one monophyletic clade containing representatives from Euryarchaea, Crenarchaea, and Asgardarchaea (supporting data).
Some eukaryotic YgiD-like sequences have been experimentally-characterized as 3,4-dihydroxy-L-phenylalanine (L-DOPA) dioxygenases (Fig. 6G) functioning in betalain biosynthesis pathways, for example in plants (32, 59, 60) and fungi (61,62). Betalains are water-soluble, indole-derived pigments replacing anthocyanins in a limited set of plants, mostly of the order Caryophyllales and some fungi. Very recently, the E. coli YgiD was shown to exhibit DOPA dioxygenase activity in vitro (31). Given the limited distribution of betalain pigments and the much broader distribution of the YgiD-like family, it appears likely that betalain production is a specialized function of YgiD-like domains that is probably restricted to certain terminal eukaryotic lineages. The remaining members of the family, like the E. coli enzyme, could therefore be involved in channeling oxidized phenylalanine/tyrosine derivatives into other pathways or potentially be involved in the ring-opening of distinct substrates (see further discussion below and Fig. 5, Q-T). The lone YgiD crystal structure from E. coli (the Southeast Collaboratory for Structural Genomics (63)) presents a rather unusual active-site configuration for YgiD: the histidine at the N terminus of H3 is drawn away from its usual position to coordinate a second Zn2+ ion; similarly, the histidine downstream of S6 and the glutamate between H6 and H7 are drawn away from their usual positions to coordinate a third Zn2+ ion. The three Zn2+ ions define a 120°-30°-30° triangle, and accordingly, the active-site pocket is unusually wide and "open" on the protein surface when compared with structures of other representatives in the superfamily (Fig. 2B). This feature suggests specialization to accommodate a larger substrate like L-DOPA with its carboxylate group coordinating with one of the additional Zn2+ ions.
Furthermore, this active site could also favor the spontaneous cyclization of the oxidation product 4,5-seco-Dopa to betalamic acid that is observed downstream of the YgiD-catalyzed reaction (Fig. 6G). Alternatively, this active site might facilitate some substrate promiscuity.

PF0612 family
This family of PCAD domains, despite the recent publication of a solved structure (26), remains functionally enigmatic. No conserved genome context information was detectable for the family. It appears to have been transferred to certain archaeal lineages (Fig. 4 and supporting data).

AMMECR1 fusion family
This uncharacterized family is predominantly observed in the firmicutes; however, it appears to have been sporadically disseminated via HGT to several additional lineages, including proteobacteria, actinobacteria, synergistes, spirochaetes, and verrucomicrobia. This family is directly fused to a C-terminal AMMECR1 domain (15,64) or in some cases contextually linked to the same in a conserved gene-neighborhood. These are further associated in a conserved gene-neighborhood with a radical SAM enzyme fused to a C-terminal zinc-ribbon domain (Fig. 5U and supporting data). Other associations reported for the AMMECR1 domain, coupled with previous experimental work on diverse radical SAM domain families, led to the prediction that this conserved gene neighborhood was likely catalyzing a nucleotide or nucleotide-base modification reaction (15,16).
Strikingly, an association with orthologous radical SAM and AMMECR1 genes also occurs in the more ancient Memo clade of the superfamily (Fig. 4; see below), but Memo protein orthologs are largely absent in firmicutes. The mutual exclusivity of their respective phyletic patterns suggests this family of the PCAD clade displaced the Memo domain in this conserved three-gene island at or near the base of the firmicutes lineage (Fig. S2). Several other features are also shared between the Memo clade and the AMMECR1 fusion family, including the notable conservation of a cysteine in the loop between H6 and H7 in place of the acidic residue seen in the other families (Fig. 3 and Table 1). These potentially represent late, atavistic re-acquisition of Memo-like features in this family of the PCAD clade (Fig. 4). This observation carries several evolutionary and functional implications discussed below.

Memo clade
Contextual information in the form of conserved gene neighborhoods and domain architecture for this clade is provided in Fig. 7.

Classical Memo family
The classical Memo family is the only one in the PCAD-Memo superfamily clearly traceable to the LUCA. It is found across most major lineages of life, including a monophyletic branch of the family in the archaea that spans all known archaeal lineages. Against this backdrop, it is notable that certain terminal branches of life appear to have either lost the Memo clade or possess an alternative PCAD clade enzyme that appears to have displaced Memo (see the AMMECR1 fusion family above, and Fig. S2).
In prokaryotes, classical Memo associates broadly with the AMMECR1 domain either as a neighboring gene or in a direct fusion in the same polypeptide. Furthermore, these are also linked to a radical SAM family domain, which is fused to a C-terminal zinc-ribbon domain and a distinct N-terminal cysteine-rich domain (Figs. 5U and 7A). This three-gene neighborhood is broadly conserved across bacteria and some euryarchaea and thaumarchaea (supporting data). Although the core three-gene configuration is the overwhelmingly dominant association for the three components, each gene also has unique associations, wherein it may combine with one of the remaining two genes or occur in entirely independent contexts (Fig. 7, C-M, and see below). However, even in these cases, the three core components show a striking shared phyletic pattern, i.e. co-presence in the same genome (Fig. S2), suggesting that even when not linked, the three core components are likely to act in a certain common pathway in the cell (see below).
The FUNCOUP system (65) predicts a functional association between Memo and AMMECR1 in eukaryotes, predominantly stemming from shared mRNA co-expression patterns and high-throughput protein-protein interaction data along with support from shared transcription factor-binding sites across several genomes. We observe further that the radical SAM family enzyme of the three-gene prokaryotic island is present in some eukaryotes, including the diplomonad Giardia and the parabasalids Trichomonas and Tritrichomonas, as well as in Entamoeba and Blastocystis. However, the remaining eukaryotes have lost this enzyme (supporting data). We predict these detected eukaryotic radical SAM domain orthologs of the prokaryotic versions above are likely to interact with AMMECR1 and Memo in the limited set of eukaryotes that possess them. It is possible that the three-gene apparatus was inherited as a single unit in the last eukaryotic common ancestor with the radical SAM component displaced by an as-yet undetermined gene relatively early in eukaryote evolution. In this light, it is also of interest to note that in many, but not all, eukaryotes the AMMECR1 domain is fused to a novel N-terminal cysteine-rich domain that does not exist outside of this architectural context (Fig. 7E and supporting data). Its absence in the eukaryotes retaining the radical SAM family enzyme suggests a possible compensatory role for the domain.

Other families of the Memo clade
The remaining four families of the Memo clade, which remain uncharacterized in the current literature, share with the classical Memo family the conserved cysteine between H6 and H7 along with a minimal loop region in VR1 (Table 1). Their relatively restricted phyletic distributions (supporting data) indicate these are likely recent derivations from the ancestral classical Memo family. Two of these families, Dole_0299 and CT1369, have genome contextual associations that are minimally informative (supporting data).
The third of this set of families, the CapA/PgsAA domain fusion family, is characterized by a frequent direct C-terminal fusion to a calcineurin-like phosphoesterase domain specifically related to the version found in the CapA/PgsAA proteins (Fig. 7N and supporting data). The calcineurin-like phosphoesterases catalyze a broad range of activities, including nuclease, nucleotidase, phosphatase, sphingomyelin phosphodiesterase, and 2′,3′-cAMP phosphodiesterase activities (66,67). CapA/PgsAA has been previously observed as part of a conserved four-gene operon that synthesizes poly-γ-glutamate (PGA). As part of this operon, CapB/PgsB is a well-characterized peptide ligase that initiates and extends glutamate chains by activating them through the ATP-dependent formation of an acylphosphate (68,69). Although CapA/PgsAA has been linked to increasing turnover of the peptide-ligase reaction (69,70), its role is largely unclear. Given that it is a calcineurin phosphoesterase, we propose that it might function by hydrolysis of inappropriately phosphorylated carboxylates of glutamate during PGA chain elongation. Its association with the Memo clade is independent of the other genes involved in PGA synthesis. Across representatives of the recently-characterized radiation of novel bacteria (71), this family and the associated CapA/PgsAA phosphoesterase domain are also linked in a conserved neighborhood to bacterial α2-macroglobulin proteins (Fig. 7O and supporting data). The presence of signal peptides in representatives of this family as well as its diverse associating components suggests that together they form a periplasmic complex (see further discussion below, Fig. 7, N-P).
The fourth of these families is the PqqD-like fusion family, which features an N-terminal fusion to a PqqD-like winged helix-turn-helix (wHTH) domain (Fig. 7Q and supporting data). Although no other distinguishing structural, sequence, or genome contextual features were observed for the family, the PqqD-like domain is one member of a higher-order assemblage of peptide-binding wHTH domains (72,73). The PqqD proper domain was initially characterized in the binding of the PqqA peptide and its presentation to an enzyme cascade that generates the bacterial cofactor pyrroloquinoline quinone (PQQ) (74-76). This family of Memo domains could therefore act on an unknown peptide bound by the fused PqqD-like domain.
Although there is an as-yet unknown dioxygenase in the PQQ biosynthesis pathway (77), there is no clear contextual linkage between this version of the Memo domain and other enzymes for PQQ biosynthesis to suggest that this is the missing dioxygenase.

Rv2728c-like family
Members of this newly-identified and uncharacterized actinobacteria-specific family lack several of the standard active-site residues, including those involved in metal coordination (Fig. 3, Table 1, and supporting data), suggesting they are inactive. They are typically observed in conserved neighborhoods with the MiaA adenylate isopentenyltransferase (78), the MiaB isopentenyl adenylate thiomethyltransferase, a diaminopimelate epimerase-like protein, and occasionally a distal association with a GTPase of the HflX family (Fig. 7R). Although this version of the PCAD-Memo domain is likely incapable of coordinating a metal ion and therefore catalyzing the dioxygenase reaction, the presence of some conserved residues in the active-site pocket (Fig. 3 and Table 1) suggests it could still be involved in binding/recognition of a ligand, a trend commonly observed in inactive members of several enzymatic families (79).
The other components encoded by this conserved gene neighborhood point to a function for this family in the synthesis or modification of an adenine nucleotide via the action of the MiaA- and MiaB-like enzymes. It is possible that the Rv2728c-like family domain plays a role in binding substrates for this reaction. The occasional association with HflX (Fig. 7R and supporting data), recently characterized as a 100S ribosome dissociation factor (80), could suggest a link to modification of rRNA bases at the ribosome. Alternatively, we cannot rule out that this pathway could produce an actinobacterial secondary metabolite via modification of a nucleotide-derived, cytokinin-like molecule with an aromatic ring modification, potentially in conjunction with the action of a SLOG base-releasing enzyme (81,82).

Functional analysis of the PCAD-MEMO superfamily
Certain enzyme superfamilies explore substrate space extensively, catalyzing essentially similar reactions on diverse substrates. Other enzyme superfamilies explore reaction space, catalyzing a diverse set of reactions on either an essentially similar set or a diverse set of substrates (83). The PCAD clade of the superfamily under consideration displays the former situation. However, shifts in reaction mechanism and substrate specificity might have happened earlier in the course of their divergence from the Memo clade and even earlier from the catalytically distinct PNP and peptidyl/amidohydrolase clades. We explore these aspects of the superfamily in greater detail below along with novel functional predictions inferred from genomic contexts.
Genome contexts for the LigA-associating subfamily, as mentioned above, primarily point to roles in terminal PCA degradation. This is supported by the associations with several other PCA degradation enzymes, including 4-carboxy-4-hydroxy-2-oxoadipate aldolase, 4-oxalomesaconate hydratase, 2-pyrone-4,6-dicarboxylate hydrolase, and 4-carboxy-2-hydroxymuconate-6-semialdehyde dehydrogenase (Fig. 5C). However, this subfamily occasionally associates instead with components of pathways upstream of this core PCA meta-degradation pathway. For example, it might be linked with vanillate O-demethylase and vanillate monooxygenase enzymes, which catalyze the conversion of vanillate to PCA (Fig. 5D and supporting data). This points either to a direct linkage between lignin and PCA meta-degradation or to this version of the classical LigB family being at the juncture of pathways processing multiple substrates. The LigA-associating subfamily is additionally found in carbazole degradation pathway contexts, associating with carbazole 1,9a-dioxygenase and 2-hydroxy-6-oxo-6-phenylhexa-2,4-dienoate hydrolase (Figs. 5E and 6J and supporting data).
Beyond these, we also discovered gene-neighborhood associations pointing to a role for some LigA-associating subfamily members in previously-unrecognized pathways: 1) in degradation of catechol through a possible muconate intermediate. The specific genome context we identify here contains a muconolactone δ-isomerase enzyme and a likely 3-oxoadipate-enollactonase of the α/β-hydrolase fold (Fig. 5F), suggesting direct conversion of catechol to muconate by the PCAD dioxygenase before action by the isomerase enzyme. In previously-studied pathways, this dioxygenase reaction is catalyzed by catechol 1,2-dioxygenase, an unrelated transthyretin fold-containing dioxygenase (91). This suggests a functional equivalence between the two dioxygenases (Figs. 5B and supporting data). 2) We also observe a conserved linkage between primarily actinobacterial members of this subfamily and a citrate synthase-like enzyme, CoA transferase, and a Rossmann-like domain that is most closely related to crotonyl-CoA carboxylase/reductases (Fig. 5G). This could possibly represent an unexplored pathway that shunts an aromatic ring substrate, possibly through conversion to oxaloacetate or an upstream molecule, into the citrate metabolism pathway (supporting data).
In the insert-containing subfamily of the classical LigB family, we occasionally observe variation in the standard gene neighborhoods. These variants suggest that the cis-3-(3-carboxyethenyl)-3,5-cyclohexadiene-1,2-diol intermediate is processed by the 3-phenylpropionate dioxygenase and 2,3-dihydroxy-2,3-dihydrophenylpropionate dehydrogenase enzymes in lieu of the 3-(3-hydroxyphenyl)propionate hydroxylase (mphA) enzyme found in the more commonly observed configuration (Fig. 5H and supporting data). The PhnC enzyme, originally characterized as acting on bicyclic, i.e. naphthalene diols (87) (Fig. 6J), is found in this variant context, suggesting that it might also utilize the same benzoate derivative substrates as other well-studied members of this subfamily.
Analysis of phyletic patterns suggests that the helical insert emerged later and potentially physically displaced the LigA subunit. This probably went hand-in-hand with the emergence of the generally separate catalytic specialization within the two subfamilies. However, we observed several striking "crossovers" wherein gene neighborhoods typical of one subfamily contained a representative from the other (Fig. 5, I-J, and supporting data). For example, several predominantly actinobacterial LigA-associating subfamily members are found in association with components of the trans-cinnamate degradation pathways typical of the insert-containing subfamily (Fig.  5I). Conversely, certain insert-containing subfamily members associate with components involved in the carbazole degradation pathway otherwise typical of the LigA-associating subfamily (Fig. 5J). This suggests that despite the structural specialization within the two clades, the classical LigB family scaffold supports substrate promiscuity (20,35,38) allowing for occasional functional crossovers to have occurred. This experimentally-supported promiscuity probably relates to these enzymes being selected as part of "first-responder" pathways against atypical aromatic metabolites or xenobiotics in the environment.

Predicting functions for the YgiD family
Beyond the role in betalain synthesis (see above) (92,93), the roles of the widespread YgiD family have remained unclear.
Our analysis uncovered multiple conserved genome contexts that suggest that the YgiD-like family likely catalyzes aromatic diol oxidation in multiple pathways (Fig. 5, Q-T). First, across a range of γ-proteobacteria YgiD is found adjacent to the YgiC protein. YgiC contains an ATP-grasp fold amidoligase domain related to GSH synthetase (39,94,95). Several additional predicted membrane-spanning channel proteins are often encoded in the genomic vicinity of this YgiC-YgiD gene-dyad (Fig. 5Q and supporting data). Second, across several proteobacteria as well as sporadically in the actinobacteria, firmicutes, and planctomycetes, YgiD is linked to a protein containing the DoxD/DoxX-like transmembrane domain and a transcription factor that combines the N-terminal HTH domain (96) and a C-terminal ligand-binding PBP domain (97). This genome context might additionally feature either an α/β-hydrolase fold protein or a GSH S-transferase (Fig. 5R and supporting data). Notably, the DoxD/DoxX-like domain appears to be displaced in several proteobacteria as well as in some Bacteroidetes by a YceI-like protein, a β-barrel-forming member of the lipocalin fold with an N-terminal signal peptide (Fig. 5S).
Common to these themes is the association of YgiD-like proteins with diverse membrane or secreted proteins, which indicates that they likely function proximal to the membrane. Such associations might be interpreted as distinct mechanisms for the utilization of available aromatic compounds. Association with YgiC can be interpreted as the coupling of the dioxygenase activity to the modification of a metabolite with an aromatic ring by an amidoligase reaction catalyzed by the associated ATP-grasp domain (39). In contrast, DoxD/DoxX proteins found in the second major theme have been previously observed as a subunit of the DoxA-like p450 monooxygenase in the quinol:oxygen oxidoreductase complex (cytochrome aa3). The DoxA-DoxD pairing further works in concert with the DoxC transmembrane (TM) protein and the DoxB Rieske-type dioxygenase in the thiosulfate:quinone oxidoreductase complex, coupling reduction of dioxygen with thiosulfate oxidation with a decyl ubiquinone as an electron acceptor in the archaeon Acidianus ambivalens (98,99). However, the systems with YgiD-like proteins differ from the above Dox systems in also being coupled to the α/β-hydrolase or GSH S-transferase (Fig. 5R). Several α/β-hydrolase fold proteins also catalyze a metal-independent dioxygenase reaction on oxoquinolines, and these could potentially catalyze a comparable reaction (100). Thus, the action of YgiD-like enzymes in these systems on aromatic diol-containing compounds could be coupled with either modification of the metabolite via glutathionylation or with further oxidation. This is consistent with the presence of a single-component HTH + PBP transcription factor, which suggests that the system is likely regulated by the direct sensing of such a metabolite by this protein (Fig. 5R). The YceI protein that displaces DoxD/DoxX in several gene neighborhoods (Fig. 5S) has been implicated in the trafficking of polyisoprenoid quinones, which are rife in bacteria as cofactors, most notably as part of respiratory electron transport at the membrane (101). Thus, the secreted YceI-like and DoxD/DoxX proteins might be functionally equivalent and help deliver polyisoprenoid quinones to the cell surface (Fig. 5, R-S, and supporting data), thereby coupling them to utilization of an aromatic compound oxidized by the YgiD-like family through a redox pathway.
In addition to these major associations, YgiD-like proteins are also observed in a few contextual themes that are much more restricted in their phyletic distribution. These themes are suggestive of the salvage/catabolic pathways typical of other PCAD families (supporting data). One such theme, exclusive to archaeal Sulfolobus species, combines the YgiD-like proteins with 3,4-dihydroxyphenylacetate 2,3-dioxygenase and 4-hydroxyphenylacetate 3-hydroxylase enzymes (Fig. 5A), pointing to a possible dioxygenase activity for these domains on a substrate like homoprotocatechuate.

Complex web of contextual connections characterizes Memo and its functional partners
Despite its widespread phyletic pattern and strong conservation, the Memo clade by far remains the most functionally obscure in the entire superfamily. Hence, we thoroughly analyzed the conserved gene-neighborhood and domain architecture associations for Memo and its two other functional partners AMMECR1 and the radical SAM enzyme beyond their linkage in the core three-gene system (see above and Figs. 5F and 7). We observed the following notable conserved associations for Memo.
1) Memo and AMMECR1, but not the radical SAM family domain, associate with a SPOUT domain methylase and a distinct radical SAM enzyme closely related to the MiaB family, often in conjunction with a tRNA gene (Fig. 7C and supporting data). This association is observed in the recently characterized radiation of bacteria termed the "dark matter" of bacterial phylogeny (71).
2) In euryarchaea, crenarchaea, and thaumarchaea, the Memo gene by itself is located adjacent to either the ribosomal superoperon or another operon coding for mevalonate kinase, isopentenyl-diphosphate δ-isomerase, isopentenyl phosphate kinase, and polyprenyl synthetase or between both these conserved operons. The latter operon's products are components of a recently-characterized, archaea-specific alternative pathway for isopentenyl diphosphate (IPP) production (Fig. 7D and supporting data) (102-104). Based on the latter association, it was proposed that the Memo domain could act as a phosphomevalonate decarboxylase, an enzyme missing from this alternative pathway but essential for IPP production (102,104). However, efforts to demonstrate this activity in Memo have not succeeded. Moreover, subsequent research has identified the decarboxylase for the alternative pathway in at least the euryarchaeon Haloferax volcanii (105). We identified a distinct, previously-uncharacterized member of the GHMP (Galacto-, Homoserine-, Mevalonate-, and Phosphomevalonate-) kinase fold in a subset of these neighborhoods from crenarchaea and predict these are the likely phosphomevalonate decarboxylases (106) at least in these organisms (supporting data).
3) Secreted versions of the Memo domain belonging to the CapA/PgsAA-domain fusion family show consistent associations with the so-called DUF3160 module (Fig. 7, N and P), which is also found in further genome associations with diverse membrane-associated domains, many of which are potentially involved in binding of peptidoglycan or lipid polysaccharides. In some spirochetes, an arginase domain fused to an AMMECR1 domain is further combined with the DUF3160 and Memo (Fig. 7P and supporting data). The DUF3160 module combines an N-terminal domain that features long α-helices with a small, largely β-strand C-terminal domain. The long N-terminal α-helices of DUF3160 are suggestive of direct binding to a hydrophobic molecule like an isoprenoid derivative.
AMMECR1 is a member of the RAGNYA fold and features an absolutely-conserved cysteine residue predicted to be required for its catalytic activity (15). Even when occurring independently of Memo, it displays striking parallels to gene-neighborhood associations of Memo.
1) The AMMECR1 gene by itself is linked with the gene coding for a polyprenyl synthetase in several distinct prokaryotic lineages. Although this polyprenyl synthetase is specifically related to the above-mentioned versions from archaea, it does not occur with any other components of the IPP production pathway (Fig. 7F and supporting data).
2) In several archaea, AMMECR1 is combined with a gene encoding a metal-dependent TIM barrel hydrolase specifically related to the enzymes involved in base metabolism such as dihydropyrimidinase, dihydroorotase, and allantoinase (Fig. 7G). Often, these gene-neighborhoods also feature a SPOUT methylase, a Mut7-C-type PIN domain nuclease, and a quinolinate/nicotinate phosphoribosyltransferase (Fig. 7G), with the latter enzyme associating with AMMECR1 independently of the TIM barrel hydrolase in euryarchaea (supporting data).
3) We had earlier identified a conserved genome association between AMMECR1 and the DNA-glycosylase fold NFACT module together with the Mut7-C PIN endoRNase domain across a range of archaea (Fig. 7H) (16).
4) We also observed an association between a signal peptide-carrying AMMECR1 protein and an α2-macroglobulin domain protein (Fig. 7B and supporting data) (107). This notably parallels the comparable associations of the secreted versions of Memo (Fig. 7I).
The radical SAM enzyme associated with Memo and AMMECR1 is most closely related to the pyruvate formate lyase-activating enzyme. This and the related radical SAM families have been implicated in catalyzing the modifications of (poly)peptides, co-factors, and RNA (108) by means of the reactive radical generated by the cleavage of SAM (109). Its associations include the following.
1) When present with its usual partners Memo and AMMECR1, it might additionally associate with a spermine synthase domain (Fig. 7B) often embedded between the three core members of the neighborhood (Fig. 7B and supporting data). Additionally, it often associates with the same spermine synthase domain in gene-neighborhoods independently of Memo or AMMECR1 (Fig. 7J, and supporting data). In several cases, it can be fused to the twin-arginine translocation (TAT) signal peptide, which is required for translocation of fully folded proteins or oligomeric complexes across the membrane (Fig. 7B). This suggests that the TAT-associated versions are likely to function in the periplasm or extracellularly.
2) The radical SAM enzyme might associate with an ApbE-like flavin transferase, the predicted metal-binding NusG-II domain, and a predicted multi-TM transporter protein. The ApbE and NusG-II domain proteins bear signal peptides suggestive of being secreted. In certain bacterial lineages, this neighborhood appears to further include AMMECR1 or the polyprenyl-diphosphate synthase (Fig. 7K).
3) The radical SAM enzyme might be fused to the C-terminal periplasmic-binding protein/chelatase fold domain (PBD). It further associates in the same predicted operon with an additional PBD protein with a signal peptide, and a Rossmann fold methylase (Fig. 7L). Other than to the PBD, the radical SAM domain can also be directly fused at the C terminus to an uncharacterized β-strand-rich domain or the above-mentioned N-terminal TAT signal peptide (Fig. 7M).

Deciphering the functions of Memo from its contextual connections
The phyletic patterns of Memo, AMMECR1, and the radical SAM family together with their operonic and domain fusion associations suggest that all three of them were likely present in the LUCA. Such a phyletic pattern is strongly indicative of their combined involvement in universal processes of fundamental importance to the cell. The totality of the available genome contextual and experimental evidence points in two general directions: 1) involvement in a nucleotide or base modification reaction, as suggested previously (16-18), or 2) a role in a modification involving an aliphatic isoprenoid derivative. It is also possible that both roles are valid and that they come together in the context of the modification of a base by an isoprenoid derivative. Support for both these roles is observed in the persistent but more sporadic associations of the three core components of the Memo system with the other genes that we report above. For instance, the AMMECR1 association with the NFACT module, which has been implicated in the clearance of jammed ribosomes in eukaryotes via C-terminal alanine and threonine tail formation (110,111), and the Mut7-C PIN domain RNase might support a role in RNA base modification coupled with endonucleolytic processing (16). This is echoed by the association of the Memo gene with the ribosomal superoperon (Fig. 7A). This is also consistent with the association we report with the SPOUT methylase and a MiaB-like radical SAM enzyme (Fig. 7A), both of which catalyze base modifications found in RNA (108,112,113). The association between AMMECR1 and the TIM barrel hydrolase, which belongs to a family involved in catalyzing base amidohydrolase reactions, again links it to base modification. In contrast, the persistent, independent, and combined connection of Memo, AMMECR1, and the radical SAM enzyme to polyprenyl synthetase enzymes is suggestive of modification of an isoprenoid or isoprenoid derivative (114,115).
For the more sporadically distributed paralogs of the three individual components, the associations point to a cell membrane or extracellular localization (Fig. 7, B-D, and supporting data). This observation could be consistent with the modification of membrane-linked isoprenoids (116,117) or prenylated cell-surface proteins (114).
In conclusion, we posit that Memo together with its two partners catalyzes a conserved modification of aliphatic isoprenoids. The structure of the Memo protein reveals an active site with at least two pockets suggesting that it might interact with two distinct substrates (Fig. 2B). The active-site residues of Memo are congruent to the PCAD dioxygenases but have a distinguishing cysteine in place of the acidic residue between H6 and H7 (which convergently re-emerged in the PCAD clade AMMECR1 fusion family). It is conceivable that the isoprenoid substrate is cross-linked to this cysteine, whereas Memo catalyzes a reaction involving it and a second substrate. This is likely followed by a more drastic rearrangement mediated by the radical SAM enzyme. AMMECR1 probably mediates a transferase reaction via its proposed catalytic cysteine. However, the diversity of observed contexts suggests that the same reactions could be catalyzed in multiple distinct contexts on distinct substrates. A final functional consideration with the potential to bridge the two central themes would entail the isoprenoid modification of a base, consistent with the recent identification of widespread geranylation of RNA in bacteria (118) by the SelU transferase. This modification is vital for the subsequent selenylation of tRNA (118 -121). The enzyme system described above could conceivably parallel SelU with the AMMECR1 conserved cysteine acting in a capacity similar to the conserved cysteine in the SelU rhodanese domain. As possible support for this proposal, we observe a degree of negative correlation in the phyletic profiles of the Memo-AMMECR1-radical SAM system and the SelU transferase (Fig. S2).
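The negative correlation between phyletic profiles mentioned above can be illustrated with a minimal sketch. It assumes each gene system is encoded as a binary presence/absence vector over the same ordered set of genomes and scores a pair of profiles with the phi coefficient (Pearson correlation for binary data). The statistic, the function name `phi_coefficient`, and the toy profiles are illustrative assumptions, not the procedure or data behind Fig. S2.

```python
from math import sqrt


def phi_coefficient(a, b):
    """Phi (Pearson) correlation between two binary presence/absence profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genomes")
    # 2x2 contingency counts over genomes
    n11 = sum(1 for x, y in zip(a, b) if x and y)          # both present
    n10 = sum(1 for x, y in zip(a, b) if x and not y)      # only a present
    n01 = sum(1 for x, y in zip(a, b) if not x and y)      # only b present
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)  # both absent
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0  # a profile is constant; correlation is undefined, report 0
    return (n11 * n00 - n10 * n01) / denom


# Hypothetical toy profiles over eight genomes (1 = present, 0 = absent).
memo_system = [1, 1, 1, 0, 0, 1, 0, 1]
ammecr1 = [1, 1, 1, 0, 0, 1, 0, 1]  # co-present with the Memo system
selu = [0, 0, 0, 1, 1, 0, 1, 0]     # complementary to the Memo system

print(phi_coefficient(memo_system, ammecr1))  # positive: shared phyletic pattern
print(phi_coefficient(memo_system, selu))     # negative: mutually exclusive pattern
```

Values near +1 correspond to co-presence in the same genomes, as described for the three core components, whereas values near -1 correspond to the kind of mutual exclusivity suggested for SelU. Real profiles would also need correction for phylogenetic non-independence among genomes, which this sketch ignores.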

Evolutionary themes in the PCAD-Memo superfamily
Origin and early evolution of the superfamily-Profile-profile sequence similarity searches return hits of borderline significance between distant superfamilies across the PNP-peptidyl/amidohydrolase fold (see Fig. 4, Table S1, and "Experimental procedures"). However, structural similarity searches emphatically unify all versions of this fold. These also reveal a conserved core topology and several synapomorphies, most notably a shared active-site pocket and conserved catalytic residues projecting into the active site from the region downstream of strand S1 (Fig. 2A and Table S1). Through these comparisons, we also infer that the PCAD-Memo superfamily first emerged from an ancestral PNP-peptidyl/amidohydrolase domain through a circular permutation event. Analysis of the PNP-peptidyl hydrolase fold proteins reveals that they had already considerably diversified by the time of the LUCA. Reconstruction of the evolutionary history of the fold suggests a possible ancestral function in tRNA-peptidyl hydrolase activity, given the early-branching, catalytically-comparable lineages found in both bacteria and archaea (Fig. 4). Multiple independent emergences of peptidase/amidohydrolase activity are inferred to have occurred at different points in the evolution of the fold, with two other functional shifts predicted to have occurred prior to the LUCA: 1) emergence of a nucleotide phosphorylase function in the PNP lineage and 2) emergence of the PCAD-Memo lineage, which may have initially retained a nucleotide-acting role before shifting to dioxygenase activity (Fig. 4). The emergence of diverse functions in this fold may coincide with structural rearrangements arising from circular permutation events and loss or gain of specific elements associated with the edges of the central β-sheet (Fig. 4 and Table S1). The propensity for the domain to dimerize may have played a role in stabilizing and fixing circular permutations that followed duplication events. Distinct circular permutations are seen in the zinc-dependent exopeptidase assemblage and in the PCAD-Memo superfamily (Fig. 4).
In structure and sequence, the versions closest to the PCAD-Memo superfamily are as follows: 1) the D-amino acid peptidyl hydrolase domains that hydrolyze D-aminoacyl tRNAs; 2) the PAC2 proteasomal chaperones; 3) PNP domains that catalyze the phosphorolytic or hydrolytic cleavage of the base from the sugar of a nucleoside at the 1′ position. These share a core topology with the PCAD-Memo superfamily, featuring the strand-helix element in the vicinity of VR2 (Figs. 2A and 4). Thus, the circularly permuted version of the fold found in the PCAD-Memo superfamily appears to be nested within a clearly-definable radiation in the PNP-peptidyl/amidohydrolase fold. Our analysis also showed that the Memo clade emerged prior to the LUCA. This finding therefore implies that the permutation resulting in the divergence of the PCAD-Memo superfamily from unpermuted versions of the PNP-peptidyl hydrolase fold happened prior to the LUCA (Fig. 4). Memo is linked via genome contexts to both base and aliphatic isoprenoid modification (Fig. 7, A-E).
Post-LUCA diversification and expansion of the superfamily-In bacteria, the ancient, widely-conserved Memo lineage underwent diversification yielding four distinct families (Fig. 4). Each of these four families was subsequently dispersed across bacteria via HGT. At least one of these late-branching families appears to have been recruited to a clear role in isoprenoid modification at the membrane or the cell surface, potentially as an offshoot of a more generalized role for the ancient Memo lineage in this capacity. Furthermore, the largely nonoverlapping phyletic patterns for these four families raise the possibility of them performing comparable or equivalent functions in the respective lineages that possess them.
After the Memo clade, our analysis suggests that the YgiD-like family is likely the next earliest-emerging extant family and the potential progenitor of the rest of the PCAD families (Fig. 4). The observed monophyly of a small yet diverse set of archaeal YgiD-like proteins raises the possibility of its presence in the LUCA. Although some basal eukaryotes contain YgiD clade members, their sporadic distribution across the rest of the superkingdom, coupled with a lack of observed monophyly of the eukaryotic versions with each other or with the archaeal versions in phylogenetic analyses, suggests a later origin for the clade in bacteria with one or more subsequent transfer(s) from bacteria to archaea and eukaryotes. Such a scenario could be consistent with their specialized role in L-DOPA utilization and betalain biosynthesis in plants and fungi (32, 59, 61). Nevertheless, the L-DOPA ring-opening activity of the YgiD clade indicates that the emergence of this clade probably marked the origin of the classic aromatic diol/amino-phenol oxidizing capacity in the PCAD-Memo superfamily. In conclusion, the links to base modification for the Memo clade hinted at in our analysis might mark the initial evolutionary transition from ancestral enzymes of the PNP-peptidyl/amidohydrolase fold, with potential roles in peptidyl hydrolysis during translation or removal of bases from nucleosides, to enzymes that might operate in base modification. This transition was marked by the emergence of the multiple conserved histidines in the active site. The isoprenoid modification and membrane/cell-surface association for the Memo clade might have provided the ancestral adaptations for the emergence of the YgiD-like clade to catalyze the oxidative aromatic ring-opening proximal to the membrane as suggested by the above-reported contextual associations (Fig. 5A).
The emergence of the PCAD clade was accompanied by structural and sequence divergence associated with the variable regions in the core scaffold (Fig. 2). Family-specific sequence conservation patterns in these variable regions might have contributed to substrate versatility and/or opened avenues for allosteric regulation (Table 1, see above) (23). This structural diversification might have in turn been driven by selective forces coming from the availability of new environmental compounds rich in polyphenolic rings such as lignin, which emerged on multiple occasions in the larger plant lineage (122). The ability to utilize such compounds was then broadly disseminated via HGT. An alternative pathway for the diversification of these enzymes was their recruitment to novel functional niches through displacement of existing dioxygenases in the utilization of certain metabolites, such as in the p-cumate degradation pathway (Figs. 5H and 6N). This colonization of multiple aromatic compound utilization pathways in many of these families manifests a certain versatility in terms of their substrates. This versatility takes two forms: 1) intra-family substrate promiscuity, where a single family or even a single member of a family can process multiple substrates, sometimes in multiple distinct steps in the same pathway, and 2) the ability of the distinct families across the clade at large to operate on a wide range of distinct phenolic compounds.
Despite the functional and structural sequence diversification observed in the PCAD clade, one later-emerging family in the clade, as well as potentially the Rv2728c-like family, is predicted to perform functions comparable with the Memo clade (Figs. 4, 5, and 7). This indicates that despite the diversification of the PCAD clade, attained through relaxation of some structural constraints, certain intrinsic features of active-site architecture predispose the domain toward participating in the ancestral Memo-like function.
Finally, even though the fold shared by the PCAD-Memo superfamily and the PNP-peptidyl/amidohydrolases emerged prior to the LUCA, the emergence and diversification of the PCAD clade proper happened much later, primarily in the bacteria. This temporal pattern of radiation is comparable with another major class of dioxygenases, namely the 2-oxoglutarate-dependent dioxygenases and the Jumonji-like domains, which emerged from within the more ancient double-stranded β-helix fold featuring several sugar-binding proteins and sugar isomerases (123). We had postulated that the widespread availability of molecular oxygen with the primary oxygenation event in Earth's history might have favored the radiation of those dioxygenases. The phyletic patterns that we report here favor a similar driver for the radiation of the PCAD clade concomitant with the emergence of active oxidative metabolism utilizing O2 (Fig. 4).

Conclusions
The results presented here provide the first complete evolutionary classification for the PCAD-Memo superfamily, resolving its origins and the relative temporal diversification of its major families. Despite increasing interest in members of the classical Memo family, particularly in light of its potential reported roles in cell motility and disease in human cells, its molecular function remains poorly understood. Herein, for the first time we propose unifying explanations for the uncharacterized Memo clade as well as the enigmatic members of the PCAD clade and expand the scope of the roles of representatives of the superfamily. We hope the predictions provided here regarding catalytic mechanisms and potential substrates will guide future experimental investigations.

Experimental procedures
Bona fide protein sequences of the PCAD-Memo superfamily were retrieved from GenBank at the National Center for Biotechnology Information (NCBI) (124), and structures were retrieved from the PDB (125). These were used to seed iterative PSI-BLAST (126) and JackHMMER (127) searches against the nonredundant (nr) database using an expected (e)-value threshold of 0.01. New sequences detected in these searches were subjected to reciprocal BLAST to confirm their inclusion in the superfamily. Profile-profile searches performed with the HHpred program (128) were also used to detect remote homologs, confirm membership of divergent members within the superfamily, and identify higher-order relationships between lineages in the PNP-peptidyl/amidohydrolase fold, searching against the PFAM (129) and PDB (125) databases. Structure similarity searches were performed using the DALI server (130). Structures were rendered, compared, and superimposed in the molecular visualization program PyMOL. Multiple sequence alignments were generated using the Kalign (131) and Muscle (132) programs with default parameters. These multiple sequence alignments were adjusted manually, guided by structure superimpositions, profile-profile alignments, and secondary structure prediction. The JPred (133) program was used to predict secondary structures.
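The reciprocal-search confirmation step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the reciprocal BLAST results have already been parsed into (subject, e-value) pairs per candidate, and it reuses the 0.01 threshold quoted for the forward searches as an assumed cutoff for the reciprocal step. The function name and all sequence identifiers are hypothetical.

```python
def confirm_by_reciprocal_search(candidate_hits, known_members, evalue_threshold=0.01):
    """Return candidates whose own searches recover a known superfamily member.

    candidate_hits: dict mapping a candidate ID to a list of
        (subject_id, evalue) pairs from a search seeded with that candidate.
    known_members: set of IDs already accepted as superfamily members.
    evalue_threshold: assumed cutoff; the source quotes 0.01 for the
        forward searches and does not give one for the reciprocal step.
    """
    confirmed = set()
    for candidate, hits in candidate_hits.items():
        for subject, evalue in hits:
            if subject in known_members and evalue <= evalue_threshold:
                confirmed.add(candidate)
                break  # one significant hit to a known member is enough
    return confirmed


# Hypothetical parsed results (not real data).
known = {"memo_1", "pcad_1"}
hits = {
    "candA": [("memo_1", 1e-20), ("unrelated_1", 1e-5)],
    "candB": [("unrelated_1", 1e-30)],
    "candC": [("pcad_1", 0.5)],
}
print(sorted(confirm_by_reciprocal_search(hits, known)))
```

In this toy case only `candA` is retained: `candB` never recovers a known member, and `candC` does so only with a non-significant e-value.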
An approximate maximum-likelihood method as implemented in the FastTree (134) program with default parameters was used to assess intra-family phylogenetic relationships. The FigTree program (http://tree.bio.ed.ac.uk/software/figtree/) was used to render phylogenetic trees. Co-occurring domain architectures and gene neighborhoods were retrieved through custom PERL scripts from the NCBI Genome database.
Clustering of protein sequences and the subsequent assignment of sequences to distinct families was performed through two methods. In the first method, the sequence similarity networks of PCAD-Memo domains were constructed using Pythoscape (135). Domains that are fused to PCAD-Memo domains were removed, because the added length artificially decreases e-values of multidomain proteins, making the single-domain proteins appear more divergent than multidomain proteins. Including all domains when calculating the network also introduced a spurious connection between the Memo family and the PCAD AMMECR1 fusion family from the PCAD clade, because they share a common domain. In some cases, families identified by BLASTCLUST (below) dissociated into multiple clusters in the sequence similarity network, because sequence divergence at domain boundaries made it difficult to cleanly excise the additional domains, or because variation in sequence divergence within and between families prevented identification of a single BLAST e-value cutoff that recapitulated the BLASTCLUST family assignments, without either fragmenting some families or fusing others. After removing additional domains, the data set was filtered to 50% sequence identity. Representatives that share <50% identity were used to construct the networks by pairwise BLAST using an e-value cutoff of 10⁻¹⁰ or 10⁻³⁰. The networks were visualized in Cytoscape using the Organic layout (136). In the second method, families were defined using the BLASTCLUST program (ftp://ftp.ncbi.nih.gov/blast/documents/blastclust.html), adjusting the length of aligned regions and bit-score density threshold empirically.
Divergent sequences or small clusters of sequences typically belonging to the same phylogeny were added to a family if other lines of evidence, including shared sequence motifs, shared structural synapomorphies, reciprocal BLAST search results, and/or shared genome context associations, supported inclusion (Table 1 and supporting data). Divergent sequences for which no such evidence was observed were left without family assignment. The results from the two techniques displayed good concordance, both recovering the primary PCAD-Memo clade split and identifying roughly the same total number of families in the superfamily.
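The effect of the e-value cutoff on the sequence similarity network in the first clustering method can be illustrated with a small sketch. It does not reproduce Pythoscape or Cytoscape; it assumes pairwise BLAST results are available as (query, subject, e-value) triples, treats every pair at or below the cutoff as an edge, and reports connected components as putative clusters. The function name and toy identifiers are hypothetical.

```python
def cluster_by_evalue(sequence_ids, pairwise_hits, evalue_cutoff):
    """Group sequences into connected components of a similarity network.

    sequence_ids: iterable of representative sequence IDs (network nodes).
    pairwise_hits: iterable of (query_id, subject_id, evalue) triples; both
        IDs are assumed to be present in sequence_ids.
    evalue_cutoff: an edge is drawn when evalue <= evalue_cutoff.
    Returns a sorted list of clusters, each a sorted list of IDs.
    """
    sequence_ids = list(sequence_ids)
    parent = {s: s for s in sequence_ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, evalue in pairwise_hits:
        if query != subject and evalue <= evalue_cutoff:
            root_q, root_s = find(query), find(subject)
            if root_q != root_s:
                parent[root_s] = root_q  # merge the two components

    clusters = {}
    for s in sequence_ids:
        clusters.setdefault(find(s), set()).add(s)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical toy network (not real data).
ids = ["a1", "a2", "a3", "b1", "b2"]
hits = [
    ("a1", "a2", 1e-40),
    ("a2", "a3", 1e-15),
    ("b1", "b2", 1e-35),
    ("a3", "b1", 1e-8),
]
print(cluster_by_evalue(ids, hits, 1e-10))
print(cluster_by_evalue(ids, hits, 1e-30))
```

With the looser cutoff the toy data resolve into two clusters, whereas the stricter cutoff splits `a3` from its family. This mirrors the fragmentation described above when no single cutoff recapitulates the BLASTCLUST assignments.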