PrenDB, a Substrate Prediction Database to Enable Biocatalytic Use of Prenyltransferases

Prenyltransferases of the dimethylallyltryptophan synthase (DMATS) superfamily catalyze the attachment of prenyl or prenyl-like moieties to diverse acceptor compounds. These acceptor molecules are generally aromatic in nature and mostly indole or indole-like. Their catalytic transformation represents a major skeletal diversification step in the biosynthesis of secondary metabolites, including the indole alkaloids. DMATS enzymes thus contribute significantly to the biological and pharmacological diversity of small molecule metabolites. Understanding the substrate specificity of these enzymes could create opportunities for their biocatalytic use in preparing complex synthetic scaffolds. However, there has been no framework to achieve this in a rational way. Here, we report a chemoinformatic pipeline to enable prenyltransferase substrate prediction. We systematically catalogued 32 unique prenyltransferases and 167 unique substrates to create possible reaction matrices and compiled these data into a browsable database named PrenDB. We then used a newly developed algorithm based on molecular fragmentation to automatically extract reactive chemical epitopes. The analysis of the collected data sheds light on the thus far explored substrate space of DMATS enzymes. To assess the predictive performance of our virtual reaction extraction tool, 38 potential substrates were tested as prenyl acceptors in assays with three prenyltransferases, and we were able to detect turnover in >55% of the cases. The database, PrenDB (www.kolblab.org/prendb.php), enables the prediction of potential substrates for chemoenzymatic synthesis through substructure similarity and virtual chemical transformation techniques. It aims at making prenyltransferases and their highly regio- and stereoselective reactions accessible to the research community for integration in synthetic work flows.

Prenylated primary and secondary metabolites, including indole alkaloids, flavonoids, coumarins, xanthones, quinones, and naphthalenes, are widely distributed in terrestrial and marine organisms. They exhibit a wide range of biological activities, including cytotoxic, antioxidant, and antimicrobial activities (1-3). Compared with their non-prenylated precursors, these compounds usually demonstrate distinct and often improved biological and pharmacological activities, which makes them promising candidates for drug discovery and development (1, 2, 4-6). These compounds can be considered hybrid molecules of prenyl moieties of different chain lengths (n·C5, where n is an integer) and aromatic skeletons originating from various biosynthetic pathways (7, 8). Prenyl transfer reactions (i.e. the connections of prenyl moieties to the aromatic nucleus) are catalyzed by a diverse family of prenyltransferases. Interestingly, this step usually represents the key transformation in the biosynthesis of such compounds. A prenyl moiety can be attached by prenyltransferases via its C1 (regular prenylation) or C3 (reverse prenylation) to carbon, oxygen, or nitrogen atoms of an acceptor (Fig. 1, A and B) (7, 8). Together with the observed regiospecific prenylations at different positions of an acceptor molecule, prenyltransferases contribute significantly to the structural and biological diversity of natural products (7).
Based on their amino acid sequences and biochemical and structural characteristics, prenyltransferases are categorized into different subgroups (7). In the last decade, significant progress has been achieved with the members of the dimethylallyltryptophan synthase (DMATS) superfamily, and >40 enzymes of this group have been identified and characterized by mining of fungal and bacterial genomes (7). These enzymes catalyze transfer reactions of a prenyl moiety from prenyl diphosphate (e.g. dimethylallyl diphosphate (DMAPP)) to diverse acceptors, such as tryptophan, tyrosine, tryptophan-containing cyclic dipeptides, xanthones, tricyclic or tetracyclic aromatic moieties, or even non-aromatic compounds. Among the acceptors, indole derivatives, including tryptophan and tryptophan-containing cyclic dipeptides, are substrates of most of the DMATS enzymes investigated so far (7, 9).
One of the challenges in the discovery and use of DMATS enzymes as biocatalysts in a rational and targeted manner is predicting whether a putative substrate will be accepted. On the one hand, the enzymes share similar structures, albeit often at low sequence identities, and catalyze, in many cases, similar reactions. On the other hand, different enzymes with similar natural substrates accept further non-native aromatic substances with clearly different activities (7). Therefore, bioinformatic and chemoinformatic approaches for the prediction of the catalytic activity of these enzymes are welcome and necessary to harness the full biosynthetic potential of this enzyme class.
We describe in this work the creation and evaluation of a database that catalogs and stores prenyltransferase reaction information. Because storage of the reactions is automated, the database is not static but will grow with each new reaction described in the literature. Furthermore, we present an application of PrenDB, where we predict and validate putative substrates for prenyltransferases (Fig. 2).

Results
PrenDB Statistics: Digitalization and chemoinformatic encoding of enzymatic reactions of the DMATS superfamily allow for a deep analysis of their substrate space and reactivity toward distinct chemical epitopes (Fig. 3). In total, 32 unique enzymes were found throughout the literature examined. The three most prominent prenyltransferases in terms of the number of annotated transformations are 7-DMATS, FgaPT2, and SirD, accounting for 15, 14, and 13% of the reactions in the database, respectively. At the other end of the spectrum, there are seven enzymes for which only a single reaction has been published. With respect to promiscuity, the number of unique reactive epitopes (molecular substructures centered around the reactive atom, henceforth called repitopes in this work) was used as a descriptor (cf. [...] space. Knowledge of the reactive atom and its surroundings makes a further distinction of enzymatic transformations possible; the Sankey diagram in Fig. 3 shows how the cataloged reactions can not only be linked to their prenyltransferases but also be subdivided into types according to the reactive atom. The clear majority of reactions (87%) corresponds to the regular type of prenyl moiety transfer (Fig. 1B, right), where the thermodynamically more stable regioisomer is formed. In 73% of all reactions and in all reverse attachments, the reactive atom is a member of a ring system (endo). Only a small part of regular prenylations occur at exocyclic atoms (exo, 26%). There, prenyl moieties are transferred onto oxygen and nitrogen atoms of tyrosine and aniline-like moieties by SirD, FtmPT2, or 4-DMATS. Reverse prenylation can be observed at carbon and nitrogen atoms only. These atoms are part of aromatic ring systems and, less frequently, of alicyclic moieties, such as benzoquinones. More than 60% of all reactions occur at aromatic carbon atoms; derivatives of indole, including tryptophan, are the most frequent repitopes in this largest reaction subclass. Much rarer are prenylations at nitrogen (8%) or at exocyclic oxygen atoms (23%). By comparison, 5% (33 entries) of reactions of tryptophan-like moieties (e.g. at atom position C2 in brevianamide F (E1); Fig. 1A) lead to formation of compounds with a fused ring system, where the atomic environment of the reactive atom becomes dearomatized during the reaction (FtmPT1, Fur7, and 4-DMATS).

Substrate Space: Throughout the analyzed literature (44 articles from 17 journals), 167 unique substrates were found. To analyze the substrate space diversity of prenyltransferases, a similarity matrix based on the pairwise ECFP4 (11) fingerprint molecular similarity was calculated (Fig. 4, Table 1, and supplemental Table S1), followed by a hierarchical clustering. This allows the grouping of substrates based on their chemical structure. From the corresponding dendrograms and supported by the reorganized distance matrix, five substrate classes distributed over 14 clusters can be deduced: (i) unsubstituted indoles, derivatives of tryptophan and proline-tryptophan diketopiperazines; (ii) derivatives of tyrosine with modifications [...] to the substrate space, respectively. Furthermore, the space spanned by the fragments obtained via bond cleavage during the fragment-based subgraph isomorphism perception process (cf. "Materials and Methods") is covered to an extent of 64% by tryptophan and diketopiperazine epitopes. This predominance of indoles can be explained by tryptophan and indole derivatives being the native substrates for 78% of the enzymes in PrenDB. This, combined with the DMATS bias in the literature, eventually leads to strongly indole-biased data.

Fig. 5 shows the knowledge about prenyltransferase reactions, as digitalized and stored within PrenDB, in terms of catalyzed transformations (combinations of a particular substrate and an enzyme) and the corresponding yield achieved. In the top right corner of the matrix, transformations of the most abundant substrates (tryptophan, tyrosine, diketopiperazines, and their derivatives) together with the most promiscuous enzymes (7-DMATS, FtmPT1, CdpNPT, SirD, and FgaPT2) can be found. At the same time, the matrix is sparse (i.e. contains many blanks). This sparsity is the result of the availability of data and thus reflects the research focus of the prenyltransferase field in the past. It presents a challenging starting situation for model building.
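The pairwise fingerprint similarity that underlies a matrix such as the one in Fig. 4 can be illustrated with a minimal sketch. It assumes that each fingerprint is available as a set of "on" bit indices; the three fingerprints below are hypothetical toy values, not real ECFP4 output, and the function names are illustrative rather than part of PrenDB or OEChem.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    if not bits_a and not bits_b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / len(bits_a | bits_b)


def similarity_matrix(fingerprints):
    """Return (names, matrix) with matrix[i][j] = Tanimoto(names[i], names[j])."""
    names = sorted(fingerprints)
    matrix = [
        [tanimoto(fingerprints[row], fingerprints[col]) for col in names]
        for row in names
    ]
    return names, matrix


# Hypothetical toy fingerprints (assumption: NOT derived from real molecules).
toy_fingerprints = {
    "indole_like_A": {1, 2, 3, 4, 5},
    "indole_like_B": {1, 2, 3, 4, 6},
    "tyrosine_like": {2, 7, 8, 9},
}

names, matrix = similarity_matrix(toy_fingerprints)
for name, row in zip(names, matrix):
    print(name, [round(value, 2) for value in row])
```

A distance matrix for hierarchical clustering can then be obtained as one minus each similarity value; the clustering method itself is not specified here.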
Repitopes: For each of the 665 cataloged reactions (each defined as a unique triplet of a substrate, product, and enzyme), a repitope (i.e. reactive epitope; see "Materials and Methods" for a complete definition) was extracted using the reactive atom detection and repitope reconstitution routines of the algorithm developed in this work (see "Materials and Methods"). Each repitope comprises four reconstitution depths (from 2 to 5 bond distances). Over all repitope depths, 276 unique repitopes, defined by their unique SMARTS string, were extracted. A SMARTS string is a one-dimensional encoding of chemical substructures and an efficient way to store a complete definition of each repitope (see "Materials and Methods"). A reconstitution depth of 5 delivered the largest contribution to the repitope space with 135 (49%) members. This is consistent with expectations, because larger depths will lead to more diverse descriptions. Depths 2, 3, and 4 account for 25 (9%), 69 (25%), and 94 (34%) repitopes, respectively. Interestingly, the sum over all reconstitution levels is greater than the total number of unique repitopes. This means that distinct combinations of the substrate molecule, its reactive atom, and the depth level are not mutually exclusive, thus resulting in duplicate entries. Fig. 6 shows the distribution of repitopes for the various depth levels of reconstitution, underlining the relationship between repitope size and diversity. Although the largest class of repitopes contributes the most to repitope space, smaller, more general and ambiguous repitopes are also among the most frequent repitopes; substructures of tyrosine, benzene, and indole are at ranks 3, 5, 6, 7, and 9, respectively, whereas complete and extended indole and tyrosine structures are at ranks 1, 2, and 4 and at ranks 8, 10, and 11.
Prediction of Novel Substrates via a Multistep Screening Procedure: In a sequential application of virtual screening tools (Fig. 2), beginning with prenylation prediction through repitopes stored in PrenDB and concluding with docking into three prenyltransferases with a known crystal structure (FgaPT2, FtmPT1, and CdpNPT), 38 virtual hits were selected through the following procedure. (i) A compound was considered a virtual hit if any PrenDB repitope could be found within its molecular framework at least once. Supplemental Table S2 shows the number of repitopes matching a particular hit, with a high repitope hit rate indicating promiscuous compounds (i.e. molecules that are classified as substrates of multiple enzymes). Using repitopes based on a reconstitution depth of 3, 168,906 compounds were selected in this first step. (ii) Comparison of molecular properties with those of known substrates and removal of molecules outside the respective ranges (Table 2) reduced the number of virtual hits to 90,559. (iii) Going beyond one- and two-dimensional molecular descriptors and ensuring that the three-dimensional shape (judged by a high score in the OEChem shape congruency tool; see "Materials and Methods") matched between putative and known substrates led to a selection of 451 compounds. (iv) This repitope-, property-, and shape-based determination of the prenylation potential of the selected compounds was further refined by docking; for each compound, an optimal enzyme structure for docking was selected based on the compound's structural overlap with the co-crystallized substrate. The amount of this overlap was quantified by the same shape congruency methodology mentioned above but was automatically invoked from within the docking application HYBRID (see "Materials and Methods"). The generated poses, from which 38 molecules were selected for experimental validation, show a distinct geometrical consensus of the key interactions with the enzymes (Fig. 7): first, polar interactions [...]

Novel Substrates for Prenyltransferases FgaPT2, FtmPT1, and CdpNPT: To assess the predictive performance of our virtual screening, the 38 potential substrates were tested as prenyl acceptors in enzyme assays with the tryptophan prenyltransferase FgaPT2 and the two tryptophan-containing cyclic dipeptide prenyltransferases FtmPT1 and CdpNPT. The selected substances clearly differ structurally from the substrates for the DMATS prenyltransferases reported previously (7, 10). The [...] S2 and Fig. 8, 23 of these substances were accepted by FtmPT1, 22 by FgaPT2, and 25 by CdpNPT. In relation to the number of hits selected from our virtual screen, this corresponds to a hit rate of 60.5% for FtmPT1, 57.9% for FgaPT2, and 65.8% for CdpNPT. Product yields of >50% were observed for 12 substrates with FtmPT1, 7 with FgaPT2, and 10 with CdpNPT. The prenylated products can be detected in a straightforward manner as signals in their corresponding mass spectra; their [M + H]+ ions are shifted by 68 Da relative to their educts. Overall, we thus obtained high hit rates and yields >50% in the case of 29 reactions (25% of all attempted reactions).
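The hit rates quoted above follow directly from the counts of accepted compounds. The short sketch below only reproduces that arithmetic from the numbers given in the text; the variable names are illustrative.

```python
N_TESTED = 38  # compounds selected from the virtual screen

# Counts taken from the text: accepted substrates and yields >50% per enzyme.
accepted = {"FtmPT1": 23, "FgaPT2": 22, "CdpNPT": 25}
high_yield = {"FtmPT1": 12, "FgaPT2": 7, "CdpNPT": 10}

# Per-enzyme hit rate in percent, rounded to one decimal place.
hit_rates = {enzyme: round(100 * n / N_TESTED, 1) for enzyme, n in accepted.items()}

# Share of all attempted enzyme-compound reactions with a yield >50%.
attempted_reactions = N_TESTED * len(accepted)
high_yield_reactions = sum(high_yield.values())
high_yield_share = round(100 * high_yield_reactions / attempted_reactions, 1)

print(hit_rates)
print(high_yield_reactions, attempted_reactions, high_yield_share)
```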

Similarity Analysis of Known Substrates and Selected Compounds: To assess the novelty of the 38 selected compounds, the similarity with the substrate space cataloged within PrenDB (167 substrates) was calculated and visualized by generating a similarity matrix based on the ECFP4 fingerprint-based Tanimoto similarity (Fig. 9, left). The matrix shows an overall low similarity score between our selection and the known substrate space. This points toward the potential to access truly novel substrate space by employing repitopes. Of note, the similarity is higher in columns corresponding to compounds that were successfully prenylated in our assays by at least two of our test enzymes. The right panel of Fig. 9 shows similarity scores as calculated by our in-house fragment-based method RedFrag (12); in contrast to ECFP4, RedFrag compares the fragmental composition of molecules and the two-dimensional arrangement of fragments. RedFrag accentuates the commonalities and differences between known substrates and our selections. Compound 1, a tryptophan-homoproline-diketopiperazine (94.1% yield on FtmPT1), shows high similarity scores with tryptophan, its indole core derivatives, and, expectedly, tryptophan-tryptophan-, tryptophan-alanine-, tryptophan-glycine-, and tryptophan-proline-diketopiperazines from clusters 1, 2, and 3, respectively. Of note, RedFrag emphasizes the similarity of compound 1 and cluster 3 based on the presence of the indole scaffold. In contrast, ECFP4 emphasizes the dissimilarity of this compound originating from the absence of the diketopiperazine motif in the same cluster. Compounds 12, 26, and 32 show high similarity with substrates from cluster 3 (also 9, 10, and 11). Compounds 12 and 32 are regioisomers of a brominated tryptophan derivative. They show distinctly different yields: 14.4 and 8.5% for 12 in CdpNPT and FtmPT1; 47.5 and 44.7% for 32 in FtmPT1 and FgaPT2, respectively. It is evident that the position of the bromine atom has a major impact on the role of such compounds as substrates. The influence of the regiochemistry of indole core substitutions or single-atom replacements at this core is further exemplified by compound 26. Its benzothiophene moiety (replacing the nitrogen atom in an indole by a sulfur atom) is not accepted as a substrate by any of the three test enzymes.
Selected compounds with low similarity but remarkable yields indicate novel substrate classes or motifs; compound 16 shows a good yield in FtmPT1 and moderate yields in FgaPT2 and CdpNPT (68.8, 34.8, and 29.7%, respectively). Its conjugated indole-4-imidazolin-2-one motif has no similar counterparts within the known substrate space. This is also true for compound 30 (yields of 60.1, 59.[...] incubations with two indole derivatives, indole 3-ethyl ethylsulfonamide (21) and 6-benzyloxyindole (30), which were very well (98.6% yield) and moderately (60.1% yield) accepted by this enzyme, respectively (Fig. 10 and supplemental Table S2). As shown in Fig. 11A, a single dominant peak was observed in the LC-MS chromatogram of each incubation mixture and was isolated on a Multospher 120 RP-18 column for structure elucidation. Surprisingly, 1H NMR data revealed the presence of two compounds in each reaction mixture. 21a and 21b originated from 21 in a ratio of 1:1, and 30a and 30b originated from 30 in a ratio of 1:0.85. Further purification resulted in four pure products. Through NMR and MS analyses, the structure of 21a was subsequently elucidated as a regularly N1-prenylated derivative. The second product, 21b, was identified as a reversely C3-prenylated derivative with a simultaneous cyclization of C2 of the indole with the nitrogen atom of the side chain located at C3, resulting in the formation of a 6/5/5 fused ring system. Compounds 30a and 30b were proven to be regularly C2- and C5-prenylated derivatives, respectively. These results unequivocally proved specific prenylations at the indole ring without or with additional modifications, such as cyclization. The relationships between the enzymes, the substrates listed in supplemental Table S2, and their products are under further investigation.
A comparison of the elucidated structures of the products 21a, 21b, 30a, and 30b with the PrenDB-predicted prenylation sites of their corresponding educts reveals that the prenylation site was correctly predicted in two of four cases. However, the responsible enzyme, FtmPT1, was only proposed for the prenylation of 21 to yield 21a. In the case of 30, the product 30b was predicted to originate from the enzyme CdpNPT or FgaPT2.

Discussion
This study demonstrates the power of systematically organizing and analyzing diverse and disparate experimental enzymatic data by means of chemoinformatic methods. Besides providing a comprehensive repository of the existing knowledge about prenyltransferase reactions, the determination of repitopes allowed us to predict novel substrates that are distinctly different from the ones that have been identified previously, both natural and synthetic. Moreover, we achieved an overall high hit rate of 71% in terms of molecules that were accepted by at least one prenyltransferase. However, it has to be noted that the repitopes stored in PrenDB are not yet accurate enough in all cases to precisely predict the correct enzyme and/or the correct reactive atom. This shortcoming is presumably correlated with the comparatively small number of instances in the database. Although the existing body of literature clearly represents a considerable experimental effort, chemistry and the biochemical reactivity of enzymes are so diverse that even higher numbers of substrate-enzyme-product triplets would be necessary to obtain more complete repitopes that also account for the different reactivity of certain substructures. The chemoinformatic strategy that we employed in this work is certainly flexible enough to accurately model more fine-grained patterns.
At the same time, a database such as PrenDB can provide excellent help in determining which reactions and substrates would be worthwhile to test next. On a basic level, one could simply be guided by the number of reactions already described for each enzyme and focus on the underrepresented ones. More sophisticated approaches can also be envisioned; enzyme phylogenetic trees could be based not on amino acid sequence, but on substrate similarity. Further exploration would thus focus on filling in the missing links. Ultimately, such strategies might merge with machine learning approaches, where the algorithm itself would suggest which enzyme-substrate pairs to test next based on the maximum information gain of each investigation.
Nevertheless, despite some shortcomings, our database and resulting prediction algorithm are already useful for correctly predicting a large number of substrates and thereby aiding in the creation of novel chemical matter. The imprecisions could also be taken as a strength in this context, in that they allow for serendipitous discoveries (e.g. the reverse prenylation of 21 to 21b by FtmPT1).
Last, it has to be emphasized that the concept of repitopes and their fragment-based determination can easily be extended to other enzymatic reactions. The automatic processing of potentially large numbers of reactions and the concomitant conversion into the reaction principles (i.e. repitopes) will lead to facile systematizations and gain of knowledge from the analyses of the emerging data.

Figure caption fragment: [...] (21) was regularly prenylated at position N1, leading to 21a, and reversely prenylated at position C3 with a simultaneous formation of a 6/5/5 fused ring system (21b). Typically for C3-prenylation at indole substructures, a dearomatization and intramolecular cyclization accompany the prenylation reaction. B, 6-benzyloxyindole (30) was regularly prenylated at positions C2 and C5.
The high hit rates (58-66%) for each enzyme and the fact that one-fourth of the reactions had a yield of ≥50% demonstrate the excellent performance of our knowledge-based repitope approach. The combination of PrenDB and its ligand-based approach with protein structure-based tools, such as docking, therefore seems to constitute a powerful strategy. Furthermore, these results prove the potential usefulness of the tested enzymes for the production of prenylated derivatives.

Materials and Methods
The prenylation reaction as conducted by the enzymes of the DMATS superfamily formally corresponds to a substitution reaction occurring on carbon, oxygen, and nitrogen atoms of small metabolites through the transfer of small apolar moieties (denoted as dma (short for dimethylallyl) in the following example). The leaving group is always a pyrophosphate (PPi). [...] Square brackets enclose individual atoms, and adjacent atoms are taken to be linked by covalent bonds. Letters denote elements, gB is the general base, and PPi is pyrophosphate; commas represent a logical OR; and numbers are arbitrary labels to allow for unambiguous tracking of each atom. In the above example, it can be seen that the hydrogen atom with label 2 ([H:2]) is substituted by the dma group and moves from its adjacent carbon, oxygen, or nitrogen atom (labeled 1) to the general base (label 5). Fig. 1A illustrates this general transformation [...] (GA in Fig. 1C) into the prenylated product (GP), at the same time generating a protonated general base (PGB) and pyrophosphate (PP) as by-products. Although feasible in silico, a chemical transformation based solely on the reactive atom is unreasonable and ambiguous; in reality, only carbon, nitrogen, and oxygen atoms located within the correct atomic surroundings can undergo prenylation. Thus, the entire molecule, or at least a crucial motif within it, is necessary to completely characterize a reactive environment. We call such a set of atoms, consisting of the reactive atom to which the transferred moiety will be attached and its neighboring atoms, a "reactive epitope" (repitope for short). In the case of the transformation of E1, the corresponding repitope is shown in Fig. 1D. The specification of the carbon atom can now be extended to its full repitope notation, where lowercase letters denote membership of an atom in an aromatic system, HX indicates the presence of X adjacent hydrogen atoms, and parentheses indicate branching of the molecular framework. From this notation, it can be concluded that the reactive atom is [...]

The corresponding SMARTS notation for each repitope could in principle be deduced by hand, given the chemical structures of the substrate and product molecules. To achieve an efficient handling of several hundred enzymatic transformations, with only the two-dimensional structures of substrate, product, and the transferred moiety as input, an automated procedure for the extraction of transformation SMARTS, and thus repitopes, appears to be as indispensable as it is difficult to accomplish. A fully specified repitope requires knowledge of the reactive atom as well as its surroundings. Repitope deduction can be accomplished by applying subgraph isomorphism-based algorithms followed by the reconstitution of the chemical environment. Both steps (reactive atom perception and repitope reconstitution) will be described in detail below. All coding was done in Python. For chemoinformatic calculations, the Python wrappers of the RDKit library (open source toolkit for chemoinformatics; accessed March 16, 2016) were utilized. Fingerprint-based similarity calculation was carried out with the OEChem toolkit (OpenEye Scientific Software, Inc., Santa Fe, NM).

Perception of the Reactive Atom
In the case of a simple linear substitution reaction, as depicted in Fig. 1A, the reactive atom can be found by mapping the molecular structure of the substrate molecule onto the molecular structure of the product. For non-symmetric molecules, this leads to a unique match with an atom-to-atom correspondence between substrate and product. Because the number of atoms in the product is always greater than the number in the substrate, the substrate is a substructure of the product (i.e. its complete molecular skeleton can be found within that of the product). With the same approach, the atom-to-atom correspondence between the transferred moiety (i.e. the prenyl group) and the product molecule can be obtained. The intersection of the atom-to-atom matched sets of the substrate and the transferred moiety consists of only one atom, the reactive atom (Fig. 12A). If, however, the enzymatic transfer of a moiety is accompanied by a subsequent (or concerted) rearrangement of the molecular skeleton of the product (e.g. a cyclization), the substrate cannot be considered to be a direct substructure of the product anymore. Thus, the reactive atom can no longer be determined through the atom-wise substrate-to-product mapping as described above. In such a case, a possible strategy for establishing a substructure correspondence (i.e. a subgraph isomorphism) would be to weaken atom or bond type matching criteria. The resulting atom-to-atom correspondences are more general but often ambiguous at best. To circumvent this problem, we assumed that, although the entire substrate may undergo dramatic changes in its molecular skeleton, smaller structural motifs (molecular fragments) remain unaffected by such transformations and can therefore still be unambiguously mapped onto the substrate structure before and after the reaction.

Fig. 12B illustrates the consecutive steps in this fragment-based substructure isomorphism approach. In contrast to the aforementioned subgraph isomorphism based on the entire substrate structure, an additional fragmentation step has to be performed. Breakable bonds (bonds connecting ring systems with other ring systems or with acyclic motifs; see Fig. 13 (A and B) for exemplary fragmentations and the breakable bond definition) are cleaved, leading to a set of fragments. An intermediate filter step ensures that very small fragments (single atoms, linker moieties, terminal groups) are not considered further. The remaining fragments are mapped onto the product structure, and their atom-to-atom correspondence is investigated for an intersection with the atom mappings of the transferred moiety and the product molecule. By preserving the atom-to-atom correspondences between the substrate molecule and its fragments, the reactive atom can be identified by finding the intersecting atom of one of the matching fragments in the substrate.
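The intersection step of the substrate-based approach can be sketched as follows. This is a minimal illustration, not the implementation used for PrenDB. It assumes that both atom mappings are already available as tuples of product atom indices ordered by query atom (the shape that a substructure matcher such as RDKit's `GetSubstructMatch` returns) and that the moiety pattern includes its attachment atom, so that the two mappings can overlap. The function name and the example indices are hypothetical.

```python
def find_reactive_atom(substrate_match, moiety_match):
    """Return the substrate atom index shared with the transferred moiety.

    substrate_match[i] is the product atom matched by substrate atom i.
    moiety_match holds the product atoms matched by the moiety pattern
    (assumption: the pattern includes the attachment atom).
    """
    shared = set(substrate_match) & set(moiety_match)
    if len(shared) != 1:
        # Zero or several shared atoms: the mapping is ambiguous, e.g. after
        # a skeletal rearrangement; a fragment-based mapping would be needed.
        raise ValueError(f"expected exactly one shared atom, found {len(shared)}")
    product_atom = shared.pop()
    return substrate_match.index(product_atom)


# Hypothetical mappings: substrate atoms 0..3 map to product atoms 5..8, and
# the moiety pattern maps to product atoms 7 (attachment atom) and 0..4.
substrate_match = (5, 6, 7, 8)
moiety_match = (7, 0, 1, 2, 3, 4)
print(find_reactive_atom(substrate_match, moiety_match))
```

In the fragment-based variant, the same intersection would be computed per fragment, with an additional lookup from fragment atom indices back to substrate atom indices.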

Reconstitution of the Reactive Epitope
As already mentioned, knowledge of the reactive atom alone is only of limited use for substructure searches or virtual transformations, because both methods yield ambiguous results when only a single atom is given as input. It is therefore necessary to rebuild the chemical environment of the reactive atom to obtain a description of a particular transformation that has discriminative power. To obtain such a description, the reactive atom is augmented with additional atoms from its first, second, third, etc. neighbor shells (Fig. 14), i.e. by traversing the atomic neighborhood of the reactive atom up to a fixed distance (number of bonds). The traversed atoms are then extracted as a molecular subset and converted into a regular molecular object and, eventually, a SMARTS string. Different depths of reconstitution lead to either small and nonspecific repitopes (d = 1) or larger and more stringent ones for large depths (d > 3). In extremis, at the largest possible depth, the repitope becomes identical to the molecule itself. Thus, a balance has to be found, and the most useful repitopes are able to represent reasonable chemical environments for a particular reaction but still allow for certain flexibility and diversity in retrieving putative substrates.
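The neighbor-shell traversal described above amounts to a breadth-first search over the molecular graph that stops at a fixed bond distance. The sketch below assumes the molecule is given as a plain adjacency list; the function name is illustrative, and the conversion of the selected atoms into a SMARTS string is deliberately left out. The example graph encodes the nine heavy atoms of indole with an arbitrary index assignment.

```python
from collections import deque


def atoms_within_depth(adjacency, reactive_atom, depth):
    """Map each atom within `depth` bonds of `reactive_atom` to its distance."""
    distance = {reactive_atom: 0}
    queue = deque([reactive_atom])
    while queue:
        atom = queue.popleft()
        if distance[atom] == depth:
            continue  # outermost shell reached; do not expand further
        for neighbor in adjacency[atom]:
            if neighbor not in distance:
                distance[neighbor] = distance[atom] + 1
                queue.append(neighbor)
    return distance


# Indole heavy atoms with arbitrary indices (assumption for illustration):
# 0=N1, 1=C2, 2=C3, 3=C3a, 4=C4, 5=C5, 6=C6, 7=C7, 8=C7a
indole = {
    0: [1, 8], 1: [0, 2], 2: [1, 3], 3: [2, 4, 8], 4: [3, 5],
    5: [4, 6], 6: [5, 7], 7: [6, 8], 8: [7, 0, 3],
}

for d in (1, 2, 3):
    print(d, sorted(atoms_within_depth(indole, reactive_atom=1, depth=d)))
```

For this small graph, a depth of 4 from C2 already covers all nine atoms, which illustrates how large depths make the repitope approach the whole molecule.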

Database of Prenylation Reactions
With the algorithmic tools to deconstruct a given transformation catalyzed by a prenyltransferase into a reaction SMARTS and the corresponding repitopes in hand, investigation of as many transformations as possible can readily be conducted. Hence, we decided to create a database (PrenDB) storing the known transformations in an efficiently browsable and queryable manner. For this purpose, a literature search was performed to extract substrates, products, enzymes, and available metadata (such as kinetics and yields) from 44 publications (full articles, reviews, and communications) across 17 journals. Each enzymatic reaction is represented by the SMILES strings (13) of the product and substrate molecules, combined with the preferred name of the involved enzyme. The advantage of using SMARTS is that each reaction can be visualized and processed with common chemoinformatic software. Furthermore, each reaction entry contains multiple repitopes, generated with the aforementioned algorithm and different environmental depths (2-5 bonds around the detected reactive atom). This reaction table (a table is a collection of database entries that are semantically equal) is supported by and connected with further tables holding metadata extracted from the literature and/or calculated with chemoinformatic tools (Fig. 15). The molecule dictionary table comprises all small molecules involved in the reaction: substrates, products, transferred moieties (such as DMAPP or benzylpyrophosphate), and fragments. Additionally, each entry comes with a molecular properties table, where basic physico-chemical properties can be looked up. The reference table contains the literature used for data extraction together with hyperlinks to articles and entries on PubMed and UniProtKB. PrenDB can be browsed and extended in a straightforward manner with the Python scripts bundled with the repitope generation algorithms described in this work or, more conveniently, via a web interface. Because of access speed and portability considerations, we decided to use the sqlite3 backend as the underlying database architecture and the Django Python package for middleware and frontend.

FIGURE 12. A, substrate-based subgraph isomorphism. The substrate structure matches the product as a whole. The intersection of atom overlaps between substrate and prenyl moiety delivers the reactive atom (orange arrows, index 9). B, fragment-based subgraph isomorphism. Substrate structure is fragmented into smaller epitopes preserving the substrate-fragment atom matchings. By matching fragments onto the product and analyzing the intersection with the prenyl moiety, the reactive atom can be found within the structure of the substrate (orange arrows, index 1).
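The relation between the reaction, molecule, repitope, and reference tables can be illustrated with a minimal sqlite3 sketch. All table and column names below are hypothetical and do not reproduce the actual PrenDB schema or its Django models; the single example entry (prenylation of tryptophan by FgaPT2) is written with stereochemistry omitted and serves only to show how a reaction row links a substrate to a product.

```python
import sqlite3

# Hypothetical table layout; names are illustrative, not the PrenDB schema.
SCHEMA = """
CREATE TABLE molecule (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    smiles TEXT NOT NULL UNIQUE
);
CREATE TABLE literature_reference (
    id       INTEGER PRIMARY KEY,
    citation TEXT NOT NULL
);
CREATE TABLE reaction (
    id           INTEGER PRIMARY KEY,
    enzyme       TEXT NOT NULL,
    substrate_id INTEGER NOT NULL REFERENCES molecule(id),
    product_id   INTEGER NOT NULL REFERENCES molecule(id),
    reference_id INTEGER REFERENCES literature_reference(id)
);
CREATE TABLE repitope (
    id          INTEGER PRIMARY KEY,
    reaction_id INTEGER NOT NULL REFERENCES reaction(id),
    depth       INTEGER NOT NULL CHECK (depth BETWEEN 2 AND 5),
    smarts      TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")  # in-memory database for the sketch
conn.executescript(SCHEMA)
cur = conn.cursor()

# Example molecules (SMILES without stereochemistry, for illustration only).
cur.execute("INSERT INTO molecule (name, smiles) VALUES (?, ?)",
            ("tryptophan", "NC(Cc1c[nH]c2ccccc12)C(=O)O"))
substrate_id = cur.lastrowid
cur.execute("INSERT INTO molecule (name, smiles) VALUES (?, ?)",
            ("4-dimethylallyltryptophan", "CC(C)=CCc1cccc2[nH]cc(CC(N)C(=O)O)c12"))
product_id = cur.lastrowid

cur.execute("INSERT INTO reaction (enzyme, substrate_id, product_id) VALUES (?, ?, ?)",
            ("FgaPT2", substrate_id, product_id))
conn.commit()

# Join the reaction table with the molecule dictionary.
rows = conn.execute(
    """
    SELECT r.enzyme, s.name, p.name
    FROM reaction AS r
    JOIN molecule AS s ON s.id = r.substrate_id
    JOIN molecule AS p ON p.id = r.product_id
    """
).fetchall()
```

Keeping molecules in one dictionary table and referring to them by key lets the same compound act as a substrate in several reactions without duplicating its structure.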

Virtual Screen for Putative Substrates of Prenyltransferases
To predict novel substrates for transformation by a prenyltransferase, a multistep screening process was carried out with a subset of the ZINC database (14), which stores commercially available small molecules in ready-to-be-processed formats. First, the ZINC clean leads database, with a total of 5.1 million compounds, was filtered for the presence of any of the repitopes extracted from PrenDB. The repitope depth was 3. The screening was carried out as substructure searches using the Python wrappers of the OEChem toolkit. Second, compounds were filtered utilizing the MolProp toolkit (OpenEye Scientific Software). Only compounds with physico-chemical properties within the range spanned by the known substrates cataloged in PrenDB were allowed (Table 2). Third, the remaining compounds were submitted to a shape congruency analysis based on the OEChem API. In short, for each compound, a low energy conformer was generated, and its three-dimensional overlay with each known substrate was optimized. Compounds with an overlay score >0.9, reflecting excellent three-dimensional shape matching, were allowed for the next step. Fourth, the remaining compounds were docked into the three most promiscuous prenyltransferases for which a crystal structure had been determined (FgaPT2, FtmPT1, and CdpNPT; Protein Data Bank codes 3I4X, 3O2K, and 4E0U, respectively), employing the multi-target HYBRID (15) engine; for each compound, up to 200 conformers were generated with OMEGA (16). The ensemble of conformations of each molecule was then overlaid with the co-crystallized ligand in each of the three selected crystal structures to determine the best suited enzyme for the following exhaustive docking. The method for overlaying conformers is built directly into the HYBRID engine and is based on the same methodology as implemented in the OEChem API and the ROCS application (17). For the actual docking step (translational and rotational optimization of a compound conformer within the binding site of the protein), HYBRID scores for a given protein-ligand complex were calculated based on the shape and electrostatic complementarity of the ligand and the protein's binding site (Fig. 16). Shape and electrostatic features are represented by Gaussian potentials. During optimization, the overlap between ligand and protein features is maximized. After docking, the calculated poses were visually inspected to remove those that form improbable interactions not sufficiently penalized by present-day scoring functions. The selected compounds were acquired from their respective vendors and experimentally tested.
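The property and shape filters of the screening cascade can be summarized in a short, toolkit-independent sketch. It assumes that the allowed range of each property is simply the minimum and maximum observed among the known substrates and that each candidate already carries its best overlay score; the property names and all numerical values are invented for illustration, and the OpenEye calls used in the actual workflow are not reproduced here.

```python
def property_ranges(known_substrates):
    """Per-property (min, max) spanned by the known substrates."""
    keys = known_substrates[0].keys()
    return {k: (min(m[k] for m in known_substrates),
                max(m[k] for m in known_substrates)) for k in keys}

def within_ranges(properties, ranges):
    """True if every property lies inside the range spanned by known substrates."""
    return all(low <= properties[k] <= high for k, (low, high) in ranges.items())

def screen(candidates, known_substrates, shape_cutoff=0.9):
    """Apply the property filter, then the shape-overlay cutoff (score > cutoff)."""
    ranges = property_ranges(known_substrates)
    hits = []
    for cand in candidates:
        if not within_ranges(cand["properties"], ranges):
            continue  # outside the property space of known substrates
        if cand["best_overlay_score"] > shape_cutoff:
            hits.append(cand["id"])
    return hits

# Invented toy values; not data from PrenDB or Table 2.
known = [
    {"mol_weight": 117.2, "h_bond_donors": 1},
    {"mol_weight": 204.2, "h_bond_donors": 3},
    {"mol_weight": 350.0, "h_bond_donors": 2},
]
candidates = [
    {"id": "cand_a", "properties": {"mol_weight": 250.0, "h_bond_donors": 2},
     "best_overlay_score": 0.95},
    {"id": "cand_b", "properties": {"mol_weight": 420.0, "h_bond_donors": 2},
     "best_overlay_score": 0.97},
    {"id": "cand_c", "properties": {"mol_weight": 180.0, "h_bond_donors": 1},
     "best_overlay_score": 0.85},
]
hits = screen(candidates, known)
```

Only candidates that pass both filters would proceed to the more expensive docking step, which is the reason for ordering the cascade from cheap to costly calculations.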

Experimental Validation
Chemicals, Bacterial Strains, and Culture Conditions. DMAPP was synthesized according to the method reported previously for geranyl diphosphate (18).
FIGURE 15. Design of PrenDB. The database tables are related to each other in a one-to-one (reactions and repitopes), one-to-many (substrates and reactions), or many-to-many (substrates and fragments) relationship, reflecting their real world correspondence. The central reaction dictionary holds the necessary data to encode a reaction based on substrate, product, and cofactor molecules, the enzyme, and the resulting repitope. A reference table is added to supplement the database with metadata and enhance its usability. Dashed lines, abstract inheritance; solid wedged lines, one-to-many relationships (e.g. a molecule can act as substrate in as many reactions as an enzyme). A repitope belonging solely to one particular reaction is indicated by a straight solid line (one-to-one relationship).
Enzyme Assays with Recombinant Proteins. In the assays to determine the acceptance of the different substrates, the enzyme reaction mixtures contained 50 mM Tris-HCl, pH 7.5, 10 mM CaCl2, 2 mM DMAPP, 2-7.5% (v/v) glycerol, 1-2% (v/v) DMSO, 1 mM aromatic substrate, and 0.4 mg·ml⁻¹ purified recombinant protein in a volume of 100 µl. The reaction mixtures were incubated at 37 °C for 16 h and terminated by the addition of an equal volume of methanol. The reaction mixtures were brought to dryness by vacuum evaporation, subsequently resuspended in 100 µl of methanol, and centrifuged at 13,000 rpm for 15 min. Five µl of the supernatants were analyzed by LC-MS.
For isolation of the enzyme products, the reaction mixtures were scaled up to 10 ml, containing 50 mM Tris-HCl, pH 7.5, 10 mM CaCl2, 2 mM DMAPP, 2-7.5% (v/v) glycerol, 1-2% (v/v) DMSO, 1 mM aromatic substrate, and 0.4 mg·ml⁻¹ purified recombinant protein, and incubated at 37 °C for 16 h. The reactions were terminated by the addition of 10 ml of methanol and brought to dryness by using a rotary evaporator at 37 °C. The residues were resuspended in 1 ml of methanol, centrifuged at 13,000 rpm for 15 min, and purified on an HPLC device.
LC-ESI-HRMS Analysis of the Reaction Mixtures. The treated enzyme reaction mixtures (5 µl) mentioned above were analyzed on an Agilent 1260 Infinity HPLC system (Böblingen, Germany) in combination with a photodiode array detector and a Bruker micrOTOF-Q III mass spectrometer. For separation, a Multospher 120 RP-18 column (250 × 2 mm, 5 µm, CS-Chromatographie Service, Langerwehe, Germany) with a flow rate of 0.25 ml·min⁻¹ was used. Water (solvent A) and MeCN (solvent B), both containing 0.1% (v/v) formic acid, were used for a linear gradient of 5-100% (v/v) solvent B in A in 40 min. Subsequently, the column was washed with 100% solvent B for 5 min and equilibrated with 5% (v/v) solvent B for 10 min. The separations were monitored with the Bruker micrOTOF-Q III mass spectrometer using positive ion ESI. HPLC and MS data were processed by using Bruker Compass DataAnalysis version 4.2 (build 383.1) software.
Isolation of Enzymatic Products. Isolation of the enzyme products was performed on an Agilent HPLC series 1200. The separation was carried out on a MultoHigh Chiral AM-RP column (250 × 10 mm, 5 µm, CS-Chromatographie Service) with a flow rate of 1 ml·min⁻¹ and different linear gradients of methanol in water.
NMR Analysis. The isolated enzyme products were brought to dryness by using a rotary evaporator at 37 °C and dissolved in 0.7 ml of CD3OD. NMR spectra were recorded on a JEOL ECA 500-MHz spectrometer (JEOL Germany GmbH, Munich, Germany). The signal of CD3OD at 3.31 ppm was used as an internal reference for chemical shifts. Data processing was done by using MestReNova version 6.0.2-5475 software.

FIGURE 2. Schematic of the underlying work flow. A, reactions of prenyltransferases were digitized and stored in PrenDB. The reactive atoms were detected algorithmically and reconstituted to reactive epitopes (repitopes). Prenylation candidates were selected by a multistep virtual screening approach, and their transformation potentials were evaluated experimentally on selected prenyltransferases. B, multistep virtual screening of database compounds (colored triangles): (i) repitope-based substructure search based on the substrate space covered in PrenDB; (ii) one-dimensional property and three-dimensional shape comparison with known substrates; and (iii) docking screen on three promiscuous prenyltransferases with available crystal structure.

FIGURE 4. Hierarchical clustering of the substrates extracted from PrenDB. Individual clusters derived from the dendrograms are annotated with an image of the corresponding cluster representative. Magenta and brown bars indicate 14 detected clusters. Black horizontal lines on the leaves of the dendrogram indicate the number of molecules grouped together.
PrenDB, a Substrate Prediction Database. MARCH 10, 2017 • VOLUME 292 • NUMBER 10, JOURNAL OF BIOLOGICAL CHEMISTRY 4009

with the general base Glu-89/102/116; second, occupation of the apolar indole subpocket and hydrogen bond interactions in the vicinity of the opening of the active site, residues His-279 and Arg-244.

FIGURE 6. Distribution of repitopes of different depth in repitope space. Dashed lines, conjugated bonds.

a Solubility categories (insoluble, poorly, moderately, soluble, very, highly) are derived from reparametrized atom types from the XLogP algorithm within the OEChem toolkit.
b A compound is considered an aggregator if it is an exact match with one of approximately 400 published aggregators compiled in the OEChem toolkit.

FIGURE 7. Molecular features extracted from poses of the selected virtual hits. Left, red spheres, hydrogen bond acceptors; yellow discs, aromatic moieties. The majority of acceptor functionalities can be found around the basic residues Arg-255 and His-279. Right, blue spheres indicate hydrogen bond donors. They are located around the highly conserved glutamate (Glu-89/102/116) and in the vicinity of backbone carbonyls (omitted for clarity).
4, and 96.1%, respectively) and its benzylated hydroxyl-indole structure. Compound 27, a pyrimidine-indole, shows excellent yield in FgaPT2 (82.9%). Further examples with high yields but RedFrag similarity scores <0.6 are compounds 11, 13, 20, 23, and 33.
Structure Elucidation of the Products of Compounds 21 and 30. To investigate at which position within a given substrate the prenylation occurred, we carried out exemplary FtmPT1

FIGURE 8. Transformation rates of virtual hits relative to L-tryptophan obtained for the three examined enzymes FtmPT1, CdpNPT, and FgaPT2. Horizontal bars, mean; vertical bars, S.D.; orange interval, S.E.; colored circles, data points.

FIGURE 9. Similarity matrix between the selected compounds and known substrates for prenyltransferases extracted from PrenDB. Left, ECFP4 fingerprint similarity. Middle, RedFrag scores calculated with ECFP4 fingerprints. Color coding (top), green, yield >50%; yellow, yield between 1 and 50%; gray, no transformation. Right, magenta and brown bars indicate 14 detected clusters. Black vertical lines on the leaves of the dendrogram indicate the number of molecules grouped together.
aromatic and is bound to one hydrogen atom; its direct neighbors are an aromatic nitrogen and another aromatic carbon without any attached hydrogen atoms. The second neighbor shell consists of two aromatic carbon atoms and one aliphatic carbon atom. This convenient one-line notation of chemical environments is called SMARTS (one-line molecular patterns; Daylight Chemical Information System, Inc. website; accessed March 16, 2016) and is widely used in the field of chemoinformatics, especially for substructure searches. Atoms, their properties, and binding characteristics are encoded with alphanumeric characters. Multiple molecule SMARTS together with the information about bond breakage and formation yield the SMIRKS of a reaction. With a repitope, such as the one described above, and the enzymatic transformation encoded in SMIRKS notation, it is possible to virtually transform any substrate molecule into its corresponding prenylation product or to easily search for putative substrates by invoking substructure-based virtual screens in publicly available vendor databases.

FIGURE 13. A, fragmentation rule expressed as SMARTS string. The rule consists of two atom definitions and a bond definition. Enclosed in square brackets (blue) are atoms connected by any bond type except for a ring bond (orange). The atom on the left can be of any type but must be a member of a ring system (hydrogen atoms are excluded indirectly because they are not allowed to form ring systems). On the right, the atom must not be a terminal atom (hydrogen atoms are excluded indirectly because they are always terminal). B, breakable bonds (orange) as defined by the fragmentation rule and the resulting fragments (blue) based on the substrates 21 and 30 and their corresponding prenylation products 21a and 30a.

FIGURE 16. Distribution of docking scores for each of the 42 initially selected poses. Each distribution (represented as a box plot) shows 10 scores (circles)/compound. Boxes embrace 50% of the scores, and horizontal lines (whiskers) cover 99.3% of the scores. Circles with black diamonds, outliers.

TABLE 1 Cluster size and substrate space coverage derived from hierarchical clustering
FIGURE 5. Coverage of enzyme-substrate space. Orange squares indicate transformation yields >10%, and cornflower squares show a yield <10% for the corresponding enzyme-substrate combination. White squares, absence of data.