Mapping the transition state for a binding reaction between ancient intrinsically disordered proteins

Intrinsically disordered protein domains often have multiple binding partners. It is plausible that the strength of pairing with specific partners evolves from an initial low to higher affinity. However, little is known about the molecular changes in the binding mechanism that would facilitate such a transition. We previously showed that the interaction between two intrinsically disordered domains, NCBD and CID, likely emerged in an ancestral deuterostome organism as a low-affinity interaction that subsequently evolved into a higher-affinity interaction before the radiation of modern vertebrate groups. Here we map native contacts in the transition states of the low-affinity ancestral and high-affinity human NCBD/CID interactions. We show that the coupled binding and folding mechanism is overall similar, but with a higher degree of native hydrophobic contact formation in the transition state of the ancestral complex and more heterogeneous transient interactions, including electrostatic pairings, and an increased disorder for the human complex. Adaptation to new binding partners may be facilitated by this ability to exploit multiple alternative transient interactions while retaining the overall binding and folding pathway.


Introduction
Intrinsically disordered proteins (IDPs) are abundant in the human proteome (1) and are frequently involved in mediating protein-protein interactions in the cell. The functional advantages of disordered proteins include exposure of linear motifs for association with other proteins, accessibility for posttranslational modifications, formation of large binding interfaces and the ability to interact specifically with multiple partners. These properties make IDPs suitable for regulatory functions in the cell, for example as hubs in interaction networks governing signal transduction pathways and transcriptional regulation (2).
The ability to mutate while maintaining certain sequence characteristics might allow IDPs to more efficiently explore sequence space, facilitating adaptation to new binding partners. Despite recent appreciation of the biological importance of IDPs and progress in understanding their mechanism of interaction with other proteins, the changes that take place at a molecular level when these proteins evolve to accommodate new binding partners remain elusive. The reason for this is the inherent difficulty in assessing effects of mutations that a protein has acquired during millions or billions of years. However, the rapidly increasing number of available protein sequences from extant species has enabled the development of ancestral sequence reconstruction as a tool for inferring the evolutionary history of proteins (3). In combination with biophysical characterization of the resurrected proteins, this method has been employed successfully to deduce details in the evolution of numerous proteins (4)(5)(6)(7)(8)(9)(10). Ancestral sequence reconstruction relies on an alignment of sequences from extant species and a maximum likelihood method that infers probabilities for amino acids at each position in the reconstructed ancestral protein from a common ancestor.
However, because they face fewer constraints for maintaining a folded structure, IDPs are often thought to experience faster amino acid substitution rates during evolution and more frequent amino acid deletion and insertion events than folded proteins. In particular, deletions and insertions obstruct reliable sequence alignments of IDPs (11)(12)(13). Nevertheless, as pointed out before (14), IDPs do not constitute a homogeneous group of proteins and thus the degree of sequence conservation differs between different classes of IDPs. IDP regions that form binding interfaces in coupled binding and folding interactions are usually conserved because of sequence restraints for maintaining affinity and structure of the protein complex (15), and can therefore be subjected to ancestral sequence reconstruction.
TIF2 and ACTR, respectively). The interaction between NCBD and the CBP-interacting domain (CID) from NCOA3 (ACTR) has been intensively studied with kinetic methods to elucidate details about the binding mechanism (21-24). These protein domains interact in a coupled binding and folding reaction in which CID forms two to three helices that wrap around NCBD (18). The binding reaction involves several steps, as evidenced by the multiple kinetic phases observed in stopped-flow spectroscopy and single-molecule FRET experiments (24-26). It is, however, not clear which structural changes the individual kinetic phases correspond to. In fact, equilibrium data agree well with the Kd calculated from the major kinetic binding phase (22), in overall agreement with a two-state binding mechanism.
Previous phylogenetic analyses revealed that while NCBD was present already in the last common ancestor of all bilaterian animals (deuterostomes and protostomes), CID likely emerged later as an interaction domain within the ancestral ACTR/TIF2/SRC protein in the deuterostome lineage (including chordates, echinoderms and hemichordates; Fig. 1) (4,27). The evolution of the NCBD/CID interaction was previously examined using ancestral sequence reconstruction in combination with several biophysical methods to assess differences in affinity, structure and dynamics between the extant and ancestral protein complex. After the emergence of CID in the ancestral ACTR/TIF2/SRC protein, NCBD increased its affinity for CID while maintaining the affinity for some of its other binding partners (4). While the overall structure of the ancestral NCBD/CID complex, which we denote as "Cambrian-like", is similar to the modern high-affinity human one, there are several differences on the molecular level. Here, we address the molecular details of the evolutionary optimization of the binding energy landscape between NCBD and CID. More specifically, we have investigated evolutionary changes of the binding transition state by subjecting the Cambrian-like NCBD/CID complex to site-directed mutagenesis and kinetic measurements using fluorescence-monitored stopped-flow spectroscopy. The Cambrian-like complex consisted of the maximum likelihood estimate (ML) NCBD variant from the time of the divergence between deuterostomes and protostomes, NCBDD/P ML, and the CID variant from the time of the first whole genome duplication, CID1R ML. The experimental data were used to estimate the degree of native intermolecular tertiary contacts as well as helical content of CID in the transition state, expressed as phi (φ)-values. The φ-value, which commonly ranges from 0 to 1, reports on formation of native contacts in the transition state and was originally developed for folding studies (29).
However, it has also been used to gain detailed information about the transition state structures for several IDP binding reactions (30). Furthermore, to obtain an atomic-level description of the binding and folding process, the φ-values were used as restraints in molecular dynamics (MD) simulations. Our results suggest that while the overall binding mechanism and transition state are conserved, some modulation of helical content in the transition state and, most notably, an increased heterogeneity of transient intermolecular interactions is observed in the high-affinity modern human complex as compared to the low-affinity Cambrian-like complex.

Experimental strategy
We have characterized the transition state of binding for the low-affinity Cambrian-like NCBD/CID complex in terms of φ-values and compared it to that of the high-affinity modern human complex. A φ-value can be employed as a region-specific probe for native contact formation in the transition state of a binding reaction. A φ-value of 0 means that the entire effect on Kd from the mutation stems from a change in koff, indicating that the interactions with the mutated residue mainly form after the transition state barrier of binding. On the other hand, a φ-value of 1 is obtained when the effect of mutation on kon and Kd is equally large, which suggests that the mutated residue is making fully formed native contacts at the top of the transition state barrier. To characterize the transition state for binding, we performed extensive site-directed mutagenesis in the Cambrian-like complex and subjected each mutant to stopped-flow kinetic experiments to obtain binding rate constants (i.e., kon and koff) (Fig. 1c). These rate constants were used to compute φ-values for each mutated position in the complex (Fig. 2a). To provide more structural details about the evolution of the NCBD/CID interaction, we also determined the transition state (TS) ensemble of the Cambrian-like complex via φ-value-restrained MD simulations, following the same procedure previously used to obtain the TS ensemble of the human complex (23).
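The relation between the rate constants and the φ-value described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis pipeline used in the study: it assumes the standard definition φ = ∆∆GTS/∆∆GEQ with ∆∆GTS = RT ln(kon,WT/kon,mut), ∆∆GEQ = RT ln(Kd,mut/Kd,WT) and Kd = koff/kon. The function name, the argument names and the default temperature are hypothetical; the φ-value itself does not depend on the temperature chosen.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal mol^-1 K^-1


def phi_value(kon_wt, koff_wt, kon_mut, koff_mut, temp_k=298.15):
    """Return (phi, ddG_TS, ddG_EQ) for a binding reaction.

    Assumes Kd = koff / kon (two-state binding). The default temperature
    is a placeholder assumption, not a value taken from the study; it only
    scales the free energies and cancels in phi.
    """
    rt = R_KCAL * temp_k
    kd_wt = koff_wt / kon_wt
    kd_mut = koff_mut / kon_mut
    # Destabilization of the transition state relative to the free proteins.
    ddg_ts = rt * math.log(kon_wt / kon_mut)
    # Destabilization of the bound complex relative to the free proteins.
    ddg_eq = rt * math.log(kd_mut / kd_wt)
    if ddg_eq == 0.0:
        raise ValueError("mutation has no effect on Kd; phi is undefined")
    return ddg_ts / ddg_eq, ddg_ts, ddg_eq


# Hypothetical rate constants (kon in uM^-1 s^-1, koff in s^-1).
phi_koff_only, _, _ = phi_value(10.0, 1.0, 10.0, 5.0)  # only koff changes
phi_kon_only, _, _ = phi_value(10.0, 1.0, 2.0, 1.0)    # only kon changes
phi_negative, _, _ = phi_value(10.0, 1.0, 20.0, 4.0)   # kon and koff both increase
```

With these made-up inputs, a mutation that changes only koff gives φ = 0, one that changes only kon gives φ = 1, and one that increases both kon and koff gives a negative φ-value. When ∆∆GEQ is small, φ becomes very sensitive to experimental error, so in practice a lower cutoff on ∆∆GEQ is usually applied before a φ-value is reported.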
In the present study we used NCBDD/P T2073W as a pseudo wild type (Fig. 1e; denoted NCBDD/P pWT) to obtain a sufficient signal change in the stopped-flow fluorescence measurements. While Trp pseudo wild types have been shown to be reliable in protein folding studies (31), it is less clear how much an engineered Trp would affect an IDP complex in a coupled binding and folding reaction. Therefore, we initially tested five different Trp mutants of NCBDD/P ML to select the one with properties most similar to NCBDD/P ML (Fig. S1).
To check how robust our φ-values were to the position of the Trp probe, we measured five φ-values with one of the other Trp variants, NCBDD/P L2067W. Despite a two-fold lower kon as compared to NCBDD/P pWT (i.e., NCBDD/P T2073W), all five φ-values were very similar, including the negative φ-value obtained for D1068A (Table 3), suggesting that our experimental φ-values are robust and not dependent on the optical probe.
We constructed two sets of mutants in this study to characterize the transition state for binding: one that targeted native contacts in the binding interface and a second that targeted native helices in CID. The first set consisted of 13 NCBDD/P pWT variants and 10 CID1R ML variants mainly corresponding to previously mutated residues in the human complex (21, 23). The positions spanned the entire binding interface of NCBD and CID to ensure that all regions of the protein domains were probed (Table 1). The majority of the mutations targeted interactions between hydrophobic residues in the binding interface; however, a few mutations also probed interactions between charged residues. The second set of mutants consisted of Ala→Gly substitutions in helix 1 and 2 of CID1R ML, as well as helix 2 and 3 of CIDHuman, complementing a previous data set for helix-modulating mutations in helix 1 of CIDHuman (32). The secondary structure content of all mutants was assessed with far-UV CD. All CID variants exhibited far-UV CD spectra typical for highly disordered proteins (Fig. S2a-b). For NCBD, the far-UV CD measurements showed that NCBDD/P L2070A, NCBDD/P L2090A, NCBDD/P L2096A and NCBDD/P A2099G displayed substantially less α-helical structure than NCBDD/P pWT, as judged from the lower magnitude of the CD signal at 222 nm (Fig. S2c). Addition of 0.7 M trimethylamine N-oxide (TMAO) to the experimental buffer for these variants resulted in an increase in helical content such that the far-UV CD spectra of NCBDD/P L2070A and NCBDD/P L2096A were qualitatively similar to NCBDD/P pWT (Fig. S2d). For NCBDD/P L2096A and NCBDD/P A2099G, stopped-flow kinetic experiments were carried out both in regular buffer and in buffer supplemented with 0.7 M TMAO. The resulting φ-values were very similar for these variants at both conditions, demonstrating both the accuracy and precision of the φ-values and their robustness to different experimental conditions (Fig. S3).
The complex of the NCBDD/P L2070A variant was too unstable in buffer, and data were recorded only in the presence of 0.7 M TMAO. Several mutants failed to generate kinetic data due to elevated kobs values, which were too high to be reliably recorded with the stopped-flow technique. This was the case for NCBDD/P L2074A, NCBDD/P L2086A, NCBDD/P L2090A, NCBDD/P I2101A, CID1R L1056A, CID1R L1064A, CID1R I1067A and CID1R L1071A.
One caveat with φ-value analysis is the assumption that ground states are not affected by mutation. Hence, the strategy of using conservative deletion mutations is important (33). For example, mutation to a larger residue is not considered conservative, and one of the Trp variants that we tested, NCBDD/P S2078W, displayed a very low kon (3.5 µM⁻¹ s⁻¹) and a 2-fold lower koff than the other NCBD variants (Fig. S1). We can only speculate about the structural basis for this result, but since the residue is situated in the loop between Nα1 and Nα2 it might either lock the two helices in relation to each other or flip over, cover the binding groove and thus block access for CID. In either case we have a clear effect on the ground state. The problem of ground state changes may be particularly pertinent for IDPs, since they are more malleable in terms of structural changes to accommodate binding partners. It is beyond the scope of this study to perform extensive structure determination of all site-directed mutants, but we recorded HSQC spectra for one complex containing a typical deletion mutation and an intermediate φ-value, NCBDD/P L2067A with CID1R ML (Fig. S4a-b). The spectra were perturbed by the mutation, with a less marked effect for CID1R ML. Considering that the amide proton and nitrogen resonances are very sensitive, the spectra, together with our CD data, suggest that the structure of the complex between CID1R ML and NCBDD/P L2067A is compact and stable and possibly similar to the complex between CID1R ML and NCBDD/P pWT.
The transition state of the Cambrian-like complex is more native-like than that of the extant human complex.
First, we assessed interactions formed by hydrophobic side chains in the binding interface of the ancestral Cambrian-like complex by computing φ-values for each mutant (Table 1). The resulting φ-values were mainly in the intermediate category, ranging from 0.3 to 0.6 (Fig. 2a). This result was in contrast with previously published low φ-values for the human complex (21), which ranged between 0 and 0.3 for similar conservative deletion mutations (Fig. 2b). A notable exception was the CIDHuman L1055A variant, which displayed a φ-value of 0.85. This residue is located in the first α-helix, which is transiently populated in the free state of CID and forms many native contacts with NCBD in the transition state (23, 34). The CID1R L1055A variant in the Cambrian-like complex displayed a similarly high φ-value (0.83), suggesting that this region forms a conserved nucleus for the coupled binding and folding of CID/NCBD. Further comparison between φ-values in the Cambrian-like and human complex on a site-by-site basis shows that three mutations in particular, NCBD L2067A, NCBD L2087A and NCBD L2096A, displayed large differences in φ-values between the two complexes, with the Cambrian-like complex always showing higher φ-values. The differences suggest rearrangements of native contacts in the transition state, resulting in a more disordered transition state for the human complex (Fig. 2c).
Accordingly, MD simulations showed that the Cambrian-like TS, as compared to the human one, is significantly more compact (as judged by gyration radius, Fig. 3e) and less heterogeneous (as judged by pairwise RMSD, Fig. 3d), supporting the idea that the TS for formation of the Cambrian-like complex is more native-like (Supplementary Video S1).
The higher NCBD φ-values (Table S1 and Fig. 2c) measured for the ancestral complex resulted in differences in both the NCBD secondary/tertiary structure and the intermolecular interactions in the TS. In the ancestral TS, the NCBD helix Nα3 is totally unfolded (consistent with its lower helicity also in the native state) (28), while helices Nα1 and Nα2 are well formed and maintain a native-like relative orientation with numerous Nα1-Nα2 contacts (Fig. 3c and 3b, box 1). On the other hand, in the human TS, Nα1 and Nα2 show lower helical content and contacts are less frequent. The main intermolecular contacts are conserved in the human and Cambrian-like TS: in both cases we observed a stable hydrophobic core (Fig. 3b, boxes 2), involving Cα1 (via residues Leu1052 and Leu1055, the latter displaying a high φ-value in both complexes) and Nα1 (via residues Leu2071, Leu2074 and Lys2075), but with a slightly different orientation of the two helices. A second relevant interacting region involves the residues of the unstructured CID helices Cα2-Cα3, which contact NCBD in both TS ensembles. In the ancestral complex, hydrophobic residues of the Nα2 helix are preferred, while in the human TS the interactions are more dispersed, involving also the longer and structured helix Nα3 (Fig. 3b, box 3). The identified interactions are highly native-like in the case of the ancestral complex (Fig. S5) as compared to the human one (23), where a higher number of transient contacts is observed (Fig. S6). We note that the Brønsted plot, where ∆∆GTS is plotted versus ∆∆GEQ, appears more scattered for the Cambrian-like complex as compared to the human complex and previously characterized IDP systems. A salient feature of a nucleation-condensation mechanism in protein folding, where all noncovalent interactions form cooperatively, is a clear linear dependence in the Brønsted plot, while a scattered Brønsted plot suggests a less cooperative binding mechanism with, for example, pre-formed structure.
However, the narrow range of ∆∆GEQ values for Cambrian-like NCBD/CID precluded a conclusive comparison of the scatter in the Brønsted plots of the human and Cambrian-like complex.
The mechanism of helix formation in CID has been well-conserved during evolution.
The binding reaction of NCBD and CID is associated with a dramatic increase in secondary structure content of CID. Helical propensity of the N-terminal helix of CID correlates positively with affinity for NCBD (32), and modulation of helical propensity is likely an important evolutionary mechanism for tuning affinities of interactions involving IDPs. To investigate native helix formation in the transition state of the Cambrian-like complex, we introduced Ala→Gly mutations at surface-exposed positions in helix 1 of CID1R ML (Cα11R) and in helix 2 (Cα21R). The mutants were subjected to binding experiments and the rate constants were used to compute φ-values for each CID variant in complex with NCBDD/P pWT (Table 1). The φ-values for Cα11R ranged from 0.2 to 0.3 and were lower, close to 0.1, for Cα21R (Fig. 4a). Brønsted plots resulted in slopes for Cα11R and Cα21R of 0.3 and 0, respectively (Fig. 4b). The slope of the Brønsted plot can be regarded as an average φ-value and thus suggests around 30% and 0% native helical content of Cα11R and Cα21R, respectively, in the transition state for the Cambrian-like complex.
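The interpretation of a Brønsted slope as an average φ-value can be illustrated with an ordinary least-squares fit of ∆∆GTS against ∆∆GEQ. The sketch below is only an illustration under stated assumptions: the function name is hypothetical, the fit includes a free intercept (the fitting model used for the published plots is not specified here), and the data points are invented for demonstration and are not measured values from this study.

```python
def bronsted_slope(ddg_eq, ddg_ts):
    """Least-squares slope and intercept of ddG_TS versus ddG_EQ.

    The slope can be read as an average phi-value for the mutated region.
    A free intercept is an assumption of this sketch.
    """
    n = len(ddg_eq)
    if n < 2 or n != len(ddg_ts):
        raise ValueError("need two equally long lists with at least 2 points")
    mean_x = sum(ddg_eq) / n
    mean_y = sum(ddg_ts) / n
    sxx = sum((x - mean_x) ** 2 for x in ddg_eq)
    if sxx == 0.0:
        raise ValueError("all ddG_EQ values are identical; slope undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ddg_eq, ddg_ts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented free energies (kcal/mol) in which every mutation has phi = 0.3.
example_ddg_eq = [0.5, 1.0, 1.5, 2.0]
example_ddg_ts = [0.15, 0.30, 0.45, 0.60]
example_slope, example_intercept = bronsted_slope(example_ddg_eq, example_ddg_ts)
```

For this idealized data set the fitted slope is 0.3 with an intercept of zero, which corresponds to about 30% native helical content in the transition state. A narrow spread of ∆∆GEQ values makes the slope poorly determined, which is why the range of ∆∆GEQ matters when Brønsted plots are compared.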
To facilitate a direct comparison with the extant human complex, we extended a previously published data set on helix 1 from human CID (Cα1Human) (32) with new Ala→Gly mutations in helix 2 (Cα2Human) and helix 3 (Cα3Human) (Table 2). The human complex displayed φ-values for helix formation in Cα1Human that ranged from 0.3 to 0.7, whereas all φ-values in Cα2Human and Cα3Human were close to 0 (Fig. 4c). This resulted in Brønsted plots with slopes of 0.5 for Cα1Human (32) and virtually 0 for Cα2Human/Cα3Human (Fig. 4d). Thus, in both the Cambrian-like and human complexes, the N-terminal Cα1 plays an important role in forming early intramolecular native secondary structure contacts in the disorder-to-order transition. These data are well represented by the ancestral TS ensemble. The probability of α-helix content in CID, measured via DSSP (35) and averaged over the whole ensemble, shows that only helix Cα1 is partially formed in the TS, while other CID helices are mostly unstructured (Fig. 3c). Analogously, in the TS of the human complex only helix Cα1 was folded, overall supporting the importance of Cα1 formation for NCBD binding.
We note that according to Brønsted plots, Cα1Human has a slightly higher helical content in the transition state compared to Cα11R, consistent with a higher helical propensity for Cα1Human than for Cα11R, as suggested by predictions using AGADIR (4). We further note that the A1075G mutation in Cα3Human did not display a large effect on Kd, which precluded a reliable estimation of a φ-value. This could indicate either that Cα3Human contributes little to the stability of the bound complex or that the Ala→Gly substitution at this position promotes an alternative conformation that binds with the same affinity as the wildtype protein. Similarly, Val1077→Ala in Cα3Human was shown previously to have a small positive effect on the affinity for NCBD (21), suggesting structural re-arrangement in the bound state.

Role of a conserved and buried salt-bridge in the Cambrian-like complex.
Long-range electrostatic interactions promote association of proteins and play a major role in IDPs. Mutation of a conserved salt-bridge between Arg2104 in NCBD and Asp1068 in CID was previously shown to have large effects on the kinetics of complex formation for human NCBD/CID, namely a 10-fold reduction in kon and the occurrence of a new kinetic phase (kobs ≈ 15-20 s⁻¹). Analysis of the kinetic data favored an induced fit model, that is, a conformational change after binding (24, 36, 37). We generated the protein variants NCBDD/P R2104M and CID1R D1068A to assess the role of this salt-bridge in the ancestral Cambrian-like complex.
NCBDD/P R2104M and CID1R D1068A displayed clear biphasic kinetic traces in the stopped-flow experiments, similarly to experiments with the corresponding mutants in the human complex. Fitting of the kinetic data to obtain kobs values revealed one concentration-dependent kinetic phase, which increased linearly with CID concentration, and a second kinetic phase, which was constant at kobs ≈ 16 s⁻¹ over the entire concentration range. Thus, the kinetic data set for the complex between NCBDD/P R2104M and CID1R D1068A was fitted globally to an induced fit mechanism to obtain estimates of the microscopic rate constants. The comparison between the Cambrian-like and human complex revealed that the effect of mutating the buried salt-bridge on the association rate constant k1 was much smaller for the Cambrian-like complex than for the human complex, less than 2-fold versus 20-fold, respectively. On the other hand, the slow phase was similar for both the human and ancestral complex, with a kobs (= k2 + k-2) of 15-20 s⁻¹. However, global fitting suggested that the alternative conformation of the bound state is only slightly populated for the Cambrian-like complex (k-2 >> k2). This was corroborated by ITC measurements, which showed that the overall Kd (5.1 ± 0.3 µM) is highly consistent with k-1/k1 (4.8 µM). Interestingly, while the salt-bridge is significantly populated in the native state simulations, it is not populated in either the Cambrian-like or the human TS ensemble, suggesting that the formation of this interaction is not relevant for the initial recognition. Nonetheless, these residues promote the association of the human complex, most likely via unspecific long-range interactions.
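The argument that a weakly populated alternative bound conformation leaves the overall Kd close to k-1/k1 follows from the induced fit scheme A + B ⇌ AB ⇌ AB*, for which the overall Kd equals (k-1/k1)/(1 + k2/k-2). A minimal Python sketch of this relation is given below. The function names are hypothetical, and the microscopic rate constants in the example are round, invented numbers chosen only to satisfy k-2 >> k2; they are not the globally fitted values from this study.

```python
def induced_fit_overall_kd(k1, k_minus1, k2, k_minus2):
    """Overall Kd for A + B <-> AB <-> AB* (induced fit).

    Both bound species, AB and AB*, count as complex:
    Kd = (k_minus1 / k1) / (1 + k2 / k_minus2).
    """
    kd_step1 = k_minus1 / k1          # dissociation constant of the binding step
    k_conf = k2 / k_minus2            # equilibrium constant of the conformational step
    return kd_step1 / (1.0 + k_conf)


def slow_phase_kobs(k2, k_minus2):
    """Concentration-independent slow phase, kobs = k2 + k_minus2, as in the text."""
    return k2 + k_minus2


# Invented rate constants: k1 in uM^-1 s^-1, all others in s^-1.
example_kd = induced_fit_overall_kd(k1=10.0, k_minus1=50.0, k2=1.0, k_minus2=15.0)
example_kobs = slow_phase_kobs(k2=1.0, k_minus2=15.0)
```

With these illustrative numbers the slow phase is 16 s⁻¹, k-1/k1 is 5 µM and the overall Kd is about 4.7 µM, so the conformational step lowers the Kd by only a few percent. This is the behavior expected when k-2 >> k2, and it is why agreement between an equilibrium Kd and k-1/k1 indicates a sparsely populated alternative bound state.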
Furthermore, we mutated Asp1053 in CID1R ML to Ala, to assess potential salt-bridge formation between this residue and Arg2104 in NCBDD/P ML .
The complex between NCBDD/P R2104M and CID1R D1053A was very destabilized and the kinetic phase that reported on the binding event was too fast for the stopped-flow instrument. However, the concentration-independent kinetic phase was detected (kobs ≈ 20-40 s⁻¹). Single charge mutations cannot be considered conservative since they may result in unpaired charges in or close to hydrophobic interfaces. Any effects from such mutations may also be due to nonspecific charge-charge attraction or repulsion. (The overall charge of NCBDD/P is positive and that of CID1R is negative.) Nevertheless, we report kinetic data for such single mutants. Interestingly, CID1R D1068A displayed a negative φ-value (ca. -0.4, Table 1 and Table 3) due to an increase in both kon and koff upon mutation, suggesting that Asp1068 makes a non-favorable interaction in the transition state. This is consistent with the small effect on kon for NCBDD/P R2104M /CID1R D1068A, which might result from opposing effects on kon by the respective mutations. The other Asp mutant, CID1R D1053A, also displayed an increase in kon but a high positive φ-value. Thus, Asp1053 forms non-favorable interactions both in the transition state and in the native state of the complex. The NCBDD/P K2075M variant gives a high φ-value, suggesting a native interaction in the transition state. While Asp1053 is in the vicinity of Lys2075, the coupling free energy between them is low (0.17 kcal/mol) and it is not clear what interactions these residues make in the native complex. All surface charge mutations had little effect on kinetics or yielded low φ-values, suggesting that overall charge plays a minor role in the association of NCBDD/P and CID1R (Table 1), and similarly for human NCBD/CID (Table 2).
In agreement with these data, simulations supported the idea that hydrophobic interactions are more relevant than electrostatic contacts in the TS for formation of the ancestral complex. In fact, no stable salt-bridges or hydrogen bonds were observed, with Arg2104 contacting different polar residues only in a transient manner. Also, the high φ-value of the NCBDD/P K2075M variant can be explained by the ability of Lys2075 to engage in hydrophobic, rather than polar, interactions stabilizing the native-like hydrophobic core formed by helices Cα1-Nα1 (Fig. 3b and S5).

Discussion
The higher prevalence of IDPs among eukaryotes as compared to prokaryotes suggests that these proteins have played an important role in the evolution of complex multicellular organisms (38,39). IDPs often participate in regulatory functions in the cell, by engaging in complex interaction networks that fine-tune cellular responses to environmental cues (40). One feature common to many IDPs, and which has likely contributed to their abundance in regulatory functions, is the ability to interact specifically with several partners that are competing for binding (41). NCBD is an archetypal example of such a disordered protein interaction domain that has evolved to bind several cellular targets, including transcription factors and transcriptional co-regulators (18, 19, 42). Every time a new partner was included in the repertoire, NCBD somehow adapted its affinity for the new ligand while maintaining affinity for already established one(s), as occurred around 450-500 Myr ago, when the interaction between NCBD and CID was established (4). On a molecular level, it is intriguing how such multi-partner protein domains evolve. In the present study, we have extended our structural studies (28) and investigated the evolution of the binding mechanism using site-directed mutagenesis, φ-value analysis and restrained MD simulations to shed light on changes occurring at the molecular level when the low-affinity Cambrian-like NCBD evolved higher affinity for its protein ligand CID.
Recent work suggests that IDPs can adopt multiple strategies for recognizing their partners. Gianni and co-workers proposed the concept of templated folding (43), where the folding of the IDP is modulated, or templated, by its binding partner, as shown for cMyb/KIX (44), MLL/KIX (45) and NTAIL/XD (46,47). Similar ideas were put forward by Zhou and co-workers based on experiments on WASP GBD/Cdc42 and formulated in terms of multiple dock-and-coalesce pathways (48). On the other hand, studies on disordered domains from BH3-only proteins binding to BCL-2 family proteins suggest conservation of φ-values and a more robust folding mechanism (49). We recently showed by double mutants and simulation that a high plasticity in terms of formation of native hydrophobic interactions in the transition state exists for human NCBD/CID (23), where both partners are very flexible, in agreement with templated folding.
In the present study, by comparing the TSs for formation of human and Cambrian-like NCBD/CID complexes, we demonstrate that, while similar core interacting regions have been conserved throughout evolution, the interaction between the two proteins has evolved from a more ordered ancestral TS to the heterogeneous and plastic behavior observed in the human complex. We find that the fraction of CID helical content in the transition state is overall conserved, with intermediate values in Cα1 (slightly higher in human than in Cambrian-like NCBD/CID) and low values in Cα2/3. Conversely, we observe that the transition state of the low-affinity Cambrian-like complex has more native-like features in terms of hydrophobic interactions (higher φ-values) as compared to the human one, with clear site-specific differences such as residues Leu2067, Leu2087 and Leu2096 of NCBD. In the ancestral TS, fewer but more native-like contacts are required to be formed (Fig. S5 and S6) and proper NCBD tertiary structure (regulating Nα1-Nα2 orientation) is achieved before CID binding. In contrast, in the TS of human NCBD/CID numerous transient intermolecular interactions are engaged (Fig. S6), involving a large number of residues of both CID and NCBD.
There is always uncertainty in reconstructed ancient sequences. It is therefore important to assess whether the conclusions are robust to the inevitable errors present in reconstructed ancient sequences. In the present case we have not performed a φ-value analysis for alternative reconstructed NCBD or CID variants, due to the extensive experimental effort. However, the overall similar transition states of the human and ancient complexes suggest that the overall mechanism is robust, and that further point mutations would not dramatically change this picture. Furthermore, the five φ-values determined with an alternative Trp probe (Trp2067) were very similar to those with Trp2073 (Table 3), also demonstrating that the mechanism is robust to mutation. Finally, the NCBD/CID affinity is relatively constant across the deuterostome animal kingdom (Fig. 1) despite several differences in the primary structure (4, 27) and it is therefore likely that the binding mechanism is also conserved.
In the homeodomain family of proteins a spectrum of folding mechanisms was previously observed (50) ranging from nucleation-condensation to diffusion-collision (51). Furthermore, it has been suggested that the two mechanisms can be related to the balance between hydrophobic and electrostatic interactions (52). Mutational studies on IDPs (21, 44, 45, 53-58) are more or less consistent with apparent two-state kinetics and the nucleation-condensation mechanism of globular proteins (59), i.e. cooperative, simultaneous formation of all non-covalent interactions around one well defined core. Salient features of this mechanism are linear Brønsted plots and fractional f-values. One alternative mechanism would be independently folding structural elements, which dock to form the tertiary structure as formulated in the diffusion-collision model (51). In such scenario, Brønsted plots would be more scattered and the f-values be both low and high and clustered in structurally contiguous contexts and even separated into two or more folding nuclei. Our comparison of secondary and tertiary structure formation in human versus Cambrian-like NCBD/CID is therefore interesting since it shows that the folding of certain elements of secondary structure can be distinct from others, and that they may or may not be part of an extended folding nucleus. The Brønsted plot for Cambrian-like NCBD/CID shows a larger scatter than that for human NCBD/CID (Fig.  2d). Whereas Ca1 of both CID1R ML and human CID displays fractional f-values, Ca2/3 have fvalues of zero (Fig. 4). Thus, Ca1 may function as a well-defined folding nucleus around which remaining structure condensate, as observed for high-affinity human NCBD. However, Ca1 may also be part of a more extended folding nucleus together with hydrophobic tertiary interactions as in low-affinity Cambrian-like NCBD/CID (Fig. 2-4), but not to the extent that we define it as two separate folding nuclei. 
The Cambrian-like NCBD/CID therefore shows a behavior intermediate between the nucleation-condensation and diffusion-collision mechanisms, which shifted towards nucleation-condensation during evolution. A similar diffusion-collision-like mechanism was recently found in the interaction between disordered YAP and TEAD (60). However, three other IDP interactions with more than one helical segment, Hif-1α CAD (58), TAD-STAT2 (57), and pKID (61) (all binding to KIX), do not display this behavior, but show mainly low Φ-values (<0.2) with only one or a few higher ones. Thus, so far, nucleation-condensation appears more prevalent for globular protein domains (62), as well as for IDPs in disorder-to-order transitions. It will be interesting to see whether other IDPs with several secondary structure elements display any distinct distribution of Φ-values.

Materials and methods
Ancestral and human protein sequences. The reconstruction of ancestral sequences of NCBD (from the CREBBP/p300 protein family) and CID (from the NCOA/p160/SRC protein family) has been described in detail before (4). Briefly, protein sequences in these families from various phyla were aligned and the ancestral protein sequences of NCBD and CID were predicted using a maximum likelihood (ML) method. The ML ancestral protein variants of NCBD and CID, NCBDD/P ML and CID1R ML , were used as "wildtypes" in this study. The human NCBD protein was composed of residues 2058-2116 from human CREBBP (UniProt ID: Q92793) and the human CID protein was composed of residues 1018-1088 from human NCOA3/ACTR (UniProt ID: Q9Y6Q9), in accordance with previous studies on the human protein domains (21, 24, 63). The reconstructed ancestral sequences of NCBD and CID were shortened to contain only the evolutionarily more well-conserved regions that form a well-defined structure upon association with the other domain. Thus, the ancestral NCBD variant was composed of residues corresponding to 2062-2109 in human CREBBP and the ancestral CID variant was composed of residues corresponding to 1040-1081 in human NCOA3/ACTR.

Cloning and mutagenesis.
The cDNA sequences for the protein variants used in the study were purchased from GenScript and the proteins were N-terminally tagged with a 6xHis-Lipo domain. The mutants were generated using a whole-plasmid PCR method. The primers were typically two complementary 33-mer oligonucleotides with mismatching bases at the site of the mutation, flanked on each side by 15 complementary bases. The annealing temperature in the PCR reactions was between 55 and 65 °C and the reactions were run for 20 cycles. The products were transformed into E. coli XL-1 Blue Competent Cells and selected on LB agar plates with 100 µg/mL ampicillin. The plasmids were purified using the PureYield™ Plasmid Miniprep System (Promega).
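The primer geometry described above (15 matching bases, the mismatched site, 15 matching bases) can be sketched in Python. This is an illustration only, not the authors' design procedure: the function names, the example template, and the assumption that the mismatched centre is a single 3-nt codon (inferred from 33 = 15 + 3 + 15) are all hypothetical.

```python
# Hypothetical sketch of a complementary mutagenic primer pair.
# Assumption: the mismatched centre is one 3-nt codon, so 15 + 3 + 15 = 33 nt.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def mutagenic_primers(template, codon_start, new_codon, flank=15):
    """Build a forward primer carrying new_codon and its complementary partner.

    template    : coding-strand sequence of the region to mutate
    codon_start : 0-based index of the first base of the codon to replace
    new_codon   : the three bases that encode the desired residue
    flank       : number of template-matching bases on each side
    """
    template = template.upper()
    if len(new_codon) != 3:
        raise ValueError("new_codon must be exactly three bases")
    left = codon_start - flank
    right = codon_start + 3 + flank
    if left < 0 or right > len(template):
        raise ValueError("not enough flanking sequence around the codon")
    forward = (template[left:codon_start]
               + new_codon.upper()
               + template[codon_start + 3:right])
    return forward, reverse_complement(forward)


# Made-up 40-nt template; TGG encodes Trp.
fwd, rev = mutagenic_primers("ACGT" * 10, codon_start=18, new_codon="TGG")
print(len(fwd), fwd)
print(len(rev), rev)
```

The two primers are exact reverse complements of each other, which is what allows them to anneal to opposite strands of the plasmid at the same site.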

Protein expression and purification.
The plasmids encoding the protein constructs were transformed into E. coli BL-21 DE3 pLysS (Invitrogen) and selected on LB agar plates with 35 µg/mL chloramphenicol and 100 µg/mL ampicillin. Colonies were used to inoculate LB media with 50 µg/mL ampicillin and the cultures were grown at 37 °C to reach OD600 0.6-0.7 prior to induction with 1 mM isopropyl β-D-1-thiogalactopyranoside and overnight expression at 18 °C. The cells were lysed by sonication and centrifuged at approximately 50,000 g to remove cell debris. The lysate was separated on a Ni Sepharose 6 Fast Flow (GE Healthcare) column using 30 mM Tris-HCl pH 8.0, 500 mM NaCl as the binding buffer and 30 mM Tris-HCl pH 8.0, 500 mM NaCl, 250 mM imidazole as the elution buffer. The 6xHis-Lipo tag was cleaved off using thrombin (GE Healthcare) and the protein was separated from the cleaved tag using the same column and buffers as described above. Lastly, the protein was separated on a RESOURCE™ reversed-phase chromatography column (GE Healthcare) using a 0-70 % acetonitrile gradient. The purity of the protein was verified by the single-peak appearance of the chromatogram or by SDS-PAGE. The identity was verified by MALDI-TOF mass spectrometry. The fractions containing pure protein were lyophilized and the concentration of the protein was measured by absorption spectrometry at 280 nm for variants that contained a Tyr or Trp residue. For the proteins that lacked a Tyr or Trp residue, absorption at 205 nm was used to estimate the concentration. The extinction coefficient for human CID was previously determined by amino acid analysis to be 250,000 M⁻¹ cm⁻¹ at 205 nm. For the shorter ancestral variants, the extinction coefficient was calculated based on the amino acid sequence (64).
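Converting an absorbance reading into a concentration with these extinction coefficients follows the Beer-Lambert law, c = A / (ε · l). The short Python sketch below illustrates the arithmetic for human CID at 205 nm; the absorbance value and the 1 cm path length are made-up inputs, and the function name is hypothetical.

```python
def molar_concentration(absorbance, epsilon, path_cm=1.0):
    """Beer-Lambert law: concentration (M) = A / (epsilon * path length).

    epsilon is in M^-1 cm^-1 and path_cm in cm.
    """
    if epsilon <= 0 or path_cm <= 0:
        raise ValueError("epsilon and path length must be positive")
    return absorbance / (epsilon * path_cm)


# Extinction coefficient quoted in the text for human CID at 205 nm.
EPS_205_HUMAN_CID = 250_000.0  # M^-1 cm^-1

# Hypothetical reading: A205 = 0.50 in a 1 cm cuvette (assumed values).
conc_molar = molar_concentration(0.50, EPS_205_HUMAN_CID)
print(f"{conc_molar * 1e6:.2f} µM")
```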

Design and evaluation of NCBDD/P ML Trp variants.
In order to perform fluorescence-monitored stopped-flow kinetic experiments, a fluorescent probe is required. As both NCBD and CID lack Trp residues, which provide the best sensitivity in fluorescence-monitored experiments, several NCBDD/P ML variants with Trp residues introduced at different positions were constructed: NCBDD/P L2067W, NCBDD/P T2073W, NCBDD/P S2078W, NCBDD/P H2107W and NCBDD/P Q2108W. The NCBDD/P Q2108W variant corresponds to the NCBDHuman Y2108W variant, which was used previously as a "pseudo-wildtype" in stopped-flow kinetic experiments (21, 23, 24). These NCBDD/P ML Trp variants were assessed based on secondary structure content and stability of the complex with CID using far-UV circular dichroism (CD) spectroscopy, and by kinetic and equilibrium parameters from stopped-flow fluorescence spectroscopy and isothermal titration calorimetry (ITC), in order to find an engineered NCBDD/P ML variant with biophysical properties similar to those of the wildtype NCBDD/P ML (Fig. S1). The NCBDD/P T2073W variant displayed the most similar behavior to NCBDD/P ML: its structural content, complex stability and affinity were highly similar to those of NCBDD/P ML (Fig. S1). Thus, our data validated the use of the NCBDD/P T2073W variant as a representative of NCBDD/P ML, and all stopped-flow kinetic experiments for the ancestral Cambrian-like complex were performed using the "pseudo-wildtype" NCBDD/P T2073W variant, which we denote NCBDD/P pWT.

Stopped-flow spectroscopy and calculation of Φ-values.
The kinetic experiments were conducted using an upgraded SX-17MV Stopped-flow spectrofluorometer (Applied Photophysics). The excitation wavelength was set to 280 nm and the emitted light was detected after passing through a 320 nm long-pass filter. All experiments were performed at 4 °C and the default buffer for all experiments was 20 mM sodium phosphate pH 7.4, 150 mM NaCl. In order to promote secondary and tertiary structure formation of some structurally destabilized NCBD mutants, the experimental buffer was supplemented with 0.7 M trimethylamine N-oxide (TMAO; Fig. S2c-d).
Typically, in kinetic experiments the concentration of NCBD was kept constant at 1-2 µM and the concentration of CID was varied between 1 and 10 µM. Experiments in which the concentration of CID was kept constant while that of NCBD was varied were also performed to check the consistency of the results. In these experiments, CID was held constant at 2 µM and NCBD was varied between 2 and 10 µM. The kinetic binding curves with NCBD in excess were in good agreement with those with CID in excess, but since the quality of the kinetic traces was better when CID was in excess, these experiments were used to determine the kinetic parameters for the different variants reported in the paper. All experiments were performed using the pseudo-wildtype NCBD variants, NCBDD/P T2073W and NCBDHuman Y2108W, which are denoted NCBDD/P pWT and NCBDHuman pWT, respectively.
Each stopped-flow trace consisted of 1000 sampled data points (data points before 0.002 s were removed because of insufficient mixing within this time period). Each kinetic trace is typically an average of 4-10 individual traces (i.e. 4-10 technical replicates). When the data were fitted globally, the large number of data points resulted in underestimated standard errors. To correct for this, we determined the error in kon from five biological replicates (i.e. experiments performed with different protein batches), which was 10 % for the NCBDD/P pWT /CID1R ML complex. This relative error was then applied to kon for all other protein variants.
The dissociation rate constant koff can be determined from binding experiments, but the accuracy decreases whenever kobs values are much larger than koff or when koff is very low. In the present study, displacement experiments were performed for protein complexes with koff values below 30 s⁻¹ in binding experiments. In displacement experiments, an unlabeled NCBD variant (without Trp) was used to displace the different NCBDD/P pWT or NCBDHuman pWT variants from the complexes. The kobs value at 20-fold excess of the unlabeled NCBD variant was taken as an estimate of the dissociation rate constant, koff. The Φ-values for each mutant were computed from the rate constants obtained in the stopped-flow measurements (i.e. kon and koff) using Equations 1-3.
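Equations 1-3 are not reproduced in this section, so the following Python sketch assumes the conventional definitions used in Φ-value analysis of binding reactions: ΔΔG_TS = RT ln(kon,WT / kon,mut), ΔΔG_Eq = RT ln(Kd,mut / Kd,WT) with Kd = koff / kon, and Φ = ΔΔG_TS / ΔΔG_Eq. The function name and the example rate constants are hypothetical, and the paper's own equations may differ in detail.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal mol^-1 K^-1


def phi_value(kon_wt, koff_wt, kon_mut, koff_mut, temp_k=277.15):
    """Return (phi, ddG_TS, ddG_Eq) under the assumed conventional definitions.

    kon values share one unit and koff values share one unit; only their
    ratios enter the calculation. temp_k defaults to 4 degrees C, the
    temperature of the kinetic experiments.
    """
    rt = R_KCAL * temp_k
    ddg_ts = rt * math.log(kon_wt / kon_mut)   # transition-state destabilization
    kd_wt = koff_wt / kon_wt
    kd_mut = koff_mut / kon_mut
    ddg_eq = rt * math.log(kd_mut / kd_wt)     # complex destabilization
    if ddg_eq == 0.0:
        raise ValueError("mutation does not change Kd; phi is undefined")
    return ddg_ts / ddg_eq, ddg_ts, ddg_eq


# Hypothetical rate constants (not taken from the paper's tables).
phi, ddg_ts, ddg_eq = phi_value(kon_wt=20.0, koff_wt=10.0,
                                kon_mut=10.0, koff_mut=20.0)
print(f"phi = {phi:.2f}, ddG_TS = {ddg_ts:.2f}, ddG_Eq = {ddg_eq:.2f} kcal/mol")
```

Under these definitions a mutation that changes only koff gives Φ = 0 (no native contact at the transition state), whereas one that changes only kon gives Φ = 1.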

Isothermal titration calorimetry.
Isothermal titration calorimetry measurements were performed at 25 °C in a MicroCal iTC200 System (GE Healthcare). The proteins were dialyzed simultaneously in the same experimental buffer (20 mM sodium phosphate pH 7.4, 150 mM NaCl) in order to reduce buffer mismatch. The concentration of NCBD in the cell was 12-50 µM (depending on variant) and the concentration of CID in the syringe was 120-500 µM, depending on the NCBD concentration, such that a 1:2 stoichiometry was achieved at the end of each experiment. The data were fitted to a two-state binding model using the built-in software.

Data analysis using numerical integration.
The stopped-flow kinetic data sets were fitted using the KinTek Explorer software (KinTek Corporation) (66,67). The software employs numerical integration to simulate and fit reaction profiles directly to a mechanistic model. Scaling factors were used to correct for small fluctuations in lamp intensity and errors in concentration, but they were generally close to 1. In cases where the signal-to-noise ratio of the data was low, scaling factors were not applied. For some models with more than one step, two-dimensional confidence contour plots were computed to assess confidence limits for each parameter and co-variation between parameters. An estimated real time zero of the stopped-flow instrument of -1.25 ms was used to adjust the timeline in order to obtain correct kinetic amplitudes. The fitted data were exported and graphs were created in GraphPad Prism version 6.0 (GraphPad Software).
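To illustrate what simulating a reaction profile by numerical integration means for the simple two-state model A + B ⇌ AB, the Python sketch below integrates d[AB]/dt = kon[A][B] − koff[AB] with a fixed-step fourth-order Runge-Kutta scheme. It is a stand-in for illustration and not the algorithm implemented in KinTek Explorer; the function names, step size and rate constants are assumptions.

```python
import math


def simulate_binding(a0, b0, kon, koff, t_end, dt=1e-4):
    """Integrate A + B <-> AB and return a list of (time, [AB]) points.

    a0, b0 : total concentrations (µM); kon in µM^-1 s^-1; koff in s^-1.
    Uses a fixed-step RK4 integrator (illustrative choice).
    """
    def rate(c):
        # Mass action: free A and B are the totals minus the complex.
        return kon * (a0 - c) * (b0 - c) - koff * c

    n_steps = int(round(t_end / dt))
    c = 0.0
    trace = [(0.0, c)]
    for i in range(1, n_steps + 1):
        k1 = rate(c)
        k2 = rate(c + 0.5 * dt * k1)
        k3 = rate(c + 0.5 * dt * k2)
        k4 = rate(c + dt * k3)
        c += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        trace.append((i * dt, c))
    return trace


def equilibrium_complex(a0, b0, kon, koff):
    """Analytical equilibrium [AB] from the quadratic binding equation."""
    kd = koff / kon
    s = a0 + b0 + kd
    return (s - math.sqrt(s * s - 4.0 * a0 * b0)) / 2.0


# Hypothetical parameters: 1 µM NCBD, 5 µM CID, made-up rate constants.
trace = simulate_binding(a0=1.0, b0=5.0, kon=20.0, koff=10.0, t_end=1.0)
print(f"[AB] at 1 s: {trace[-1][1]:.4f} µM")
print(f"analytical equilibrium: {equilibrium_complex(1.0, 5.0, 20.0, 10.0):.4f} µM")
```

A fitting program would wrap such a simulation in an optimizer that adjusts kon and koff until the simulated traces match the measured fluorescence at all concentrations simultaneously.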

MD simulations of the transition state ensembles.
The transition state (TS) for formation of the human complex was previously determined by means of Φ-value restrained molecular dynamics (MD) simulations (63). Here, the same procedure was followed to determine the TS of the ancestral CID-NCBD complex. The simulations were performed with GROMACS 2018 (68) and the PLUMED2 software (69), using the Amber03w force field (70) and the TIP4P/2005 water model (71). The initial conformation was taken from the available PDB structure (6ES5) (28) and modified with PyMOL (72) to account for the T2073W mutation. The structure was solvated with ~6700/16800 water molecules (for the native state and TS simulations, respectively), neutralized, minimized and equilibrated at a temperature of 278 K using the Berendsen thermostat (73). Production simulations were run in the canonical ensemble, thermostatting the system with the Bussi thermostat (74); bonds involving hydrogens were constrained with the LINCS algorithm (75), electrostatics were treated with the particle mesh Ewald scheme (76) with a short-range cut-off of 0.9 nm, and the van der Waals interaction cut-off was set to 0.9 nm. A reference native state simulation, at a temperature of 278 K, was performed to determine native contacts. First, we ran a 40 ns long restrained simulation to enforce agreement with atomic inter-molecular upper distances previously determined from NMR experiments (28); to this end, lower wall restraints were applied on the NOE-converted distances. Subsequently, an unrestrained 280 ns long simulation was performed and the last 200 ns were used to determine native contacts: given two residues that are not nearest neighbors, native contacts are defined as the number of heavy side-chain atoms within 0.6 nm in at least 50% of the frames.
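The native-contact definition above can be read as a count of side-chain heavy-atom pairs, per residue pair, that lie within 0.6 nm in at least half of the trajectory frames. The Python sketch below implements that reading on plain coordinate lists; it is not the authors' analysis script, and the data layout, function name and toy coordinates are assumptions. Residue numbers are assumed to be unique across both chains, as they are for the NCBD and CID numbering used here.

```python
import math
from collections import defaultdict


def native_contacts(atom_residues, frames, cutoff_nm=0.6, min_fraction=0.5):
    """Count native atom-atom contacts per residue pair.

    atom_residues : residue number of each side-chain heavy atom
    frames        : list of frames; each frame is a list of (x, y, z) in nm,
                    in the same atom order as atom_residues
    Returns {(res_i, res_j): number of atom pairs in contact}.
    """
    n_atoms = len(atom_residues)
    counts = defaultdict(int)
    for i in range(n_atoms):
        for j in range(i + 1, n_atoms):
            res_i, res_j = atom_residues[i], atom_residues[j]
            # Skip atoms in the same residue or in nearest-neighbor residues.
            if abs(res_i - res_j) <= 1:
                continue
            close = sum(1 for xyz in frames
                        if math.dist(xyz[i], xyz[j]) <= cutoff_nm)
            if close >= min_fraction * len(frames):
                counts[tuple(sorted((res_i, res_j)))] += 1
    return dict(counts)


# Toy example: three atoms in residues 1, 2 and 5, four frames.
near = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
far = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(native_contacts([1, 2, 5], [near, near, near, far]))
```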
The TS ensemble of the ancestral complex was determined via Φ-value restrained MD simulations, following a standard procedure based on the interpretation of Φ-value analysis in terms of fraction of native contacts (63,77,78). Here, restraints (in the form of a pseudo-energy term accounting for the squared distance between experimental and simulated Φ-values) are added to the force field to maximize the agreement with the experimental data; the underlying hypothesis is that structures reproducing all the measured Φ-values are good representations of the TS. For each conformation, the Φ-value for a residue is back-calculated as the fraction of its native contacts (determined from the native state simulation) that it makes, implying that only Φ-values between 0 and 1 can be used as restraints. In total, we included 11 Φ-values in this range, all based on single conservative point mutations. Mutations involving charged amino acids (namely K2075M, involved in intermolecular interactions, and the Ala→Gly substitutions at positions D1050 and R1069, probing the helical content of CID helices Cα1 and Cα2, respectively) were excluded; we did, however, verify that the structural ensemble obtained could provide a consistent interpretation of the associated Φ-values. A list of the Φ-values used in the ancestral and human TS simulations is reported in Table S1. The TS ensemble was generated using simulated annealing, performing 1334 annealing cycles, each 150 ps long, in which the temperature was varied between 278 K and 378 K, for a total simulation time of 200 ns. The TS was determined using only the structures sampled at the reference temperature of 278 K in the last 150 ns of simulation, resulting in an ensemble of ~5400 conformations. All the input files needed to perform the TS MD simulation are available on the PLUMED-NEST repository (79), as plumID: 2020-021.
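The back-calculation and the restraint term described above can be written compactly. The Python sketch below is a simplified single-conformation illustration, not the PLUMED implementation used for the simulations: the function names, the residue labels, the contact counts and the force constant are hypothetical, and the actual restraint may be applied to averaged rather than instantaneous values.

```python
def phi_from_contacts(n_formed, n_native):
    """Back-calculated phi: fraction of a residue's native contacts that
    are present in the given conformation."""
    if n_native <= 0:
        raise ValueError("residue has no native contacts to compare against")
    return n_formed / n_native


def restraint_energy(phi_calc, phi_exp, force_constant=1.0):
    """Pseudo-energy: force constant times the sum of squared deviations
    between back-calculated and experimental phi values."""
    return force_constant * sum((phi_calc[res] - phi_exp[res]) ** 2
                                for res in phi_exp)


# Hypothetical residues and contact counts for one conformation.
phi_calc = {
    "res_A": phi_from_contacts(3, 12),   # 3 of 12 native contacts formed
    "res_B": phi_from_contacts(5, 10),   # 5 of 10 native contacts formed
}
phi_exp = {"res_A": 0.25, "res_B": 0.0}  # made-up experimental values
print(restraint_energy(phi_calc, phi_exp, force_constant=2.0))
```

Because a fraction of contacts is bounded by 0 and 1, this construction makes explicit why experimental Φ-values outside that range cannot serve as restraints.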
Data availability: All data are contained within the manuscript. All kinetic and thermodynamic data, and calculations of Φ-values, are compiled in an Excel file provided as Supporting Dataset S1.
of Medicine and Pharmacy. C.C. acknowledges CINECA for an award under the ISCRA initiative, for the availability of high-performance computing resources and support.
Author contributions E.K. and P.J. conceived and designed the project. E.K., A.E., Z.A.T., F.S., E.A. and W.Y. performed experiments and analyzed data. C.P. and C.C. designed, performed and analyzed all MD simulations. E.K., C.P., C.C., and P.J. interpreted the data and wrote the paper.

Competing interests
The authors declare no competing interests.

Tables

Table 1. Rate constants of binding for ancestral NCBD and CID variants determined in stopped-flow experiments. The mutants were either helix-modulating mutants of ancestral CID or deletion mutations in NCBD or CID chosen to probe for native inter- and intramolecular interactions in the ancestral complex (Supporting Dataset S1a). The rate constants were obtained from global fitting of the kinetic data sets to a simple two-state model by numerical integration. The NCBDD/P pWT /CID1R ML complex was measured five times (biological replicates, i.e. performed with different protein batches) and the error in kon is the standard deviation for these replicates. The NCBD and CID mutants were measured once and the 10 % error determined for the wildtype by replicate experiments was applied to these variants. The error in koff is the standard error from global fitting, except for the variants for which displacement experiments were performed. All experiments were performed in 20 mM sodium phosphate pH 7.4, 150 mM NaCl at 4 °C. For those variants, the dissociation rate constant was determined in a separate displacement experiment under the same experimental conditions. The kobs value at 20-fold excess of the displacing protein was taken as an estimate of koff, along with the standard error of the fit to a single exponential function.

The confidence contour plot shows the variation in χ² as two parameters are systematically varied while the rest of the parameters are allowed to float, which can reveal covariation between parameters in a model. The color denotes the χ²/χ²min value according to the scale bar to the right. Here, the confidence contour plot showed that k2 was poorly defined. The yellow boundary represents a cutoff in χ²/χ²min of 0.8. (e) The stopped-flow kinetic traces were fitted to a double exponential function to extract kobs values, which were plotted against the concentration of CID1R D1068A (Supporting Dataset S1e). The trends in the kobs values suggest one fast linear phase (black dots), which reports on binding, and one slow phase with a constant kobs of 15 s⁻¹ (grey dots). The error bars are standard errors from fitting to a double exponential function. (f) Binding of NCBDD/P R2104M /CID1R D1068A monitored by isothermal titration calorimetry in 20 mM sodium phosphate pH 7.4, 150 mM NaCl at 4 °C. Fitting to a two-state model yielded a Kd of 5.1 ± 0.3 µM (Supporting Dataset S1f).