Mathematical modeling reveals spontaneous emergence of self-replication in chemical reaction systems

Explaining the origin of life requires us to elucidate how self-replication arises. To be specific, how can a self-replicating entity develop spontaneously from a chemical reaction system in which no reaction is self-replicating? Previously proposed mathematical models either supply an explicit framework for a minimal living system or consider only catalyzed reactions, and thus fail to provide a comprehensive theory. Here, we set up a general mathematical model for chemical reaction systems that properly accounts for energetics, kinetics, and the conservation law. We found that 1) some systems are collectively catalytic, a mode whereby reactants are transformed into end products with the assistance of intermediates (as in the citric acid cycle), whereas some others are self-replicating, that is, different parts replicate each other and the system self-replicates as a whole (as in the formose reaction, in which sugar is replicated from formaldehyde); 2) side reactions do not always inhibit such systems; 3) randomly chosen chemical universes (namely random artificial chemistries) often contain one or more such systems; 4) it is possible to construct a self-replicating system in which the entropy of some parts spontaneously decreases, in a manner similar to that discussed by Schrödinger; and 5) complex self-replicating molecules can emerge spontaneously and relatively easily from simple chemical reaction systems through a sequence of transitions. Together, these results start to explain the origins of prebiotic evolution.

sor life does not have nucleic acids but consists of a self-replicating (or "autocatalytic," in his words) metabolic network (6). Another proposal, by Szathmáry (7)(8)(9), is that life evolved from "holistic limited hereditary replicators" such as the formose reaction-in which sugar is replicated from formaldehyde-to "modular unlimited hereditary replicators" such as RNA and today's DNA.
There are lots of biological examples of self-replicating systems that do not rely on a single complex template molecule (as is the case for DNA and RNA molecules). These include the malic acid cycle (a metabolic path of some bacteria and plants for synthesis of malates), the Calvin cycle in photosynthesis (10), the reductive citric acid cycle for a certain group of chemoautotrophs (11,12), the artificially designed myokinasepyruvate kinase cycling system (13), the metabolic pathways of ATP (as well as some other coenzymes such as NAD ϩ and CoA) found in many different living organisms (14), and the whole metabolic reaction network of Escherichia coli (15). Self-replication has also been identified in nonliving systems, such as the formose reaction (10,16), and experiments in laboratories (17)(18)(19), including self-replication of nucleotide-based oligomers (20).
Many models have been put forward to explain the origin of self-replication in terms of a system of coupled chemical reactions (21). Some of these can be categorized as artificial chemistry models, as reviewed by Dittrich et al. (22) and by Banzhaf and Yamamoto (23). For example, the chemoton model describes a system composed of three coupled quasiself-replicating subsystems (metabolism, membrane, and template), which as a whole is able to self-replicate (10). The chemoton can be considered as a model of a minimal living system, but cannot explain how this system spontaneously develops from a soup of simple molecules. The metabolically coupled replicator system is another type of model in this direction (24,25). But it mainly focuses on the information heredity and ecological stability, that is, how to maintain coexistence of different replicators and constrain parasitism, rather than investigating why self-replication can occur in the first place. The most influential model in this direction is the reflexively autocatalytic and food-generated (RAF) 2 theory (26 -28), extended from Kauffman's autocatalytic sets theory (29). A set of chemical reactions is RAF if 1) every reaction in this set is catalyzed by at least one molecule involved in this set and 2) every molecule involved in this set can be produced from a small food molecule set. RAF sets are shown to be able to readily emerge from a set of chemical reactions and always consist of smaller RAF sets, demonstrating the capability to evolve (30,31). Other similar models also contributed to this theory (32)(33)(34)(35)(36), including the chemical organization theory (37) and the graded autocatalysis replication domain model (38). The former is closely related to RAF theory (see Hordijk et al. (39) for a detailed comparison), whereas the latter suffers from lacking evolvability (see Vasas et al. (40,41) for detailed critical analyses). In addition, many of the biological observations mentioned in the previous paragraph can be put into the framework of RAF theory (15,(17)(18)(19)(20).
Although RAF theory is influential, it has limitations as an explanation of self-replication. First, the theory stipulates that every reaction in the set is catalyzed. Even though catalyzed reactions are very common in living systems, not every reaction involves a catalyst (e.g. condensation reactions (8,42)). In the early stages of biological evolution, probably no reaction required sophisticated biotic catalysts (e.g. enzymes) (11,15), so there is a strong motivation not to include these enzymes (20,43,44) or even catalysts as given in a model of the origins of self-replication. Uncatalyzed reactions have significant effects on the dynamics of the whole system (e.g. "innovating" new species of molecules and then triggering other RAF sets) (45,46). The second concern with RAF theory is that it is a purely graph-theoretic approach. As a result, there is no constraint on how chemical reactions are constructed and coupled (although it gives the theory much freedom, the construction of reaction systems is too arbitrary to investigate the model systematically). As another result, extra assumptions need to be made about chemical kinetics to investigate the dynamics of how populations of molecules change over time (47). Here, we see another reason to relax the assumption of studying only catalyzed reactions: catalyzed reactions are never elementary reactions, so the kinetics cannot be simply calculated from the reaction stoichiometry (48).
Any theoretical approach to the origin of self-replication should explicitly include energetics (15,49), an aspect that is missing from all of these models and theories above. There are several reasons for this. First, energetics (e.g. Gibbs energy) determines whether a chemical reaction is spontaneous, so to investigate the spontaneity of the emergence of life, it has to be considered. Second, to concretely discuss the issue-famously put forward by Schrödinger-that life maintains its order by feeding on "negative entropy" (50), energetics has to be explicitly taken into account, as entropy is negatively related to the thermodynamic free energy (which is Gibbs energy in the scenario of constant pressure and temperature (51)). It should be noted that the relationship between life and entropy is investigated in different ways and contexts (52)(53)(54). For example, Branscomb and Russell (53) explained the specific mechanisms to increase thermodynamic free energy in two real-world biochemical scenarios: the hypothesized alkaline hydrothermal vent (55) on the prebiotic Earth and the system where the ferredoxin I protein translocates protons. In the context of statistical physics, a lower limit was derived for the amount of heat generated in a nonequilibrium system where a process of selfreplication occurs (54).
In this paper, we set up an artificial chemistry model, in the form of a general mathematical framework for chemical reaction systems, that properly accounts for energetics, kinetics, and the conservation law. Catalysts are not explicitly included in the model, but we later find that catalysis can emerge, along with self-replication and potentially complex molecules, in our system.
In what follows, we give an overview of our model. The specifications are given under "Theory" at the end of this paper. We would encourage the reader, on first reading, to watch a short movie included in the supporting information that illustrates how our model works, and see "Theory" directly afterward.
We model a well-mixed soup of molecules, each of which is defined by its integer mass, i. A molecule's type is thus denoted ī. Only synthesis reactions and decomposition reactions are possible. Only reactions that conserve mass can occur (i.e. the total mass on the reactant side adds up to that on the product side). Each type of molecule ī has its own standard Gibbs energy of formation G i°, that decides whether a reaction is spontaneous. Moreover, a reaction has to overcome an energy barrier (namely Gibbs energy of activation) to occur. It is called lowbarrier if its energy barrier is low and its reaction rate is thus high or high-barrier otherwise. To define a specific chemical reaction system, we show a list of low-barrier (and spontaneous) reactions only (e.g. Scheme 1).
Many other reaction systems can also be constructed. Some of them are physically possible-meaning that there are proper G i°v alues for all molecules involved so that all reactions listed are spontaneous, whereas some others are not. We are only interested in physically possible systems.
After that, we assume that there is an unlimited reservoir of resource molecules (e.g. molecule 2 for system shown in Scheme 1). Then we use the standard Gillespie algorithm to simulate the dynamics, namely N i (t), the time series of molecule ī's population. In addition, we construct ordinary differential equations (ODEs) to describe the mean-field dynamics.

Collectively catalytic system
We first investigate the system shown in Scheme 1, which is actually a model of the citric acid cycle in cellular respiration (10). Specifically, as shown in Fig. 1a, molecule 1 represents carbon dioxide, 2 represents acetyl-CoA, 4 represents oxaloacetic acid, 5 represents ␣-ketoglutaric acid, and 6 represents citric acid, respectively.
Given an unlimited reservoir of resource molecules 2, the simulation shows that N 1 (t) increases linearly. Also, molecules 4, 5, and 6 are involved in a cycle of reactions, but the total number is constant. Section S4 shows the dynamics and ODE solutions in more detail. We also observe that each reaction in Scheme 1 occurs approximately the same number of times. We

Modeling spontaneous self-replication
can add up these reactions, cancel out the molecules appearing on both sides of the reaction (in this case, molecules 4, 5, and 6), and then obtain the overall reaction of the system 2 3 1 ϩ 1. By comparison, the reaction 2 3 1 ϩ 1 itself is high-barrier, and its reaction rate is thus extremely low, but through the whole system, the actual rate of the overall reaction is several billion times larger (section S4). We call the Scheme 1 system the collectively catalytic system, because the overall reaction 2 3 1 ϩ 1 is catalyzed by molecules 4, 5, and 6. Note that this outcome is consistent with the biological observation that the citric acid cycle consumes acetyl-CoA (molecule 2) and produces carbon dioxide (molecule 1) as a waste product.
The linear growth of N 1 (t) occurs because 1) 1 is an end product that cannot be used by these low-barrier reactions; 2) the number of resource molecules 2 is constant; and 3) no additional molecules 4, 5, and 6 can be produced through these low-barrier reactions (by noting that the number of times molecules 4, 5, and 6 appear on the right-hand side of Scheme 1 is the same as that on the left-hand side).
We can use these observations to give a rigorous set of criteria for collectively catalytic systems. We start by defining an intermediate molecule to be any molecule that appears on both the reactant side and the product side. In general, the following stoichiometric criteria are sufficient (but not necessary) to show that a physically possible chemical reaction system, given supplies of resource molecules, is collectively catalytic: 1) for every low-barrier reaction, at least one type of its reactants comes from the products of other low-barrier reactions (called the criterion for self-driven), and 2) by adding up all of the low-barrier reactions, for every type of intermediate molecule, the number of times it appears on the reactant side and that on the product side are the same (called the criterion for balanced-cancelling).
The citric acid cycle (Scheme 1) thus satisfies all of these criteria, and there are other systems satisfying these criteria too (in fact, any single catalytic reaction can be written as a collectively catalytic system: see section S5 for details). The criteria above give us a way of discerning whether or not a system is collectively catalytic based on stoichiometry alone, without the need to investigate its dynamics.

Self-replicating system
Now we investigate the formose reaction, which involves the formation of sugars from formaldehyde (10).
Given the resource molecule 1, its dynamics are shown in Fig.  2. All three intermediate molecules (N 2 (t), N 3 (t), and N 4 (t)) increase exponentially. The solutions of the corresponding ODEs are consistent with the simulations (section S6).
The fact that 2 grows exponentially can be seen directly by adding up the three low-barrier reactions to obtain 1 ϩ 1 ϩ 2 3 2 ϩ 2. It indicates that if one 2 is present beforehand, one extra 2 can be produced, by transforming two of 1. Then the additional 1 is further used by the system, and more 2s are produced. Although the molecules 3 and 4 are canceled out when we add up the reactions, they also grow exponentially (with 4 increasing much more slowly than the other two). The reason is that the actual reaction rate of each low-barrier reaction is not the same (see section S6). This observation is important because it illustrates that it is not just one type of molecule that grows exponentially in such systems, but all of the intermediate molecules.
We define a self-replicating system, of which Scheme 2 is an example, to be a system in which at least one type of molecule is replicated. By investigating various self-replicating systems, we find that not only exponential but also superexponential growth is observed (see examples in section S7). The dynamics indicate that the reactions in self-replicating systems become faster and faster. This is a very special property compared with the collectively catalytic system (e.g. the citric acid cycle (Scheme 1)), where the overall reaction rate keeps constant.
In general, the following stoichiometric criteria are sufficient (but not necessary) to show that a physically possible chemical    (   Figure 1. Sketches of three chemical reaction systems specified within our model. Details are described under "Theory." a, a simplified model of the citric acid cycle. b, a simplified model of the formose reaction. In both simplified models, we only consider carbon-changing reactions and do not consider isomers of some molecules. Note that the integer representation of mass can be thought of as a rough ordering of the molecule's mass or relative complexity. Consequently, the addition operation representation for chemical reactions guarantees the mass conservation. c, a physically impossible chemical reaction system (Scheme 7).

Modeling spontaneous self-replication
reaction system, given supplies of resource molecules, is selfreplicating: 1) the criterion for self-driven (mentioned above) is satisfied; 2) there are some types of intermediate molecules, and the number of times they appear on the reactant side is less than that on the product side (called the criterion for overproduction); 3) there is no type of intermediate molecule for which the number of times it appears on the reactant side is larger than that on the product side (called the criterion for no-overintake). The formose reaction (Scheme 2) satisfies all of these criteria.

Effect of side reactions on self-replicating systems
Let us consider the following system (Scheme 3), which is the formose reaction coupled with an additional reaction that transforms 4 and 5 to 9.
The last reaction can be thought of as a side reaction, which consumes the intermediate molecules.
When there is an infinite reservoir of molecule 1, the dynamics of this system are exactly the same as in Fig. 2, because the side reaction 4 ϩ 5 3 9 cannot proceed without 5. Even with an infinite reservoir of both molecules 1 and 5, the dynamics of this system are unaffected, because reaction 4 3 2 ϩ 2 occurs so fast that the side reaction 4 ϩ 5 3 9 cannot obtain molecule 4 to proceed. In general, by observing Equations 1 and 3 under "Theory," decomposition reactions, such as 4 3 2 ϩ 2, occur faster than synthesis reactions, because the reaction rate for synthesis, ␥ ϩij , has an extra term N j /(S ϩ N) (compared with ␥ Ϫij ), which is always smaller than 1. In our experimental setting, the number of solvent molecules S is very large, so this extra term is much smaller than 1. In such settings, the selfreplicating system outcompetes the side reaction.
Nonetheless, self-replication can be inhibited by some other side reactions. Consider the following system (Scheme 4).
Given infinite reservoirs of both molecule 1 and 5, we no longer observe self-replication (i.e. N 2 (t) and N 3 (t) do not grow exponentially), because this side reaction consumes the newly produced molecule 2 before it can be used by reaction 1 ϩ 2 3 3.
Note that we always assumed ij ϭ 10 for every low-barrier reaction (see "Theory"), which means that the reaction rate constant ␤exp(Ϫ⌬G ϩij ‡ ), or ␤exp(Ϫ⌬G Ϫij ‡ ), for every lowbarrier reaction is identical, denoted as . But for now, we relax this assumption. Let the reaction rate constants for the first three reactions in Scheme 4 be identical, , and the rate constant for the side reaction be where Ն 0. Firstly, when ϭ 0, the Scheme 4 system goes back to the original formose reaction. Secondly, when the side reaction occurs at low rate (i.e. 0 Ͻ Ͻ 1), the onset of self-replication is delayed but still occurs eventually (the larger is, the more delayed the onset of self-replication becomes; figures not shown). Lastly, when the side reaction has a sufficiently high rate (i.e. Ն 1), it consumes the newly produced molecule 2 before it can be used by reaction 1 ϩ 2 3 3, so self-replication is completely inhibited. ϭ 1 is the critical value in this case.
In the real chemical universe we are living in, the formose reaction is often inhibited by side reactions. Many products further react into a "browning tar" so that the formose reaction has a very low yield of sugars (56). However, this does not preclude the existence of a similar reaction system in the prebiotic world, because, in general, the stability of a self-replicating system depends on the existence and reaction rates of other lowbarrier reactions. Whether self-replication is inhibited depends on how side reactions are coupled with the system and what the specific condition is. Also note that all of the arguments above apply to collectively catalytic systems.

Collectively catalytic and self-replicating systems are common
Another natural question to ask is how common these reaction systems are, such as the citric acid cycle and the formose reaction. Specifically, if we construct alternative chemical universes, or, in another term, different artificial chemistries, where we systematically choose which reactions are low-barrier, we can ask how many of the resulting artificial chemistries contain collectively catalytic or self-replicating systems. To answer this question, we start by using the criterion shared by both of them, that they are self-driven. Table 1 shows the numbers for different L, which is set to be the mass of the largest molecule in this particular artificial chemistry in question, to have a measurement of the number of all chemical reaction systems. For example, when L ϭ 6, there are in total 1, 2, . . . , 6,  S3a). Note that N 4 (t) fluctuates frequently between 0 and small numbers, so the curve looks like a block. We set

Modeling spontaneous self-replication
six types of molecules. By choosing which reaction is low-barrier, we can construct Α l ϭ 0 9 ( l 9 )⅐2 l ϭ 19,683 different artificial chemistries, 16,825 of which turn out to be physically possible. Using the criterion for self-driven given above, we find that 6886 (41%) of all of the physically possible artificial chemistries contain self-driven systems. This percentage increases with L, which indicates that self-driven systems are common and are more common in systems involving more types of molecules.
However, we cannot be sure that these self-driven systems are collectively catalytic or self-replicating, and there is even a third type of self-driven system, the nonsustaining system, which makes things more complicated (see section S8 for details of the nonsustaining system and section S9 for more details of classification for the self-driven system). Nevertheless, we can use the stoichiometric criteria mentioned above (note that these criteria are sufficient but not necessary) to give a lower bound on the number of artificial chemistries that contain collectively catalytic or self-replicating systems. That is, for a self-driven system, a system is collectively catalytic if it satisfies the criterion for balanced-canceling, or self-replicating if it satisfies both the criteria for overproduction and no-overintake.
Using the stoichiometric criteria, we find that the lower bound on the number of artificial chemistries containing collectively catalytic or self-replicating systems increases with L, although the percentage decreases (Table 1). However, because it is the lower bound, it does not mean that the actual number of artificial chemistries containing collectively catalytic or selfreplicating systems decreases. Establishing a firm relationship between the number of chemicals (L) and self-replication will involve simulating dynamics of all systems.

How can life maintain low entropy?
According to the second law of thermodynamics, the total entropy of an isolated system never spontaneously decreases over time. Life, thought of as an open system as opposed to an isolated one, is able to maintain order (i.e. maintain a relatively low entropy level). Schrödinger suggested that this is achieved by life "feeding on negative entropy" (50). His question is how this can happen spontaneously. But before we answer this, we first need a way to discuss this question concretely and more quantitatively.
Under the framework of our model, if we simply consider life as some self-replicating entity, we should then ask if it is possible for a self-replicating system to spontaneously increase its Gibbs energy or at least keep it unchanged. (We first note that in the scenario of constant pressure and temperature, the decrease of entropy corresponds to the increase of Gibbs energy, because G ϭ H Ϫ TS, where G is Gibbs energy, H is enthalpy, T is temperature, and S is entropy (51).) Let us consider the self-replicating system in Scheme 5, given the resource molecule 2. The simulation shows that N 1 (t), N 3 (t), and N 4 (t) increase exponentially, as well as N 5 (t) and N 6 (t) (although they are very small; see section S10). Molecule 1 is the end product, whereas 3, 4, 5, and 6 are replicated through the whole system. We then consider that the "living" system consists of the self-replicating part (namely all of molecules 3, 4, 5, and 6) and the resource molecules in the system. Now we investigate how Gibbs energy of the system changes. We set Gibbs energy of the initial system to zero (as the reference point), because only relative quantity matters. Therefore, Gibbs energy of the self-replicating part is G replicating (t) ϭ Α i ϭ 3 6 N i (t)⅐G i°. Gibbs energy of the resource molecules in the system is G resource (t) ϭ ϪF 2 (t)⅐G 2°, where F 2 (t) is the number of resource molecules 2 ever consumed until time t. Gibbs energy of the waste is G waste (t) ϭ N 1 (t)⅐G 1°. Then the Gibbs energy of the living system is G living (t) ϭ G replicating (t) ϩ G resource (t). As shown in Fig. 3, G living (t) increases, whereas G total (t) ϭ G living (t) ϩ G waste (t) decreases.
We have thus given an explicit example of a self-replicating "living" system that spontaneously consumes the resources to increase its own Gibbs energy. Note that our system, as defined, is a well-mixed gas system. So the waste molecules 1 are not automatically separated from other molecules, and Gibbs energy of the gas-mixing process is neglected in the calculation above. However, the contribution of the fact of well-mixed gas is relatively small (for details, see section S10). So it is still possible for a self-replicating system to spontaneously increase Gibbs energy or at least keep it unchanged.

Spontaneous evolution from simple toward complex
Nature provides many examples of the spontaneous evolution from simple toward complex. Is it possible to construct a

Modeling spontaneous self-replication
system showing a similar process? Imagine there is a chemical reaction system composed of the low-barrier reactions listed in Table 2, given an infinite reservoir of only resource molecules 1.
The first three reactions constitute the formose reaction (Scheme 2), given the resource molecule 1. The three reactions in Scheme 8 constitute a collectively catalytic system, given the resource 3. The 13 reactions in Scheme 9 constitute a self-replicating system, given the resource 1. Is it possible that lots of complex molecules, such 12, 13, and 14, are produced in the end?
The answer in this case is yes. If the first 12 is produced, the self-replicating system of Scheme 9 will be triggered, and consequently N 12 will grow exponentially, as well as N 13 , N 14 , . . . , N 23 . But how is the first 12 produced? There are actually three stages. 1) Initially, when there are only lots of 1s and nothing else, the system stays almost unchanged for a very long time because no low-barrier reaction could occur. Occasionally, by the high-barrier reaction 1 ϩ 1 3 2, one molecule 2 is produced. The self-replicating system of Scheme 2 is triggered, and then N 2 and N 3 grow exponentially. Very quickly, there are lots of 2 and 3. 2) After a relatively long "boring" period, the first 5 is produced by the high-barrier reaction 2 ϩ 3 3 5. Then the collectively catalytic system of Scheme 8 is triggered, and N 6 grows. Very soon there are lots of 6. 3) After that, occasionally one 12 is produced by the high-barrier reaction 6 ϩ 6 3 12.
One might naively believe that the first 12 can be produced by other reactions without the need for self-replicating and collectively catalytic systems as Schemes 2 and 8. This is not the case. Despite an abundance of molecules 1, which could be used to "assemble" an initial 12, the production of the first 12 requires high-barrier reactions (e.g. 6 ϩ 6 3 12). It is only when there is large number of molecules 6 that one such reaction is sufficiently likely to occur. However, without those preceding self-replicating and collectively catalytic systems, molecules 6 cannot be sufficiently produced.
The stage-by-stage procedure mentioned above could be generic for how chemical reaction systems evolve toward complex: a relatively simple innovation triggers some self-replicat-ing or collectively catalytic systems and then a large number of new types of molecules are produced, paving the way for other innovations.
Meanwhile, the more types of molecules there are, the more probable it is to have reactions that are high-barrier before becoming practically low-barrier. For example, in the formose reaction (Scheme 2), if the reaction 1 ϩ 3 3 4 is high-barrier, the system is not self-replicating. But imagine that the following three reactions are low-barrier. They constitute a collectively catalytic system. Then the reaction 1 ϩ 3 3 4, which is the overall reaction of the three, can still be considered as low-barrier, and the formose reaction system is still self-replicating. The only complication is that we have to wait for the complex molecule 30 to appear.

Discussion
We have set up a general mathematical model for chemical reaction systems in which molecules are represented by integers, roughly corresponding to the molecules' mass. All chemical reactions are defined by addition operations, which automatically guarantees mass conservation. Each molecule has a predefined standard Gibbs energy of formation, whereas each reaction has a predefined Gibbs energy of activation, which defines the rate constant for each reaction. By changing the Gibbs energy of molecules and reactions, we have investigated a range of artificial chemistries, some of which are like chemical systems found in our real universe. The strength of our approach is that it accounts for the conservation law, energetics, and kinetics while being general enough to investigate the origins of self-replication and the emergence of life.

Table 2
An example chemical reaction system that is able to evolve from simple toward complex

Modeling spontaneous self-replication
tively catalytic and self-replicating systems are vital in biology, whereas the third (nonsustaining) system appears less important. In terms of generating complexity, the self-replicating system plays a more important role, because it is able to replicate innovations. For example, in the self-replicating formose reaction (Scheme 2), after the first molecule 2 is produced by a high-barrier reaction, more 2s are easily replicated. In a biological setting, if this molecule spreads to other places, it can trigger more self-replicating systems. In contrast, in the collectively catalytic citric acid cycle (Scheme 1), for example, after the innovation (the first molecule 5), the second 5 will not appear until the responsible high-barrier reaction occurs once again. In biology, most metabolic reactions are catalyzed by sophisticated and highly specific enzymes, and thus avoid disturbance from undesired side reactions. In the very early stage of life, however, probably no reaction required enzymes (11,15). We found that whether side reactions prevent a system from selfreplicating depends on how they are coupled with the system and the specific conditions. There appear to exist examples where self-replication can be sustained in the presence of side reactions. However, how sophisticated enzymes develop de novo so that side reactions are avoided is another question, which deserves further investigation.
By arbitrarily constructing different artificial chemistries, we found that lots of them contain self-driven systems, and the lower bounds on the number of artificial chemistries containing collectively catalytic or self-replicating systems increase with more types of molecules. This result suggests that in a random artificial chemistry, it would not be too surprising to observe the emergence of self-replication, one of the central properties of life (1). Although it is not the first theory to propose that self-replication is relatively easy to emerge, as the RAF theory did (29 -31), it is the first one requiring no catalyst.
We provided a general model explicitly showing that high thermodynamic free energy molecules can be produced exponentially from low free energy molecules, whereas specific mechanisms in specific real-world scenarios have been investigated before (52,53). The example system we showed is a metaphor of why high free energy ATP molecules are constantly produced in organisms (14). In addition, as our model takes energetics (corresponding to entropy) into account, it provides a more concrete way to discuss the issue-famously put forward by Schrödinger (50)-why life is able to spontaneously maintain a relatively low entropy level, although it cannot give the full answer. Answering the questions of how the molecules in living systems can be placed in an ordered structure (namely a low entropy state) would require extending our model to include spatial effects.
Our model explicitly shows that complexity evolves from extreme simplicity stage by stage. It gives insights into three issues related to the origin of life. First, the first RNA molecule is much more likely to be produced de novo by this stage-bystage procedure, rather than a magic event (3). It provides theoretical supports to metabolism-first theories (3), such as Wächtershäuser's iron-sulfur world hypothesis (6) and Szathmáry's theory (7)(8)(9). Second, before life, Earth should have gone through many stages in which different self-replicating systems existed, and consequently Earth's compositions were different in each stage. The raw materials for life that we should look for are those for the first self-replicating system rather than those for the extant life (6). That is why, in the current theoretical framework, the raw materials for life (e.g. nucleotides) seem not to be available on the primordial Earth (3,4). Third, collectively catalytic and self-replicating systems generate more types of new molecules, and in return, more types of molecules make more reactions feasible (in the form of catalysis and autocatalysis). This could explain why metabolic reactions in extant life always require sophisticated enzymes (15), whereas no reaction is expected to involve catalysts in the very early stage of life (20,43,44).
As we noted when introducing the model, in order to be as simple as possible, we currently do not fully account for different isomers or for the impossibility of certain chemical transformations. Nevertheless, the current model can be considered as a simpler version of a more general model, with a more intricate way of "ordering" the complexity of molecules (e.g. accounting for different isomers).
Our model provides a convenient platform to construct artificial chemistries and investigate general properties of chemical reaction systems. It may provide a theoretical guideline for systematically searching for other chemical paths toward life (or at least self-replicating entities), as pursued in astrobiology (57,58) and xenobiology (59) for example.

Constructing chemical reaction systems
We model a well-mixed soup of molecules, each of which is defined by its integer mass, i. A molecule's type is thus denoted ī. Only two types of reaction are possible: synthesis of two molecules to create a molecule of greater mass (e.g. 2 ϩ 4 3 6) and decomposition into two molecules to create two molecules of lower mass (e.g. 6 3 1 ϩ 5). All reactions that conserve massthe total mass on the left-hand side of the equation adds up to those on the right-hand side-can occur. For convenience, we define a reaction pair to be a reaction and its corresponding reverse reaction.
Each type of molecule ī has its own standard Gibbs energy of formation G i°. Then, as illustrated in Fig. 4, a reaction is either spontaneous, meaning that the total standard Gibbs energy of formation of the reactants is greater than that of the products (i.e. G i°ϩ G j°Ͼ G i ϩ j°) , or nonspontaneous (i.e. G i°ϩ G j°Յ G i ϩ j°) .  a synthesis reaction ī؉ j 3 i ؉ j. a, for the case that the synthesis reaction is spontaneous (i.e. G i°ϩ G j°Ͼ G i ϩ j°) . b, for the case that it is nonspontaneous (i.e. G i°ϩ G j°Յ G i ϩ j°) .
If one reaction in a reaction pair is spontaneous, the other is nonspontaneous, and vice versa.
According to transition state theory (48), the reactants have to overcome the Gibbs energy of activation (namely ⌬G ϩij ‡ in Fig. 4) for the reaction to occur. In the model, any reaction pair is either low-barrier-the energy barrier (corresponding to ij in Fig. 4) is low and the reaction rate is thus high-or highbarrier. In our system, all reactions are possible, but to define a specific chemical reaction system, we write a list of lowbarrier (and spontaneous) reactions only (e.g. the citric acid cycle shown in Scheme 1 and the formose reaction shown in Scheme 2. Although the real citric acid cycle and the formose reaction involve further reactions, in the simplified model above, only carbon-changing reactions are considered. Specifically, in the Scheme 1 system, we consider each reaction uncatalyzed and neglect the fact that each reaction is catalyzed by specific enzymes. Similarly, in the formose reaction, tetroses further aggregate into a "browning tar" whereby the system would be inhibited to some extent (56). We set aside these complexities to have two easy-to-follow special cases, but our framework does, in general, allow us to describe additional side reactions where necessary.
To be as simple as possible, a strong assumption we make is that molecules with the same mass are identical. In real chemistry, molecules with the same mass may have different atomic compositions (e.g. CO and N 2 ), and even when molecules have the same atomic compositions, they can be different isomers (e.g. ethanol and methoxymethane). Furthermore, our model implies that any type of molecule can be produced by some others (because an integer can always be written as a sum of two integers). But in real chemistry, certain chemical transformations are impossible (e.g. organometallic compounds can never be produced by a chemical reaction system involved with only carbohydrates). Our model currently excludes these real-world possibilities and limitations. However, we emphasize that the "integer molecule mass" in our model does not necessarily correspond to the physical mass, but should be thought of as a rough ordering of the mass or the relative complexity of molecules.
Note that some chemical reaction systems are not physically possible (e.g. the following, shown in Fig. 1c).
By adding up these low-barrier reactions, all molecules are cancelled out. As a consequence, one cannot find proper Gibbs energy for each molecule such that each reaction above is spontaneous. We are only interested in physically possible systems.

Kinetics
We now specify a general model for the kinetics of our chemical system, under the following assumptions: 1) all molecules are ideally gaseous, 2) the whole system is kept at constant pressure and temperature, and 3) every possible reaction is elementary. The derivation follows the law of mass action and transition state theory (48). Here we cover the key points, and a full derivation is given in section S1.
For any synthesis reaction ī ϩ j 3 i ϩ j, the reaction rate in unit s Ϫ1 is as follows, where the subscript ϩij stands for the synthesis reaction, ␤ and are constants, N i is the number of molecules ī in the system, N j is the number of j, S is the number of solvent molecules, determining the global rate at which molecules interact, N is the number of all the molecules except for solvent molecules, and ⌬G ϩij ‡ , as shown in Fig. 4, is defined as follows.
When implementing the model, we set values of ij (positive and finite) for each reaction pair and G i°f or each molecule. Together, these give a unique value for ⌬G ϩij ‡ . Likewise, for any decomposition reaction i ϩ j 3 ī ϩ j, the reaction rate in unit s Ϫ1 is as follows, where the subscript Ϫij stands for this decomposition reaction, and the following is true.
Because the transition state of a reaction pair ī ϩ j 3 i ϩ j and i ϩ j 3 ī ϩ j is identical, by setting ij , both ⌬G ϩij ‡ and ⌬G Ϫij ‡ are uniquely determined. In addition, by setting ij large or small, we can easily make the reaction pair lowbarrier or high-barrier.

Simulation of dynamics
We take the citric acid cycle (Scheme 1) as an example to illustrate how to set up the simulation experiment. 1) Set up all of the constants. Assume that the constant pressure the system is kept at is 100 kilopascals and the constant temperature is 298.15 K, so we obtain that ␤ Ϸ 6.21 ϫ 10 12 s Ϫ1 and Ϸ 0.403 mol/kJ. 2) Set G i°f or each type of molecule (up to 6 in this case), to make sure that the three reactions are spontaneous (i.e. G 5°Ͼ G 1°ϩ G 4°, G 6°Ͼ G 1°ϩ G 5°, and G 2°ϩ G 4°Ͼ G 6°) . Here, we set G 1°ϭ Ϫ780, G 2°ϭ Ϫ500, G 3°ϭ Ϫ490, G 4°ϭ Ϫ190, G 5°ϭ Ϫ830, and G 6°ϭ Ϫ900, all in unit kJ/mol. These values are set in the range of normal chemical substances' Gibbs energy of formation (48,60), roughly in the range between Ϫ1500 and 300. Note that the choice is not unique, and a wide range of choices can be made to allow the system to work. 3) Set 14 ϭ 15 ϭ 24 ϭ 10 for these three low-barrier reactions and ij ϭ 100 for all other reaction pairs (all in units kJ/mol). 4) Assume that

Modeling spontaneous self-replication
there is an unlimited reservoir of resource molecules. In principle, the resource molecule can be assumed to be any type of molecule, but often to be the types that only appear on the reactant side. In this case, we assume it to be molecule 2. In other words, whenever a molecule 2 is consumed or produced, it is replenished or removed so that the number of 2 always keeps as a constant Q ϭ 1000. This setting makes biological sense if we consider a system separated from the unlimited reservoir by a "membrane" or some type of "wall." As long as some resource molecules are consumed, more will enter the system driven by the chemical gradient. 5) We denote the number of molecule ī at time t as N i (t). Initially, N 4 (0) ϭ 1, and N 1 (0) ϭ N 3 (0) ϭ N 5 (0) ϭ N 6 (0) ϭ 0. Molecule 4 triggers the chemical reaction system. We also set S ϭ 1 ϫ 10 6 , which is much larger than Q, so that we are in the dilute limit. We then use the standard Gillespie algorithm to simulate the dynamics (see details in section S2), and the code is accessible (https:// github.com/yuernestliu/Self-replication-simulator). 3 In addition, we construct ODEs to describe the mean-field dynamics (for details, see section S3).