Protein design: the choice of de novo sequences.

Protein design ultimately comes down to choosing amino acid sequences. In contrast to traditional studies of protein folding and function, the choice is not limited to sequences that nature has provided. Advances in molecular biology and synthetic chemistry have made it possible to produce virtually any sequence of amino acids. The number of possible sequences from which to choose is enormous, and even for a small design, one cannot sample all possibilities. For example, for a relatively short sequence of 100 residues composed of the 20 naturally occurring amino acids, there are $20^{100}$ possibilities. This number is so large ($20^{100} > 10^{130}$) that if one synthesized a single molecule of each sequence and put the entire collection into a box, the resulting box would be larger than Avogadro's number of universes.
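The size of this sequence space can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name `sequence_space_exponent` is illustrative and does not come from the original work.

```python
from math import log10

def sequence_space_exponent(length, alphabet=20):
    """Return log10 of the number of sequences of the given length."""
    return length * log10(alphabet)

# A 100-residue chain built from the 20 natural amino acids.
exponent = sequence_space_exponent(100)
print(f"20^100 is about 10^{exponent:.1f}")  # prints: 20^100 is about 10^130.1
```

Working with the logarithm keeps the numbers readable; Python integers could also hold $20^{100}$ exactly if the full value were needed.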
Among this astronomical number of possible sequences, the majority will not fold into soluble, globular structures. Indeed, open reading frames constructed of randomly chosen sequences typically produce insoluble material (1-3). Consequently, among the $20^{100}$ possible sequences, only a small fraction can fold into soluble, globular proteins. Thus, the box of "good sequences" is far smaller than Avogadro's number of universes. Nevertheless, even this smaller box is astronomical in size. Therefore, to enhance the likelihood of success (and reduce the level of frustration), designs must be confined to desirable neighborhoods of sequence space.
What properties define the regions of sequence space that favor soluble, well folded globular structures? Not all amino acids in a sequence contribute equally to the formation of a native structure. Indeed, the tolerance of natural proteins to a variety of amino acid substitutions (4-6) demonstrates that some features are far more important than others. To successfully design proteins de novo, which features are crucial?

Binary Patterning of Polar and Nonpolar Amino Acids
The global features likely to be most important for designing novel proteins can be inferred from an examination of natural proteins. Such examination reveals two universal themes. First, globular proteins fold into structures that maximize burial of hydrophobic side chains while simultaneously exposing hydrophilic side chains to solvent. Second, these structures typically contain an abundance of secondary structure such as α-helices and β-sheets. The prevalence of these two features among natural proteins suggests that they play a crucial role in defining regions of sequence space that are most likely to yield soluble, well folded de novo proteins.
Sequences capable of forming regular secondary structure while simultaneously burying hydrophobic side chains (and exposing hydrophilic ones) can be designed by patterning the sequence periodicity of polar and nonpolar residues to match the structural periodicity of the desired secondary structure. For example, to design α-helical segments, the periodicity of polar and nonpolar residues would approximate a repeat of 3.6 residues/turn. In contrast, designed β-strand segments would be composed of sequences with alternating polar and nonpolar residues (7,8). Kamtekar et al. (9) proposed that such patterning of polar and nonpolar residues might serve as the cornerstone for initial stages of de novo protein design. According to this proposal, only the sequence locations of polar and nonpolar residues must be specified explicitly. The precise identities of the polar and nonpolar residues need not be constrained and can be varied extensively. This proposal gave rise to a design strategy based on a "binary code," which specifies only whether a given residue is polar or nonpolar.

The initial test of the binary code strategy focused on the design of four-helix bundles (9). A combinatorial library of de novo sequences was constructed and expressed. All sequences shared the identical pattern of polar and nonpolar residues. However, the combinatorial underpinnings of the strategy yielded a distinct amino acid sequence for each member of the collection. This is shown schematically in Fig. 1. The collection of novel proteins was expressed from a degenerate family of synthetic genes in which polar amino acids (Lys, His, Glu, Gln, Asp, and Asn) were encoded by the degenerate DNA codon NAN, and nonpolar amino acids (Met, Leu, Ile, Val, and Phe) were encoded by the degenerate codon NTN (where N represents a mixture of A, G, T, and C). Application of this combinatorial method to a sequence pattern containing 24 nonpolar positions and 32 polar positions can yield $>10^{41}$ (i.e. $5^{24} \times 6^{32}$) different amino acid sequences. While this is obviously a very large number, it is small compared with the number of sequences that would have been possible if all 56 helical residues were completely randomized ($20^{56} > 10^{72}$). Limiting the library to sequences that satisfy the binary code reduces the available sequence space by ~31 orders of magnitude. This reduction in sequence space vastly increased the likelihood of recovering well folded, water-soluble proteins. Consequently, in contrast to the insoluble material typically generated by totally random sequences (1,2), the majority of binary code sequences gave rise to proteins that were both α-helical and water soluble (9).
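The binary code strategy described above can be sketched in code: lay out a polar/nonpolar pattern, expand the degenerate codons NTN and NAN with the standard genetic code, and count the resulting library. This is an illustrative sketch only, not the procedure used in Ref. 9. The function names, the 140° width of the nonpolar face, and the starting phase of the helix are all assumptions made for the example.

```python
from itertools import product
from math import log10

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def expand(codon):
    """One-letter residues encoded by a degenerate codon (N = A, C, G, or T)."""
    pools = ["ACGT" if base == "N" else base for base in codon]
    return {CODON_TABLE["".join(c)] for c in product(*pools)}

def helix_pattern(length, arc=140.0, rise=100.0):
    """Binary pattern for a helix: 'N' (nonpolar) on one face, 'P' (polar) elsewhere.

    rise = 360 / 3.6 degrees per residue; arc is an assumed face width.
    """
    return "".join("N" if (i * rise) % 360.0 < arc else "P" for i in range(length))

def strand_pattern(length):
    """Binary pattern for a strand: strictly alternating nonpolar and polar."""
    return "".join("NP"[i % 2] for i in range(length))

def library_size(n_nonpolar, n_polar, nonpolar_choices=5, polar_choices=6):
    """Number of distinct sequences sharing one binary pattern."""
    return nonpolar_choices ** n_nonpolar * polar_choices ** n_polar

print(helix_pattern(18))       # nonpolar residues recur roughly every 3.6 positions
print(strand_pattern(8))       # NPNPNPNP
print(sorted(expand("NTN")))   # F, I, L, M, V: the five nonpolar residues
# Under the standard code NAN also covers TAT/TAC (Tyr) and TAA/TAG (stop);
# the count below follows the text and uses the six listed polar residues.
print(sorted(expand("NAN")))

binary = library_size(24, 32)  # 5^24 x 6^32
random_56 = 20 ** 56           # all 56 helical positions fully randomized
print(f"binary library ~10^{log10(binary):.1f}, random ~10^{log10(random_56):.1f}")
print(f"reduction ~{log10(random_56) - log10(binary):.0f} orders of magnitude")
```

The last two lines reproduce the arithmetic in the text: the binary library exceeds $10^{41}$ sequences, the fully random library exceeds $10^{72}$, and the difference is about 31 orders of magnitude.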
Natural proteins fold into structures with well packed hydrophobic cores. However, the combinatorial basis of the binary code strategy precludes rational design of specific packing interactions. Can the binary code nonetheless generate novel proteins with native-like properties? Recent experiments in our laboratory indicate that proteins with properties similar to those of natural proteins are indeed found among an initial collection of binary code α-helical proteins. Many of them display cooperative thermal denaturations.¹ Furthermore, some proteins in the collection give rise to NMR spectra with significant chemical shift dispersion in both the amide and methyl regions. Most importantly, amide protons are protected from exchange with solvent to an extent similar to that seen in some natural proteins.²

Adherence to the binary patterning reduces the amount of sequence space that must be explored and increases the likelihood of obtaining well folded proteins. However, even this reduced sector of sequence space is enormous, and not all sequences within this space will actually fold into native-like structures. Indeed some do not fold at all (9). Clearly, other features must also contribute to the stability of folded structures. Within the reduced sector of sequence space defined by the binary code, which additional features must be designed explicitly?

* These minireviews will be reprinted in the 1997 Minireview Compendium, which will be available in December, 1997. This is the third article of five in the "Protein Folding and Assembly Minireview Series."
‡ To whom correspondence should be addressed.

Intrinsic Propensities for Secondary Structure
In the 1970s, when the database of known protein structures was 2 orders of magnitude smaller than it is today, it was already recognized that the 20 amino acids had different statistical biases for one or another type of secondary structure (10). In the ensuing years, the intrinsic propensities that underlie these statistical biases have been probed in model systems ranging from mutant proteins to synthetic peptides and copolymers (e.g. Refs. 11-20). Overall, these studies have demonstrated that the statistical trends observed in the database of known structures correlate quite well with the intrinsic propensities for secondary structure determined from physical measurements in model systems (16).
The magnitudes of the intrinsic propensities are typically modest. Therefore, with the possible exception of proline and glycine, these propensities can be overwhelmed by the context of an amino acid in the overall sequence. For example, a given peptide sequence can form different secondary structure in two different protein contexts (21,22). A dramatic example of this context dependence was demonstrated by Minor and Kim (23), who showed that an 11-amino acid sequence (Ala-Trp-Thr-Val-Glu-Lys-Ala-Phe-Lys-Thr-Phe) formed an α-helix when placed in one location but folded as a β-strand when placed in an alternative location within the same protein. In related work, the same authors showed explicitly that the "intrinsic" propensities of the 20 amino acids for β-structure are not "intrinsic" but context-dependent; different values are obtained depending on whether propensities are measured for a central β-strand or an edge strand (20). To explicitly measure the importance of intrinsic propensities relative to context, Xiong et al. (7) designed a series of peptides in which intrinsic propensities favored one type of secondary structure, while binary patterning favored an alternative structure. Characterization of the peptides demonstrated that when the sequence periodicity of polar and nonpolar residues matches the repeat pattern of a particular secondary structure, the peptide forms that structure regardless of the intrinsic propensities of the component amino acids.
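One way to make "sequence periodicity matches structural periodicity" concrete is to score a binary pattern for its periodicity at 100° per residue (α-helix, 3.6 residues/turn) and at 180° per residue (strictly alternating β-strand). The sketch below uses a hydrophobic-moment-style sum for this purpose. It is an illustration, not the analysis performed by Xiong et al. (7), and the function names, the +1/-1 weights, and the two example patterns are assumptions.

```python
import cmath
import math

def periodic_moment(pattern, degrees_per_residue):
    """Normalized strength of the polar/nonpolar periodicity at a given angle.

    pattern is a string of 'N' (nonpolar, weight +1) and 'P' (polar, weight -1).
    """
    total = 0j
    for i, residue in enumerate(pattern):
        weight = 1.0 if residue == "N" else -1.0
        total += weight * cmath.exp(1j * math.radians(degrees_per_residue * i))
    return abs(total) / len(pattern)

def favored_structure(pattern):
    """Return which periodicity, helical or strand-like, the pattern matches better."""
    helix = periodic_moment(pattern, 100.0)   # 360 / 3.6 residues per turn
    strand = periodic_moment(pattern, 180.0)  # alternation every other residue
    return "helix" if helix > strand else "strand"

print(favored_structure("NPNPNPNPNPNPNPNPNP"))  # strand
print(favored_structure("NNPPNPPPNPPNNPPNPP"))  # helix
```

The score depends only on where polar and nonpolar residues sit, not on which amino acids occupy those positions, which mirrors the point of the experiment described above.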

Turns
Chain reversals connecting successive elements of secondary structure give rise to the globular appearance of folded protein structures. However, correctly folded globular structures can also form from sequences in which the normal connectivity of the protein has been altered by cleavage (24) or by circular permutation (25-27). How important are the sequences of turns in determining the overall structure of a protein? Does the amino acid sequence of a turn dictate the location of a chain reversal? If so, then it will be essential to design turn sequences explicitly. Alternatively, are turns merely default structures that occur between elements of secondary structure? If this is the case, then the precise sequences of these stretches of "molecular string" may not need to be designed a priori.
Turns connecting α-helices can be structurally quite different from those connecting β-strands, and the results obtained in one system may not be directly applicable to those in the other system. Thus it is important to consider both kinds of turns.
Interhelical turns have been studied in several different proteins. Brunet et al. (28) analyzed the turn between the third and fourth α-helices in the four-helix bundle, cytochrome b562. The natural Glu81-Gly82-Lys83 turn was replaced by random tripeptide sequences. All studied variants were shown to form structures similar to wild type. In a similar study on the Asp30-Ala31-Asp32 interhelical turn in Rop, 377 of the 380 isolated Rop variants folded properly (29). These results suggest that the sequence of interhelical turns does not dictate the structure of anti-parallel, four-helix bundles.
Is the length of the interhelical region important? Must the length of the turn be designed to disrupt the polar/nonpolar periodicity of the helices bracketing the turn? Vlassi et al. (30) addressed this question by inserting two extra residues into the interhelical turn of Rop. This insertion causes the α-helical periodicity of polar and nonpolar residues to be maintained through the entire length of the Rop sequence. Nonetheless, the mutant sequence formed the interhelical turn in the correct location, and the overall three-dimensional structure was identical to wild type. Further evidence that the length of an interhelical turn is not essential comes from experiments demonstrating that a variety of insertions can be tolerated in the interhelical turns of cytochrome b562 (31,32). These results suggest that helix-helix interactions, not interhelical turns, dictate the overall structure of a four-helix bundle. This suggestion was confirmed by Predki and Regan (33), who used polyglycine linkers of various lengths to connect the helices of Rop in a different order from wild type. The rearranged sequences folded into the correct four-helix bundle and were biologically active.
Although the results with Rop and cytochrome b562 demonstrate that particular interhelical turn sequences are not essential for maintaining the structure of a protein, turn sequences can affect stability. This was shown explicitly by mutating Asp30 of Rop to each of the other 19 naturally occurring amino acids. While all 19 variants folded and functioned correctly, they had a range of stabilities (34). Work on variants of cytochrome b562 yielded similar results (35).
Recent research suggests that the turns between β-strands may not be as tolerant to substitution as interhelical turns. For example, when a type II reverse turn (Pro47 …) in the β-sheet protein plastocyanin was replaced by random tetrapeptides, the correctly folded, blue copper protein formed in only six of the 98 characterized variants (36). This interstrand turn apparently plays a significant role in determining the fold of the β-barrel structure of plastocyanin.

Fig. 1. Helix net representation of the four-helix bundles designed by Kamtekar et al. (9). Exposed, polar residues are represented as white circles. Buried, nonpolar residues are represented as black circles. Identities of the turn residues are shown explicitly. The hydrophobic faces of the helices are shaded. Because the exact identity of each side chain within the helices is not specified, the design is based on a binary code, which specifies only whether a given residue is polar or nonpolar.
The ability of a structure to tolerate different turn sequences will depend, of course, on the overall stability of the protein. Zhou et al. (37) mutated a five-residue turn connecting the last two β-strands in an α + β domain of protein G. They found that the majority of random substitutions was tolerated. However, when the same turn was randomized in a variant of protein G that had already been destabilized by mutations elsewhere in the structure, then a far smaller fraction of turn sequences was tolerated. Thus, tolerance to turn substitution is strongly dependent on the global stability of the host protein.
Is it crucial that the chain reversals in novel proteins be designed explicitly? While turn sequences can affect protein stability, effects are likely to be small and dependent on context. If a design is otherwise robust, particular turn sequences may not need to be designed explicitly, especially in four-helix bundles. However, for the design of more complex structures or for designs that are only moderately stable, it is prudent to specify "good" turn sequences. Indeed, careful attention to the design of turns has had dramatic effects for two recent designs. The betabellin protein was stabilized by the incorporation of non-natural D-amino acids to favor inverse common (type I′) turns (38). Likewise, incorporation of D-amino acids to favor a type II′ β-turn played a key role in stabilizing the native-like ββα structure of a 23-residue de novo sequence (39).

Packing
Is good packing essential? Must it be explicitly designed a priori? We suggest the answers to these questions are (respectively): most certainly yes; and probably no.
The importance of good packing in natural proteins is evident both from analyzing their structures and mutating their sequences. The hydrophobic cores of wild-type structures are invariably well packed with densities approaching those seen in crystals of small organic molecules (40). Mutations that reduce packing density typically render a protein less stable (e.g. Refs. 41-43), while those that improve packing yield proteins with enhanced stability (44).
Packing also plays an important role in the structures of de novo proteins. By using different arrangements of nonpolar residues to redesign the hydrophobic core of Rop, Munson et al. (45,46) demonstrated that size, shape, and relative location of side chains can specify both the stability and the "native-like" properties of a protein. Underpacking yielded proteins that were not stable, whereas overpacking yielded structures that were stable but not native-like.
The determinants of good packing are not merely size, and the effects of altered packing can be quite dramatic. For example, Harbury et al. (47) showed that a coiled-coil peptide with Ile in the "a" positions and Leu in the "d" positions forms dimers, while the analogous peptide in which these residues have been reversed forms tetramers. In these and related peptides, the shape (rather than size or hydrophobicity) of nonpolar side chains dictates whether peptides associate to form two-, three-, or four-stranded structures (47). Further work on coiled coils has shown that the drive toward good packing can serve as the basis for engineering allostery. Variants of the GCN4 coiled coil were designed to recruit small nonpolar molecules from solution to fill a hole and thereby stabilize a three-stranded relative to a two-stranded coiled coil (48).
Packing is clearly important. Must it be designed a priori? This question can be addressed using metaphors invoked by Bromberg and Dill (49). They compare side chain packing to either a jigsaw puzzle model with "lock-and-key fits in which there is specific pairwise matching of complementary side chains" or to "a nuts and bolts model in which side chains pack together without specificity" (see Fig. 2). Both models ultimately yield good packing. However, for the jigsaw puzzle, complementarity must be designed explicitly by the manufacturer of the puzzle (i.e. the protein designer). In the nuts and bolts model, by contrast, nonspecific forces (e.g. the size of the jar holding the nuts and bolts or the hydrophobic effect driving the collapse of a polypeptide chain) lead to compaction, and high packing density results from promiscuous surfaces finding a way to nuzzle up against one another.
The nuts and bolts model is supported by two lines of evidence. First, analysis of known protein structures showed that preferred interactions among hydrophobic side chains are not observed (50). Based on this observation, Behe et al. (50) suggested that although packing is an indispensable prerequisite for the native conformation, it does not serve as the causal agent for the native conformation. Indeed, they conclude that "high packing densities are readily attainable among clusters of the naturally occurring hydrophobic amino acid residues." The second type of evidence supporting the nuts and bolts model comes from mutagenesis studies. For example, Axe et al. (51) reconstructed the entire hydrophobic core of barnase with random nonpolar residues and found that ~23% of the variants retained enzymatic activity. Although the detailed structures of these proteins have not been reported, it is clear that functioning enzymes can be isolated without specifying the details of a jigsaw puzzle. Similarly, our own work using the binary code to construct novel sequences has shown that patterning of polar and nonpolar residues can yield compact α-helical structures (9), which in some cases possess native-like properties.¹,² These results demonstrate that although good packing is important, it is possible to construct native-like de novo proteins without explicitly designing all of the tertiary interactions a priori.

Negative Design
Merely designing favorable interactions in the folded state of a protein is not sufficient to generate a unique structure. It is equally important to design against competing alternatives. This is sometimes described as "negative design" (52). Some examples of negative design are fairly simple, such as the incorporation of a glycine or proline to disfavor continuation of a helix and thereby enhance the likelihood of a desired turn (see Fig. 1 and Refs. 52 and 53). Others are more subtle. For example, inclusion of polar residues at key positions throughout a sequence can disfavor "wrong" hydrophobic cores and thereby favor the formation of the desired unique structure (54). Experimental support for this suggestion comes from the work of Raleigh et al. (55), who showed that incorporation of polar residues at the interface between buried and exposed α-helical surfaces enhances the native-like properties of their α2 peptide.

Conclusions
Designing proteins necessitates choosing sequences. Do we understand proteins well enough to make successful choices? If the goal is to define regions of sequence space that yield stable, water-soluble, α-helical proteins, then the answer is unequivocally "yes." In the past decade we have progressed from a time when the only available proteins were those isolated from natural organisms to a time when de novo proteins have become the focus of new journals and symposia.
Nonetheless, many challenges remain; β-sheet proteins and mixed α/β proteins are considerably more difficult to design (56). Even for α-helical structures, successful design of novel molecules that recapitulate all the thermodynamic, structural, and functional properties of natural proteins remains a difficult challenge. Sequences capable of folding into precise structures that possess high levels of enzymatic activity will be found only rarely in the overall "box" of sequence space. The challenge to devise such molecules will both enhance our understanding of natural proteins and refine our ability to choose de novo sequences.