Multiple Post-translational Modifications Affect Heterologous Protein Synthesis*

Background: Post-translational modifications (PTMs) affect protein folding. Results: Statistically significant correlations are revealed between the yield of heterologous protein expression and the presence of multiple PTM sites bioinformatically predicted in the expressed sequences. Conclusion: Predicting potential PTMs in polypeptide sequences can help optimize heterologous protein synthesis. Significance: Correlations revealed provide insights into the role of specific PTMs in protein stability and solubility. Post-translational modifications (PTMs) are required for proper folding of many proteins. The low capacity for PTMs hinders the production of heterologous proteins in the widely used prokaryotic systems of protein synthesis. Until now, a systematic and comprehensive study concerning the specific effects of individual PTMs on heterologous protein synthesis has not been presented. To address this issue, we expressed 1488 human proteins and their domains in a bacterial cell-free system, and we examined the correlation of the expression yields with the presence of multiple PTM sites bioinformatically predicted in these proteins. This approach revealed a number of previously unknown statistically significant correlations. Prediction of some PTMs, such as myristoylation, glycosylation, palmitoylation, and disulfide bond formation, was found to significantly worsen protein amenability to soluble expression. The presence of other PTMs, such as aspartyl hydroxylation, C-terminal amidation, and Tyr sulfation, did not correlate with the yield of heterologous protein expression. Surprisingly, the predicted presence of several PTMs, such as phosphorylation, ubiquitination, SUMOylation, and prenylation, was associated with the increased production of properly folded soluble proteins. The plausible rationales for the existence of the observed correlations are presented. 
Our findings suggest that identification of potential PTMs in polypeptide sequences can be of practical use for predicting expression success and optimizing heterologous protein synthesis. In sum, this study provides the most compelling evidence so far for the role of multiple PTMs in the stability and solubility of heterologously expressed recombinant proteins.

Heterologous protein synthesis is widely employed for production of recombinant proteins. Eukaryotic proteins and their domains are commonly expressed in Escherichia coli cells (1-3) or cell-free extracts (3-6). However, only a minor fraction of all heterologous proteins can be successfully expressed in this host system. The correct folding of eukaryotic proteins in a bacterial host remains a great challenge for their synthesis. To address this issue, eukaryotic expression systems based on the use of yeast (7), wheat germ (8), insect cells (9), rabbit reticulocytes (10), tumor HeLa cells (11), and hybridoma (12) have been developed; however, they tend to produce the proteins in relatively low yields that are often insufficient for structural and/or functional studies.
At present, the factors determining expression success of heterologous protein synthesis are poorly understood. Various physicochemical and structural features of amino acid sequences have been implicated as determining factors of soluble protein expression in a bacterial host (2, 13-15). Computational approaches to predict protein propensity for expression and solubility have been developed (13, 16, 17). Most recently, a number of statistically significant correlations between the yield of heterologous cell-free protein synthesis and multiple calculated and predicted parameters of amino acid sequences have been reported (18).
Importantly, many eukaryotic proteins require multiple PTMs to reach a native, biologically active conformation. PTMs can significantly change the integral characteristics of proteins that affect their stability and solubility, such as charge, hydrophobicity, and solvent accessibility. Thus, in addition to the physicochemical and structural features of amino acid sequences, PTMs should also be considered as major determinants of successful protein synthesis. The presence of specific sequence motifs encoding modification sites in target proteins and the capacity of the employed expression system to carry out these modifications are the prerequisites for PTM occurrence. Notably, the bacterial expression systems have only a limited capacity for PTMs. The inability of heterologous protein synthesis to support all PTMs that a protein requires to fold is considered to be a major factor behind the low expression yield and poor solubility of many recombinant proteins. Thus, eukaryotic proteins produced in the bacterial expression systems are quite often misfolded or unfolded, leading to their deposition into insoluble aggregates or fast degradation. Quantifying the yields of soluble and insoluble expression provides a clue to the evaluation of folding and stability of the synthesized polypeptide products. In contrast to the numerous analyses addressing the correlations between heterologous protein synthesis and physicochemical properties of the expressed polypeptides, no systematic study concerning the correlations of heterologous protein expression with multiple PTMs has been presented.
In this study, to gain an insight into the role of multiple PTMs in the stability and solubility of heterologously synthesized proteins, we evaluated the expression of 1488 human proteins and their domains in a bacterial cell-free system, and we examined the correlations of the reaction yield with the presence of multiple PTM sites bioinformatically predicted in the expressed sequences. The obtained information was accumulated in the database of the structural genomics/proteomics project "Protein 3000," launched in Japan in the year 2002 with the aim to determine the structures of 3000 proteins using NMR and x-ray analyses (19-21).

EXPERIMENTAL PROCEDURES
Protein Expression and Its Evaluation-Cell-free expression of human proteins and their domains was carried out in the E. coli S30 extracts as described previously (18). The main steps of the heterologous protein production are shown schematically in Fig. 1. All proteins were expressed under the same uniform set of conditions to minimize the influence of sequence-independent factors. No solubility-enhancing compounds or chaperones were included in the protein synthesis reaction mixture. All synthesized polypeptide products carried the N-terminal poly-His tag. Proteins that were expressed at a lower molecular weight than expected were considered to be nonexpressed. Scores A, C, and N were assigned to all experimentally expressed proteins as follows: A, soluble proteins expressed at the levels of more than 0.1 mg/ml; C, expressed but insoluble proteins; and N, nonexpressed proteins (expression yield of less than 0.1 mg/ml). Notably, score A provided the upper estimation of soluble protein expression, and score C gave the lower estimation of insoluble expression, because the procedure of centrifugation at 10,000 × g used to separate soluble and insoluble proteins in this study cannot discriminate between truly soluble proteins and small protein aggregates.
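The scoring rule described above can be sketched as a small function. This is only an illustration: the function name, its two yield arguments, and the choice to apply the same 0.1 mg/ml threshold to the insoluble fraction are assumptions not stated in the text, and the behavior at exactly 0.1 mg/ml is not specified in the source.

```python
def expression_score(soluble_mg_ml, insoluble_mg_ml, threshold=0.1):
    """Assign score A, C, or N from hypothetical soluble/insoluble yields (mg/ml).

    Assumption: the 0.1 mg/ml cutoff is applied to each fraction separately.
    """
    if soluble_mg_ml > threshold:
        return "A"  # soluble expression above the cutoff
    if insoluble_mg_ml > threshold:
        return "C"  # expressed, but recovered in the insoluble fraction
    return "N"      # treated as nonexpressed


# Hypothetical yields, for illustration only
print(expression_score(0.5, 0.0))
print(expression_score(0.05, 0.4))
print(expression_score(0.02, 0.03))
```

Because the three scores are mutually exclusive, each protein contributes to exactly one cell of the contingency tables used in the statistical analysis.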
Dataset-A complete dataset of expressed polypeptides included 1488 nonredundant human amino acid sequences. Similar sequences were filtered out at 95% identity. The length of analyzed sequences did not exceed 350 amino acids. Proteins of different functional and structural classes were represented in the dataset.
Bioinformatics Prediction of PTMs-The sites of PKC phosphorylation, CK2 phosphorylation, tyrosine phosphorylation, PKA phosphorylation, myristoylation, asparagine glycosylation, amidation, prenylation, sulfation, and aspartyl hydroxylation were predicted with the PROSITE scanning tool PS_SCAN (22, 23). The service searches multiple protein sequences with multiple patterns from the PROSITE database. Sites of S-palmitoylation were predicted with the CSS-Palm 3.0 tool (24). Sites of N-terminal acetylation were predicted using the NetAcet 1.0 server (25). N-terminal methionine excision was analyzed using the Terminator 3 server (26, 27). Disulfide bonds were computed with the DIpro tool (28). Sites of ubiquitination were predicted using the predictor of protein ubiquitination UbPred (29). Sites of SUMOylation were predicted with the site-specific predictor SUMOsp 2.0 (30). The grand average of hydrophobicity (GRAVY) index was calculated using free software available at the ExPASy server. Solvent accessibility was computed with a tool provided online, and the contents of charged residues were calculated using Proteomix software (31).
Homology Modeling-Homology modeling of the dataset proteins harboring the sites of aspartyl hydroxylation was carried out using the protein structure homology-modeling server SWISS-MODEL (32-34). Visualization and rendering of the modeled structures were done with PyMOL (35).
Statistical Analysis of Data-In this study, all expressed proteins were categorized into three mutually exclusive classes as follows: soluble (A), insoluble (C), and nonexpressed (N) proteins. Thus, the expression data represented, in essence, a categorical dataset. To estimate statistical significance of the observed correlations, categorical data analysis was applied using two-way contingency tables (36). The Fisher's exact p value was computed using an online tool. A confidence level of 95% was set as the null hypothesis rejection threshold.
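The categorical analysis described above can be illustrated with a minimal sketch of a two-sided Fisher's exact test on a 2 × 2 table. The study used an online tool whose implementation is not described; the function below, the table layout (PTM-positive versus PTM-negative rows, score A versus other scores as columns), and the counts in the usage example are assumptions made purely for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one
    # (small relative tolerance guards against floating-point ties)
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical counts: rows = PTM(+), PTM(-); columns = score A, other scores
p_value = fisher_exact_two_sided(3, 12, 457, 1016)
print(p_value, "significant" if p_value < 0.05 else "not significant")
```

A p value below 0.05 corresponds to rejecting the null hypothesis at the 95% confidence level used in this study.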

RESULTS AND DISCUSSION
Overview of Protein Expression Pipeline-The specific functions of individual PTMs can be explored by analyzing heterologous protein production in the expression system that does not support these PTMs. In this study, human proteins and their domains were expressed in the cell-free extracts of E. coli. The workflow of the heterologous protein production is presented in Fig. 1. Linear DNA templates were amplified by the two-step PCR from the nonredundant source human cDNA clones selected from the RIKEN cDNA collection. The linear PCR products thus generated were used to program batch-mode coupled transcription/translation protein synthesis in the bacterial S30 extracts under the same uniform set of conditions. Soluble and insoluble reaction products were separated by centrifugation and subjected to SDS-PAGE and protein staining to evaluate the yields of soluble and insoluble protein expression. Each experimentally expressed protein was categorized into one of the following three groups: soluble (A), insoluble (C), and nonexpressed proteins (N), as described under "Experimental Procedures." Overall estimation of protein expression showed that the proteins of group A represented 30.9% (460), the proteins of group C represented 52.6% (782), and the proteins of group N represented 16.5% (246) of all proteins in the dataset. Similar rates of soluble expression have been reported previously for human proteins expressed in E. coli (37) and in cell-free bacterial extracts (18).

PTMs Whose Occurrence Negatively Correlates with Soluble Protein Expression-Predictive bioinformatics analysis carried out on the dataset of 1488 human polypeptide sequences expressed in the cell-free bacterial system revealed several PTMs associated with significantly worsened protein amenability to heterologous expression. They included Asn glycosylation, S-S bond formation, myristoylation, and palmitoylation (Fig. 2, Table 1, and supplemental Table S1). All of them were highly abundant PTMs that could be predicted repeatedly in many proteins of the studied dataset (Fig. 2, B, D, F, and H).
As expected, the predicted presence of N-glycosylation sites in the dataset sequences was found to be associated with a lower rate of soluble expression and a higher rate of insoluble expression (Fig. 2A, curves A and C, respectively). These correlations were statistically confirmed (Table 1 and supplemental Table S1). The observed tendencies are most likely related to the fact that the N-linked protein glycosylation pathway is absent from the E. coli-based system of protein synthesis used in this study. N-Linked protein glycosylation was shown to be important for the folding, stability, trafficking, and pharmacokinetics of many proteins (38). The increased folding efficiency of glycosylated proteins is attributed to the chaperone-like activity of glycans, which can be observed even when glycans are not covalently linked to proteins (39). In addition, glycosylation increases the overall hydrophilicity of nascent polypeptide chains, preventing their potential aggregation due to intermolecular hydrophobic interactions. Importantly, although the bacterial N-linked protein glycosylation pathway (Pgl) has been discovered in the ε-proteobacterium Campylobacter jejuni and the characteristic enzyme of the pathway, PglB, was found in at least 49 bacterial species (40, 41), N-glycosylation does not occur in E. coli, and only the transfer of N-linked glycosylation systems to this bacterium enables the production of recombinant glycoproteins.
In accordance with our previous results (18), the predicted presence of disulfide bonds in proteins was found to be negatively correlated with soluble protein expression and positively correlated with the insoluble expression (Fig. 2C, curves A and C). This comes as no surprise, considering that the disulfide bridges effectively stabilize protein molecules by linking distant parts of polypeptide chains. However, the formation of S-S bonds in human proteins is greatly compromised in bacterial expression systems, as the reducing conditions and oxidative folding in bacteria differ from those in eukaryotes (42). It has been established that the formation of intra- and intermolecular disulfides is not possible in the reducing cytoplasm of wild-type E. coli, resulting in the aggregation of some disulfide bond-rich proteins, e.g. Fab antibody fragments (43). In practice, it is often possible to optimize the conditions for the S-S bond formation in a given polypeptide by adjusting the reducing conditions of the protein synthesis reaction in cell-free systems. The pretreatment of bacterial extracts with iodoacetamide was shown to abolish the disulfide reducing activity of bacterial extracts, resulting in the recovery of an oxidizing redox environment (44).
Similarly to N-glycosylation and disulfide bond formation, the predicted presence of the lipid modification sites, such as myristoylation and palmitoylation, was found to be associated with worsened soluble expression and increased rates of nonexpressed and insoluble-expressed proteins (Fig. 2, E and G, Table 1, and supplemental Table S1). The most plausible explanation for the observed tendencies may be related to the fact that the amino acid sequences with the predicted lipid modification sites have far higher-than-average overall hydrophobicity, typical of the membrane-localized and membrane-associated proteins. The average values of the GRAVY parameter in the subsets of modified and unmodified proteins were −0.398 and −0.603 for myristoylation subsets and −0.355 and −0.493 for palmitoylation subsets. The highest GRAVY value in the C-ranked targets (data not shown) suggests that this parameter is primarily related to protein solubility. Previously, high surface hydrophobicity has been shown to increase protein aggregation due to intermolecular hydrophobic interactions and to worsen soluble protein expression yield in bacterial expression systems (16, 18, 45, 46). Therefore, the amino acid sequences with the predicted lipid modification sites constitute a subset of very hydrophobic proteins, which have an intrinsically low potential for soluble expression.
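For reference, the GRAVY values compared above are simple averages of per-residue hydropathy. The sketch below assumes the standard Kyte-Doolittle hydropathy scale, on which the GRAVY index is defined; it is not the ExPASy software used in the study, and the example peptides are hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: sum of residue values divided by length."""
    residues = sequence.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


# Hypothetical peptides: a hydrophobic stretch versus a charged stretch
print(gravy("LIVFALIV"))
print(gravy("DEKRDEKR"))
```

More positive values indicate more hydrophobic sequences, which is why the modified subsets (−0.398 and −0.355) count as more hydrophobic than the unmodified ones (−0.603 and −0.493).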
PTMs Whose Occurrence Does Not Correlate with Protein Expression-In addition to the abovementioned modifications, the predicted presence of aspartyl hydroxylation sites seemingly worsened expression amenability of amino acid sequences. Indeed, prediction of this PTM decreased the rate of soluble expression and increased the proportion of nonexpressed proteins (Fig. 3A, columns A and N, respectively). Aspartyl hydroxylation was found to be a relatively low abundant PTM; only 12 proteins in the analyzed dataset displayed this feature. The categorical data analysis showed that the two subsets of data, ASX(+) and ASX(−), were not significantly different at the 95% confidence level (Table 1 and supplemental Table S1); therefore, this modification was categorized into the group of PTMs whose presence does not correlate with the protein expression amenability in this expression system.
Notably, the PROSITE scanning tool employed in this study for prediction of multiple PTMs, including aspartyl hydroxylation, is based on the identification of predefined PTM consensus signature patterns in linear amino acid sequences. The tool does not take into account spatial accessibility of the potential PTM sites, suggesting that at least a fraction of these sites cannot be modified as they may be located in the inaccessible parts of protein molecules. This should result in a number of false-positive predictions, obscuring the correlations between the protein expression yield and the presence of the PROSITE-predicted PTM sites. Importantly, misidentification of even a single modification site may be crucial in the case of low abundant PTMs, such as aspartyl hydroxylation, because it can bring about a statistically significant difference in the analyzed data subsets.
To address this issue, solvent accessibility of the predicted sites of aspartyl hydroxylation has been investigated. Three-dimensional structures of the 12 dataset proteins harboring this modification were built using structural homology modeling, and the locations of the modification sites were mapped in the modeled structures. Importantly, all of the predicted aspartyl hydroxylation sites were found to be located in the solvent-accessible regions of protein molecules (supplemental Fig. S1), suggesting their validity and functional relevance. This result rules out the presence of false-positive PROSITE-predicted aspartyl hydroxylation sites in the analyzed dataset, thereby reinforcing the conclusions of the categorical data analysis.
Still, the existence of negative correlation between aspartyl hydroxylation and heterologous protein synthesis cannot be completely ruled out at this stage, and the study of a more extended dataset is necessary to further clarify the effect of this PTM. The expression analysis of the ankyrin repeat domain-containing proteins, the recently identified substrates of hydroxylation by FIH hydroxylase (47), may help address this point.
About 18% of human polypeptide sequences in the analyzed dataset have been predicted to contain the amidated C terminus (Fig. 3B). Amidation prevents ionization of the C terminus, rendering it more hydrophobic, with implications for protein solubility. The exact biological role of this PTM is largely unknown. It has been suggested that C-terminal amidation may contribute to protein stability (38). The bioinformatics analysis performed in this study could not reveal any statistically significant differences in the ratios of soluble, insoluble, and nondetectable expression in the subset of proteins predicted to harbor this PTM in comparison with the subset of unmodified proteins (Fig. 3B, Table 1, and supplemental Fig. S1). This result implies that, rather than stability and solubility, amidation may affect protein function. Indeed, the studies of several amidated peptides indicate that the amide moiety may be a key determinant of ligand-receptor interactions (48-50).
Similarly, we could not reveal any statistically significant correlations between the predicted presence of tyrosine sulfation sites and heterologous protein synthesis (Fig. 3C, Table 1, and supplemental Table S1). In total, about 29% of human polypeptide sequences in the analyzed dataset have been predicted to contain sulfated Tyr residues (Fig. 3D), which largely agrees with the previous estimation that the mouse genome may encode over 2000 Tyr-sulfated proteins (51). Tyr sulfation is often observed in bioactive peptides, such as neuropeptides, peptide hormones, and conotoxins, isolated from different vertebrate and invertebrate organisms (52). However, Tyr sulfation does not occur in prokaryotes and unicellular eukaryotes (51). In most cases, this modification is required for full biological activity of the peptides. Structural data indicate that sulfo-Tyr is involved in hydrogen bonding networks and salt bridge interactions (53).
Similarly to amidation, the lack of correlation between the predicted presence of sulfation sites and heterologous protein synthesis suggests that Tyr sulfation has little to do with protein stability and solubility in the employed expression system.
As expected, we failed to observe any statistically significant correlations between protein synthesis efficiency and the predicted presence of the N-terminal modifications, such as N-terminal methionine excision, myristoylation, and acetylation (supplemental Fig. S2 and supplemental Table S2), because all synthesized polypeptide products in this study carried the N-terminal poly-His tag to allow their purification. Evidently, the expression of untagged or C-terminal-tagged polypeptides should be analyzed to deduce the correlations between the presence of N-terminal modifications and the yield of heterologous protein synthesis.
PTMs Whose Occurrence Positively Correlates with Soluble Protein Expression-The bacterial cell-free expression systems do not support multiple PTMs that eukaryotic proteins require to properly fold. This is considered to be a major factor behind the low expression yield and poor solubility of many recombinant proteins. Unexpectedly, the predicted presence of several PTMs, such as prenylation, ubiquitination, SUMOylation, and phosphorylation, was found to be associated with the increased production of properly folded soluble protein. As a rule, this tendency was accompanied by a reciprocal decrease in the rate of insoluble expression.
Among these PTMs, prenylation was found to be a quite low abundant modification, and only 15 proteins in the analyzed dataset have been predicted to contain the potential prenylation sites (Fig. 4A). Nevertheless, categorical data analysis confirmed a statistically significant difference in the ratios of soluble proteins in the two expression subsets, Pre(+) and Pre(−) (Table 1 and supplemental Table S1). The low abundance of prenylated proteins in the analyzed dataset is consistent with the previous estimate, which put the number of possible prenylated proteins in the mammalian proteome at less than 2% (54, 55). Prenyl groups were shown to be involved in protein-protein interactions through specialized prenyl-binding domains (56, 57). In addition, the long chain hydrophobic prenyl groups increase net hydrophobicity of proteins and facilitate their attachment to cell membranes. Therefore, eukaryotic proteins synthesized in the bacterial expression system, which does not support this PTM (56), will be less hydrophobic than those synthesized in a eukaryotic system, and they should be preferentially retained in the cytosolic (i.e. soluble) fraction. This may explain the increased rate of soluble expression observed in the subset of human proteins predicted to contain potential prenylation sites (Fig. 4A).
Other PTMs whose predicted presence was found to be associated with the increased yield of soluble protein product included highly abundant protein modifications such as ubiquitination and SUMOylation. No SUMOylation and ubiquitination machineries are known to exist in bacteria, suggesting that the presence of ubiquitination and SUMOylation sites in amino acid sequences may be associated with the intrinsically better solubility of these sequences even in the physical absence of the modifications. We hypothesized that this is most probably related to the physicochemical and/or structural characteristics of the modification sites themselves. In this connection, ubiquitination has been established to occur on lysine side chains of the acceptor substrate. Although ubiquitin attachment sites were resolved for some subsets of proteins, little sequence consensus was detected, implying that specific residues surrounding the modified lysine are not important determinants for ubiquitination (29, 58). Nevertheless, it has been reported that the pronounced feature of ubiquitination sites is the abundance of charged and polar amino acids, especially negatively charged Asp and Glu, and the depletion of hydrophobic residues, such as Leu, Ile, Phe, and Pro around these sites. In addition, ubiquitination sites display increased solvent accessibility and a high propensity for intrinsic disorder (29). Importantly, the high content of charged residues, low hydrophobicity, increased solvent accessibility, and high content of intrinsically disordered sequences have all been implicated as the factors that augment soluble protein synthesis in the expression system used (18). These factors may account for the increased yields of soluble expression observed for the sequences predicted to contain multiple ubiquitination sites.
The physicochemical and structural features of SUMOylation sites have not been investigated in depth. Although the catalytic mechanisms of ubiquitination and SUMOylation are similar, the sites of ubiquitination and SUMOylation are not the same. At the functional level, SUMOylation often acts antagonistically to ubiquitination and serves to stabilize proteins rather than to target them for degradation. Most SUMO-modified proteins have been shown to contain the tetrapeptide consensus motif ΨKX(D/E) (where Ψ is Ala, Ile, Leu, Met, Pro, Phe, or Val, and X is any amino acid residue), although the existence of some nonconsensus sites has also been reported (30, 59). The above consensus sequence suggests that the proteins with multiple sites of SUMOylation may have an increased percentage of charged residues. Indeed, the average contents of charged residues in the analyzed subsets of modified and unmodified proteins were significantly different, at 29 and 24%, respectively. This observation may provide a plausible explanation for the better solubility of the polypeptide sequences containing multiple predicted sites of SUMOylation. Notably, SUMO and ubiquitin fusion tags have recently been employed to express recombinant proteins in E. coli. These fusions often improve solubility and stability of the expressed proteins (60, 61).
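The link between the consensus motif and charged-residue content can be illustrated with a short sketch. It is not the SUMOsp 2.0 predictor used in the study: the regular expression only encodes the tetrapeptide consensus quoted above, the definition of charged residues as Asp, Glu, Lys, and Arg is an assumption (the Proteomix definition is not given in the text), and the example sequence is hypothetical.

```python
import re

# Consensus quoted in the text: Psi-K-X-(D/E), Psi = A, I, L, M, P, F, or V.
# The lookahead lets every start position be tested independently.
SUMO_CONSENSUS = re.compile(r"(?=[AILMPFV]K[A-Z][DE])")

# Assumption: charged residues are Asp, Glu, Lys, and Arg
CHARGED = set("DEKR")


def sumo_consensus_sites(sequence):
    """Return 1-based positions of the acceptor Lys in each consensus match."""
    return [m.start() + 2 for m in SUMO_CONSENSUS.finditer(sequence.upper())]


def charged_fraction(sequence):
    """Fraction of residues that are D, E, K, or R."""
    residues = sequence.upper()
    return sum(aa in CHARGED for aa in residues) / len(residues)


# Hypothetical sequence, for illustration only
example = "MAKTELGSVKQDAA"
print(sumo_consensus_sites(example))
print(round(100 * charged_fraction(example), 1), "% charged")
```

Each consensus match contributes at least two charged residues (the acceptor Lys and the Asp or Glu), which is consistent with the higher charged-residue content reported for the modified subset.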
Correlations of Protein Expression with the Presence of Phosphorylation Sites-One of the most interesting findings of this study was the revelation of positive correlation between the predicted presence of phosphorylation sites and protein amenability to heterologous cell-free expression. Phosphorylation has been established to greatly influence protein structure and function; thus, it could be expected that the reduced capacity of the bacterial expression system for phosphorylation would complicate the correct folding of heterologously synthesized proteins, resulting in the decreased yields of their soluble expression. Quite unexpectedly, the opposite tendency has been observed (Fig. 5A, curve A). It was accompanied by a decrease in the rates of insoluble and nondetectable protein expression (Fig. 5A, curves C and N).
Similarly to the case of ubiquitination and SUMOylation, we reasoned that the predicted presence of phosphorylated sites in proteins should confer the intrinsically better solubility to these proteins, even if they remain unmodified, and, most probably, this property is related to the physicochemical and/or structural characteristics of the phosphorylation sites themselves. Notably, phosphorylation can occur on multiple sites in a given protein. Early estimates suggested that the majority of human proteins may be phosphorylated at multiple sites (in total more than 100,000 sites) (62). In agreement with this, the total number of phospho-sites predicted by the PROSITE PS_SCAN scanning tool in the studied dataset was 12,308. The average number of phospho-sites per protein in the dataset was 8.27, and only 8 proteins in the dataset have been found not to harbor this modification (Fig. 5B).
Evidently, due to the nature of the PROSITE prediction algorithms, which do not take into account spatial accessibility of modification sites, not all of the predicted sites can be accessible to phosphorylation. This may bring about a number of false-positive predictions, as discussed above for the case of aspartyl hydroxylation. Interestingly, the solvent accessibility of the predicted phosphorylatable residues calculated for a subset of phosphorylated proteins was 81.1%. This value is significantly higher than the average solvent accessible surface of the proteins in the expression dataset (~51.8%). Still, it is difficult to correctly evaluate the functional accessibility of the predicted phosphorylation sites as some of them may be rendered accessible after interactions with their effector protein kinases.
AUGUST 3, 2012 • VOLUME 287 • NUMBER 32
Phosphorylation of eukaryotic proteins on multiple sites is carried out by numerous divergent members of the eukaryotic protein kinase superfamily, including Ser/Thr- and Tyr-specific protein kinases (63). No single substrate consensus sequence can be provided for these enzymes. More than 500 putative protein kinase genes have been identified in the human genome (64). Together, the most processive Ser/Thr-specific protein kinases, such as PKA, PKC, PKG, CK2, CamII, and Cdk1, account for phosphorylation of more than 90% of all phosphorylatable residues in proteins. Their corresponding PROSITE consensus patterns are as follows: RX₁₋₂(S/T)X, X(S/T)X(R/K), (R/K)₂₋₃X(S/T)X, X(S/T)XX(D/R), RXX(S/T)X, and X(S/T)PX(R/K) (65). Evidently, the specificity of these kinases is directed by basic Lys and Arg residues in close proximity to the acceptor residue. This fact suggests that the proteins with multiple phosphorylation sites may have an increased ratio of charged residues and a higher solvent accessibility associated with a larger solvent exposure of charged residues. Indeed, a direct correlation has been observed between the predicted number of phosphorylation sites and the content of charged residues/solvent accessibility (supplemental Fig. S3, A and B). Previously, these parameters were found to be positively correlated with the soluble protein expression (18). Considering the similarity of the substrate consensus sequences, it comes as no surprise that protein amenability to soluble expression correlates positively with the predicted presence of the specific phosphorylation sites for PKC, PKA, and CK2. The amenability to insoluble expression correlates negatively with the presence of these sites (Fig. 5, C and E, supplemental Fig. S4, and supplemental Table S3).
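To make the pattern-based prediction concrete, the sketch below scans a sequence for three of the consensus patterns quoted above and reports the acceptor Ser/Thr positions. It is not PS_SCAN: the regular expressions are a direct transcription of the patterns as written in the text (X taken as any residue, the variable-length pattern expanded into fixed-length variants), and the helper names and the example sequence are hypothetical.

```python
import re

# Patterns as quoted in the text; the captured group is the acceptor Ser/Thr.
# "RX1-2(S/T)X" (one or two X residues) is expanded into two fixed variants.
CONSENSUS_PATTERNS = {
    "X(S/T)X(R/K)": [r"[A-Z]([ST])[A-Z][RK]"],
    "RXX(S/T)X": [r"R[A-Z]{2}([ST])[A-Z]"],
    "RX1-2(S/T)X": [r"R[A-Z]([ST])[A-Z]", r"R[A-Z]{2}([ST])[A-Z]"],
}


def acceptor_positions(sequence, regexes):
    """Return sorted 1-based positions of acceptor residues for one pattern."""
    residues = sequence.upper()
    found = set()
    for regex in regexes:
        # Lookahead allows overlapping matches at every start position
        for match in re.finditer("(?=" + regex + ")", residues):
            found.add(match.start(1) + 1)
    return sorted(found)


def scan(sequence):
    """Map each consensus pattern to its predicted acceptor positions."""
    return {
        name: acceptor_positions(sequence, regexes)
        for name, regexes in CONSENSUS_PATTERNS.items()
    }


# Hypothetical sequence, for illustration only
for name, positions in scan("MGRRASLKETPRK").items():
    print(name, positions)
```

Because every match requires a nearby Arg or Lys, sequences with many predicted sites necessarily carry more basic residues, which is the compositional effect discussed above.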

Role of PTMs in Heterologous Protein Expression
Interestingly, these correlations have not been observed for the proteins with the predicted sites of Tyr phosphorylation (Fig. 5G and supplemental Table S3). One explanation for this may be the drastic difference in the amino acid composition of Tyr and Ser/Thr phosphorylation sites. The specificity of Tyr kinases is dominated by acidic, basic, and hydrophobic residues adjacent to the acceptor residue; however, a large variation makes it difficult to deduce a generalized consensus sequence for these sites (66, 67). Moreover, in contrast to the polypeptide sequences with the predicted sites of Ser/Thr phosphorylation, no tendency to the increased content of charged residues and solvent accessibility has been observed for the proteins with predicted sites of Tyr phosphorylation. The values of the average solvent accessible surface for the proteins in the analyzed subsets of Tyr-phosphorylated and unphosphorylated proteins were very close (51.5 and 51.9%, respectively).
In addition, Tyr phosphorylation is a relatively rare PTM. The total number of the predicted Tyr phospho-sites in the expression dataset was estimated to be 568, which represents 4.61% of the total number of phospho-sites in the dataset. This figure agrees well with the estimate (~4%) provided by the "Kinexus" protein phosphorylation resource. No more than two sites of Tyr phosphorylation could be predicted in any dataset protein (Fig. 5H). Plausibly, a single phosphorylation event generally cannot significantly change the integral characteristics of a protein molecule that define its solubility. Thus, the low abundance of Tyr phosphorylation may be related to the negligible effect of this PTM on the yield of heterologous protein expression.
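As a rough back-calculation from the two figures quoted above (568 predicted Tyr sites, representing 4.61% of all predicted phospho-sites), the implied size of the full set of predicted phospho-sites can be estimated. The total is not stated in the text, so the value below is an inference only, and the symbols are introduced here for illustration.

```latex
N_{\text{total}} \approx \frac{N_{\text{Tyr}}}{0.0461} = \frac{568}{0.0461} \approx 1.23 \times 10^{4}
```

On this estimate, predicted Ser/Thr sites outnumber predicted Tyr sites by roughly 20 to 1.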
The existence of the correlations presented in supplemental Fig. S3, A and B, raises a question about the extent to which the algorithms for PTM prediction may reflect physicochemical properties of proteins. Evidently, some PROSITE consensus patterns are dominated by amino acids of a certain type (charged, polar, hydrophobic, etc.). The presence of multiple predicted PTM sites in a protein molecule should confer the properties associated with these sites on the whole molecule. For example, the direct correlation between the number of phosphorylation sites and the content of charged residues in proteins (supplemental Fig. S3A) is based on the presence of charged residues in the multiple consensus sequences of phosphorylation sites. However, the consensus patterns of some other PTMs cannot confer unique features on proteins because they are of low abundance and lack prominent sequence determinants associated with specific physicochemical properties. For instance, the PROSITE consensus sequence for asparagine glycosylation is NX(S/T) (65). It is short, hydrophilic, neutral, and of relatively low abundance; most of the N-glycosylated proteins contain only a single site for this modification (Fig. 2B). Therefore, its presence should have only a minor impact on the total amino acid composition of a protein. Nevertheless, the polypeptides with predicted sites of N-glycosylation have a significantly lower propensity for soluble expression and a higher propensity for insoluble expression (Fig. 2A). Notably, protein folding in the region of the consensus site was shown to play an important role in the regulation of glycosylation. This region adopts a specific conformation during the glycosyl transfer reaction, although a defined secondary structure seems not to be essential for recognition of the substrate by the oligosaccharyltransferase (68).
Thus, it is plausible that the PROSITE algorithm for prediction of N-glycosylation sites may inadvertently cover some important structural determinants associated with these sites and linked to overall protein stability and solubility. Similarly, certain conformational properties may also be associated with the consensus sequences of other PTMs. At present, the structural prerequisites of most PTMs remain largely unknown.
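The argument that a short NX(S/T) site contributes little to overall amino acid composition can be illustrated with a minimal sketch that locates the consensus and computes the fraction of the sequence it covers. The function names and the optional `exclude_proline` flag are assumptions introduced here (excluding Pro at the X position is a common refinement, but the text quotes the pattern simply as NX(S/T)); this is not the prediction pipeline used in the study.

```python
import re


def find_sequons(sequence, exclude_proline=False):
    """Return 1-based positions of Asn residues that start an NX(S/T) match.

    exclude_proline is a hypothetical option that disallows Pro at X;
    by default the pattern is applied exactly as quoted in the text.
    """
    seq = sequence.upper()
    middle = "[^P]" if exclude_proline else "."
    # A zero-width lookahead reports overlapping sites such as "NNSS".
    pattern = re.compile("(?=N" + middle + "[ST])")
    return [m.start() + 1 for m in pattern.finditer(seq)]


def sequon_coverage(sequence, exclude_proline=False):
    """Return the fraction of residues lying inside any NX(S/T) match."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    covered = set()
    for pos in find_sequons(seq, exclude_proline):
        # pos is the 1-based Asn position; the site spans three residues.
        covered.update(range(pos - 1, pos + 2))
    return len(covered) / len(seq)
```

For a hypothetical 300-residue polypeptide carrying a single site, the three consensus residues make up 1% of the sequence, which is consistent with the point that composition alone is unlikely to explain the observed effect on solubility.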
In conclusion, our study demonstrates that the amenability of human polypeptide sequences to heterologous cell-free expression correlates with the presence of multiple PTM sites bioinformatically predicted in these sequences (Table 2). For each analyzed PTM, a credible explanation for the observed correlations is presented. These findings provide a number of insights into the role of multiple PTMs in the stability and solubility of heterologously expressed proteins. In a practical sense, identification of potential PTM sites in polypeptide sequences can be used for predicting expression success and optimizing heterologous protein synthesis. A combination of multiple PTMs whose occurrence correlates most robustly with protein amenability to heterologous cell-free expression may be used to set up computational algorithms for predicting the outcome of protein synthesis reactions. In addition, a number of calculated physicochemical and structural characteristics of polypeptide sequences can serve as parameters for predicting expression success (18). Currently, we are developing a discriminant-based machine-learning algorithm that utilizes multiple features of amino acid sequences to predict the success rate of heterologous protein expression.

TABLE 2
Correlations of heterologous protein expression with PTMs
"+" and "−" indicate positive and negative correlations, respectively; "O" denotes the lack of correlation; and "±" refers to the opposite tendencies of expression estimates at different values of calculated parameters.

PTM Soluble Insoluble Nonexpressed
a Data are statistically significant when the number of myristoylation sites exceeds 4.
b Data are statistically significant when the number of phosphorylation sites exceeds 12.