Structural and Genomic Correlates of Hyperthermostability

While most organisms grow at temperatures ranging between 20 and 50 °C, many archaea and a few bacteria, such as Pyrococcus or Aquifex, have been found capable of withstanding temperatures close to 100 °C or beyond. Here we report the results of two independent, large scale, unbiased approaches to identify global protein properties correlating with an extreme thermophile lifestyle. First, we performed a comparative proteome analysis using 30 complete genome sequences from the three kingdoms. A large difference between the proportions of charged versus polar (noncharged) amino acids was found to be a signature of all hyperthermophilic organisms. Second, we analyzed the water accessible surfaces of 189 protein structures belonging to mesophiles or hyperthermophiles. We found that the surfaces of hyperthermophilic proteins exhibited the shift already observed at the genomic level, i.e. a proportion of solvent accessible charged residues strongly increased at the expense of polar residues. The biophysical requirement for the presence of charged residues at the protein surface, allowing protein stabilization through ion bonds, is therefore clearly imprinted and detectable in all genome sequences available to date.

The number of solved protein structures from hyperthermophiles is increasing at a fast pace (Refs. 1-13 and references therein), due in part to structural genomics approaches. Based on comparisons between mesophiles and hyperthermophiles, three main factors have been suggested as responsible for thermostability: (i) the presence of large networks of ion pairs at the protein surface (1, 8-12), (ii) protein loop shortening (7), and (iii) an increase of protein compactness (7, 12). About a dozen additional determinants have also been mentioned, but do not appear relevant to all the structures elucidated (3-6, 8, 13). Recently, it has been proposed that adaptation to high temperature might involve different mechanisms in moderate thermophiles versus extreme thermophiles (12). This follows from the markedly different temperature dependence of hydrophobic interactions, which are optimal around 70°C (14), compared with coulombic interactions, which are optimal at higher temperatures, up to 100°C (9). Moderate thermophiles and extreme thermophiles should therefore be treated separately in the analysis of high temperature adaptation factors. Apparent discrepancies in the results of previous works might originate in a failure to do so.
The availability of an increasing number of complete genomic sequences makes global proteome analysis a powerful tool for establishing genome-function relationships. In this respect, some differences were already noted between the amino acid compositions of subsets of open reading frames (ORFs) from mesophiles and hyperthermophiles (15-18). However, previous approaches used comparisons of orthologous sequences or of structures of proteins belonging to the same family (5-8). The x-ray structures were also selected according to quality criteria. Applying these two criteria reduces the number of structures analyzed and might make the approach statistically less satisfactory. Here, we postulated that the most basic features of protein structures, amino acid composition and localization, should be imprinted in the genomes of organisms with very different lifestyles. We therefore expected that global differences should be readily detectable from a coarse but comprehensive and unbiased analysis. We computed the amino acid compositions in the ORFs of all complete genome sequences available to date (30) as well as on the protein surfaces (water-accessible surface (WAS), Ref. 19) of a large set of protein structures. Although these analyses were unbiased, since no prior selection of sequences or structures was performed, they exhibit a remarkable correlation, suggesting that the replacement of polar noncharged residues by charged ones constitutes a major stabilization mechanism in the proteins of hyperthermophilic organisms.

MATERIALS AND METHODS
Data on the amino acid composition of the fully sequenced genomes were taken from the web site of Fred Tekaia, Pasteur Institute (Paris). Organisms have been classified according to the three major domains of life and using the following abbreviations (which are used in Figs

RESULTS AND DISCUSSION
Amino Acid Composition of the Proteomes-Our data set consists of the amino acid compositions of the entire proteomes of 22 mesophiles, 1 thermophile, and 7 hyperthermophiles. We calculated the average proportion of each amino acid in all mesophiles (30-50°C) on the one hand and in all hyperthermophiles (>80°C, Table I) on the other hand. The only genome available from a moderate thermophile, that of M. thermoautotrophicum (Table I), was excluded from the comparisons for the reasons mentioned above (except for Figs. 1 and 2A). We then grouped the results for each amino acid into classes such as charged, polar noncharged, aliphatics, and aromatics, a procedure based on biophysical characters and already used in this kind of approach (5, 12). A further benefit of this procedure is that it corrects the bias in the proportion of certain amino acids induced by the strong variation in (G + C) content across the compared genomes (Fig. 1). For instance, arginine content was found to be strongly correlated with the (G + C) percentage, while lysine content is strongly anticorrelated (Fig. 1). The use of Lys or Arg by a given organism does not depend on specific chemical properties of Arg versus Lys, but only on the (G + C) content, as indicated by Fig. 1. Therefore, grouping Lys + Arg makes it possible to overcome the (G + C) content bias. As a consequence, the variations observed for the positively charged residues (Lys + Arg) stay within 50%, while up to 5-fold variations are observed for the individual amino acids (Fig. 1). It is worth noting that the (G + C) content is not correlated with the optimal growth temperature of the organisms (Fig. 1).
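The grouping step described above can be illustrated with a minimal Python sketch. The class memberships used here (charged = D, E, K, R; polar noncharged = N, Q, S, T; aliphatic = I, L, V; aromatic = F, W, Y), the function names, and the choice to pool all residues of a proteome before computing percentages are assumptions made for illustration; they are not spelled out in this section.

```python
from collections import Counter

# Assumed class membership (one-letter codes); illustrative only.
AA_CLASSES = {
    "charged": "DEKR",
    "polar_noncharged": "NQST",
    "aliphatic": "ILV",
    "aromatic": "FWY",
}


def residue_percentages(proteins):
    """Percentage of each residue, pooling all sequences of a proteome."""
    counts = Counter()
    for seq in proteins:
        counts.update(seq.upper())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no residues to count")
    return {aa: 100.0 * n / total for aa, n in counts.items()}


def class_percentages(proteins, classes=AA_CLASSES):
    """Sum the per-residue percentages within each amino acid class."""
    per_residue = residue_percentages(proteins)
    return {
        name: sum(per_residue.get(aa, 0.0) for aa in members)
        for name, members in classes.items()
    }


# Hypothetical two-protein "proteome", for demonstration only.
toy_proteome = ["MKKEEL", "QNSTAR"]
print(class_percentages(toy_proteome))
```

Summing Lys and Arg from `residue_percentages` in the same way gives the pooled positively charged fraction, which is the quantity expected to be less sensitive to (G + C) content than either residue taken alone.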
The proteome analysis in amino acid classes indicated that hyperthermophilicity is characterized by a sharp increase of charged residues, Lys and Glu, at the expense of polar noncharged residues, mainly Gln (Fig. 2A). This correlation with growth temperature holds across the three kingdoms (Fig. 2A). A slight increase in medium and large aliphatics is also observed, while the percentage of the short aliphatic alanine is lower (Fig. 2B). No significant variations are observed in aromatics or in other residues such as His, Pro, Gly, or Cys. In summary, the difference between the percentages of charged and polar noncharged amino acids (Ch-Po) is the best indicator of the organism's lifestyle (Fig. 2, A-C). A lower threshold of around 10% is observed for the Ch-Po values of the extreme thermophiles, while an upper limit of 5% characterizes the mesophiles. The moderate thermophile M. thermoautotrophicum is situated in between these two values (Fig. 2A).
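As a rough illustration of how the Ch-Po indicator separates the two lifestyles, the sketch below computes the difference and compares it with the two limits quoted above. The function names and the three output labels are hypothetical, and the limits (5% and 10%) are approximate values read from the text, not sharp decision boundaries.

```python
def ch_po(charged_pct, polar_pct):
    """Ch-Po: percentage of charged minus percentage of polar noncharged residues."""
    return charged_pct - polar_pct


def lifestyle_from_ch_po(value, mesophile_max=5.0, hyperthermophile_min=10.0):
    """Coarse label from a Ch-Po value; thresholds are approximate (assumed defaults)."""
    if value >= hyperthermophile_min:
        return "hyperthermophile-like"
    if value <= mesophile_max:
        return "mesophile-like"
    return "intermediate"


# Hypothetical class percentages, for demonstration only.
value = ch_po(charged_pct=30.0, polar_pct=17.0)
print(value, lifestyle_from_ch_po(value))
```

Values falling between the two limits are labeled "intermediate" rather than forced into either group, which matches the position reported for the moderate thermophile.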
The Case of A. pernix and P. horikoshii-Our initial analysis of the A. pernix proteome revealed a discrepancy: a Ch-Po value of only 5.2% was computed for this organism, thus in the mesophile range and far below the 10% lower limit inferred from the other true hyperthermophiles (Table I). This conflicting result prompted us to reassess the quality of the ORF annotation originally proposed for this genome. According to the original publication (20), A. pernix was found to exhibit 2694 ORFs for a total genome length of 1,669,695 base pairs, corresponding to 1 ORF per 620 nucleotides on average. This theoretical gene density is almost twice that previously reported for any other microorganism sequenced so far. The fact that this organism was also claimed to exhibit the largest proportion of ORFs without data base similarity (57.1%) suggested an anomaly in the genome annotation. The whole A. pernix ORF data set was then downloaded from the official site at the Japan National Institute for Technology Evaluation for further study. It soon appeared that the total number of nucleotides included in the protein coding fraction of the genome (1,916,052) was much greater than the total number of nucleotides in the genome (1,669,695). This anomaly is mostly due to the annotation of overlapping ORFs in different reading frames, resulting in a significant overprediction of genes and the pollution of our computations by illegitimate protein sequences.
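The arithmetic behind this consistency check can be reproduced with a few lines of Python. The function name and the returned keys are hypothetical; the three input figures are those quoted in the text. Flagging an annotation whenever the annotated coding nucleotides exceed the genome length is a simplification adopted for this sketch.

```python
def annotation_sanity(genome_length_bp, n_orfs, coding_bp):
    """Basic density figures for a genome annotation (illustrative helper)."""
    return {
        # Average genome length available per annotated ORF.
        "bp_per_orf": genome_length_bp / n_orfs,
        # Annotated coding nucleotides relative to genome length.
        "coding_fraction": coding_bp / genome_length_bp,
        # More coding nucleotides than genome nucleotides implies overlapping ORFs.
        "overpredicted": coding_bp > genome_length_bp,
    }


# Figures quoted in the text for the original A. pernix annotation.
report = annotation_sanity(genome_length_bp=1_669_695, n_orfs=2694, coding_bp=1_916_052)
print(round(report["bp_per_orf"]))          # about 620 nucleotides per ORF
print(round(report["coding_fraction"], 3))  # greater than 1
print(report["overpredicted"])
```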
We first corrected this problem by limiting our amino acid composition analyses to the subset of 628 predicted ORFs for which clear homologues have been recognized in other organisms ("Annotable ORFs," according to Ref. 20). This subset is less likely to contain illegitimate ORFs. Independently, we also reannotated the A. pernix genomic sequence using the iterative Markov model method SelfID (21) and recomputed the amino acid composition on the predicted coding regions longer than 300 nucleotides. The amino acid compositions obtained in these two independent ways are virtually identical and, as expected, very different from the composition computed on the original data set. Using the two corrected amino acid compositions, our previously defined indicator of thermophilicity, Ch-Po, was 10.6%, compatible with the values observed for the other hyperthermophilic organisms.
After the above correction for A. pernix, P. horikoshii became the hyperthermophilic organism associated with the lowest Ch-Po value; its genome had been annotated according to a similar protocol (23). P. horikoshii is also characterized by a very large fraction of ORFs without data base similarity (44.4%) and an unusually high gene density, with predicted protein coding nucleotides representing 98.5% of the total genome. We thus suspected that the same overprediction plaguing A. pernix could affect P. horikoshii, although to a lesser extent. As expected, restricting the amino acid composition analysis to the subset of 557 predicted ORFs with recognized homologs resulted in a corrected Ch-Po value of 13.6%, almost identical to that of P. abyssi. The average values presented in the plots of Fig. 2 include these corrections.
In the paper by Audic and Claverie (21), 10 bacterial genomes were checked using the described method, and only very small differences were found between the published data and the reanalyzed data. After discovering the erroneous annotations described above, we reanalyzed the 13 remaining genomes and found, as for the first 10, small differences or none.
The Water-accessible Surfaces of Amino Acid Classes-The proportion of polar noncharged amino acids is greatly decreased in the extreme thermophile organisms and partially compensated by charged amino acids (Fig. 2A). Where are these amino acids located in the protein structures? Are the variations observed in the genomes correlated with differences at the protein surfaces? To address these questions, we calculated the WAS of the various amino acids for 131 proteins from three mesophilic bacteria and for 58 proteins from hyperthermophilic bacteria or archaea (Fig. 3). The WAS values were then grouped by classes of amino acids, as defined above for the genomic analyses. The most striking difference between the WAS values was again found between charged and polar noncharged amino acids (Fig. 3A). The WAS percentage of charged amino acids increased from 22.5 in mesophiles to 27 in hyperthermophiles (+20% in relative terms), mainly due to lysines and glutamic acids. This difference (+4.5%) was almost fully compensated for by the decrease (-3.5%) in the WAS percentage of polar residues relative to mesophiles (Fig. 3B). Glutamine exhibited the largest variation among all residues, with a WAS decreased by more than 2-fold. WAS values for the other polar residues exhibited the same trend, to a lesser magnitude. The WAS percentage for aliphatic residues showed little change at 6.0-6.5% of the total surface area. In contrast, alanines are decreased by 2-fold, and histidines by 1.5-fold (Fig. 3A). Finally, the exposed area of Gly, Cys, and aromatic residues was found to be unchanged between mesophiles and hyperthermophiles. The differences observed in the above average values are comparable with those computed between individual sets of proteins from the three mesophilic or the three hyperthermophilic organisms, reinforcing their significance (Fig. 3B).
Thus, our results indicate that the large differences in surface accessibilities of amino acid classes are correlated with the large variations observed in the overall proteome abundance between the two sets of organisms.
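The surface bookkeeping used in this section can be sketched as follows, assuming that a per-residue accessible area (in square angstroms) has already been obtained from a surface calculation program, which is not shown. The class memberships, the function names, and the input layout (pairs of one-letter code and area) are assumptions for illustration. The second helper reproduces the distinction between a shift in percentage points and a relative change, using the figures quoted above for charged residues.

```python
# Assumed class membership (one-letter codes); illustrative only.
AA_CLASSES = {
    "charged": "DEKR",
    "polar_noncharged": "NQST",
    "aliphatic": "ILV",
    "aromatic": "FWY",
}


def was_class_percentages(residue_areas, classes=AA_CLASSES):
    """Share of the total accessible surface contributed by each class.

    residue_areas: iterable of (one-letter residue code, accessible area).
    Residues outside the listed classes are pooled under "other".
    """
    residue_to_class = {aa: name for name, members in classes.items() for aa in members}
    area_by_class = {name: 0.0 for name in classes}
    area_by_class["other"] = 0.0
    total = 0.0
    for aa, area in residue_areas:
        area_by_class[residue_to_class.get(aa.upper(), "other")] += area
        total += area
    if total == 0:
        raise ValueError("total accessible surface is zero")
    return {name: 100.0 * a / total for name, a in area_by_class.items()}


def surface_shift(mesophile_pct, hyperthermophile_pct):
    """Return (difference in percentage points, relative change in percent)."""
    points = hyperthermophile_pct - mesophile_pct
    return points, 100.0 * points / mesophile_pct


# Hypothetical per-residue areas, for demonstration only.
toy_surface = [("K", 100.0), ("E", 50.0), ("Q", 30.0), ("G", 20.0)]
print(was_class_percentages(toy_surface))

# Figures quoted in the text for charged residues: 22.5% versus 27%.
print(surface_shift(22.5, 27.0))
```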
Concluding Remarks-Coarse but extensive approaches at the genome or structure level are thus able to identify discriminant factors between mesophiles and extreme thermophiles. This implies the existence of global structural features associated with hyperthermostability common to thermophilic bacteria and archaea. The equal increase of subclasses of oppositely charged residues (mostly Lys, Arg, and Glu) in hyperthermophiles most likely derives from the increased number of ion pairs observed at the surface of their proteins. This is the genomic signature of a thermodynamic advantage resulting from the increased stability of coulombic interactions with temperature (itself due to the decrease of the dielectric constant of water, Ref. 9). Furthermore, in most structures of hyperthermophilic proteins, the existence of long chains of ion pairs providing cooperative stabilization has been demonstrated (1-3).
Interestingly, the residues whose proportions exhibit the largest relative decreases in the genomes of hyperthermophiles are Asn and Gln, already known to be the most temperature sensitive. It is tempting to relate the observed Glu/Gln balance to a peculiarity in the metabolism of hyperthermophiles. For instance, the Gln tRNA synthetase gene is missing in the bacterium A. aeolicus (18) and the archaeon M. jannaschii (24). In both organisms, Gln originates from the amidation of glutamic acid loaded on the Gln tRNA. Conversely, a completely different class of Lys tRNA synthetase gene has been found to exist in some hyperthermophiles, where the proportion of lysine is much increased. Missing, different, or more efficient aminoacyl tRNA synthetases may contribute to various adaptive mechanisms and play a crucial and general role in the pathway to hyperthermostability.