Genome-based Identification and Analysis of Collagen-related Structural Motifs in Bacterial and Viral Proteins*

Collagens are extended trimeric proteins composed of the repetitive sequence glycine-X-Y. A collagen-related structural motif (CSM) containing glycine-X-Y repeats is also found in numerous proteins often referred to as collagen-like proteins. Little is known about CSMs in bacteria and viruses, but the occurrence of such motifs has recently been demonstrated. Moreover, bacterial CSMs form collagen-like trimers, even though these organisms cannot synthesize hydroxyproline, a critical residue for the stability of the collagen triple helix. Here we present 100 novel proteins of bacteria and viruses (including bacteriophages) containing CSMs identified by in silico analyses of genomic sequences. These CSMs differ significantly from human collagens in amino acid content and distribution; bacterial and viral CSMs have a lower proline content and a preference for proline in the X position of GXY triplets. Moreover, the CSMs identified contained more threonine than collagens, and in 17 of 53 bacterial CSMs threonine was the dominating amino acid in the Y position. Molecular modeling suggests that threonines in the Y position make direct hydrogen bonds to neighboring backbone carbonyls and thus substitute for hydroxyproline in the stabilization of the collagen-like triple-helix of bacterial CSMs. The majority of the remaining CSMs were either rich in proline or rich in charged residues. The bacterial proteins containing a CSM that could be functionally annotated were either surface structures or spore components, whereas the viral proteins generally could be annotated as structural components of the viral particle. The limited occurrence of CSMs in eubacteria and lower eukaryotes and the absence of CSMs in archaebacteria suggests that DNA encoding CSMs has been transferred horizontally, possibly from multicellular organisms to bacteria.
Collagens, present in most multicellular organisms, are helical proteins composed of three extended polyproline type II-like chains (1). Best studied are the fibrillar collagens, which constitute important components of the extracellular matrix.
However, non-fibrillar collagens as well as proteins with shorter collagen-like regions also exist, and a collagen-related structural motif (CSM) has been identified in proteins of bacteria, bacteriophages, and viruses (2-5). Bacteria and viruses are generally believed to be unable to synthesize hydroxyproline, a residue regarded as essential for the stabilization of a triple-helical structure. However, the absolute requirement of hydroxyproline for triple-helix formation has lately been challenged (6, 7), and alternative means for the stabilization of triple-helical collagen have been proposed (8, 9). Importantly, it has been shown that three different bacterial CSMs trimerize, despite the lack of hydroxyprolines (2, 10).
This study was undertaken to investigate the occurrence of proteins containing CSMs in bacteria and viruses through an in silico approach. We found that CSMs are encoded by a minority of bacteria and bacteriophages and that these CSMs differ from human collagens in several important aspects. A novel mechanism for the stabilization of bacterial CSMs is presented. We also propose that horizontal gene transfer has contributed to the evolution of CSMs.

EXPERIMENTAL PROCEDURES
The CSMs were identified by downloading 56 bacterial proteomes from The Institute for Genomic Research (TIGR.org) homepage using the batch download function and running pattern searches in-house with MacVector (version 6.5.3; Oxford Molecular Ltd.). The pattern used was (GPP)7 with 12 allowed mismatches. Matches containing Gly in the Pro positions were excluded. The CSMs were then used in unfiltered tBLASTn searches from the NCBI microbial genome homepage against 136 eubacterial genomes, 15 archaebacterial genomes, 30 genomes of lower eukaryotes, and against available viral genomes. Obtained hits were analyzed with the same pattern in-house. Open reading frames flanking the bacterial CSM-encoding genes were analyzed by BLASTp to determine whether the gene was phage-encoded.
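The pattern search described above can be illustrated with a minimal sketch. It is not the MacVector implementation: the function name `find_csm_windows` is hypothetical, and the sketch assumes that mismatches are counted over all 21 positions of the (GPP)7 pattern and that a window is rejected whenever glycine occupies a position where the pattern expects proline.

```python
def find_csm_windows(seq, repeats=7, max_mismatches=12):
    """Return start indices of windows matching (GPP)n within the mismatch budget.

    Assumption: mismatches are counted over every position of the pattern,
    and any glycine in a proline position disqualifies the window.
    """
    pattern = "GPP" * repeats
    hits = []
    for start in range(len(seq) - len(pattern) + 1):
        window = seq[start:start + len(pattern)]
        # Exclude windows with Gly where the pattern expects Pro.
        if any(w == "G" and p == "P" for w, p in zip(window, pattern)):
            continue
        mismatches = sum(w != p for w, p in zip(window, pattern))
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits


if __name__ == "__main__":
    # Toy sequence: seven GPT triplets flanked by unrelated residues.
    print(find_csm_windows("M" + "GPT" * 7 + "K"))
```

Overlapping hits would still need to be merged into continuous GXY stretches before lengths can be reported; that step is left out here.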
Homology models of proline-rich CSMs were constructed on a trimeric ((GPP)10)3 template from a 1.3-Å x-ray structure (11). The models were built by global energy minimization using the template as a constraint. After the model structures had been minimized, conformations of all threonine side chains were sampled through an exhaustive systematic search of the χ-angles of the side chains. All molecular mechanics calculations were carried out using the ICM software (12). All statistical analyses for differences were performed using Fisher's exact test.
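For readers unfamiliar with Fisher's exact test, the following sketch shows how a two-sided p-value can be computed for a 2 x 2 table of residue counts (for example, proline versus non-proline residues in bacterial CSMs versus human collagens). The function name and the table layout are illustrative assumptions; the statistical software actually used is not stated in the text.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Probability of observing x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for x in range(lo, hi + 1) if (p := prob(x)) <= p_obs * (1 + 1e-9))


if __name__ == "__main__":
    # Hypothetical counts, not data from this study.
    print(fisher_exact_two_sided(3, 1, 1, 3))
```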

RESULTS AND DISCUSSION
Proteins with Collagen-related Structural Motifs in Bacteria and Viruses-The characteristic primary structure of collagen (glycine-X-Y repeats) and the high content of proline allowed us to construct a simple pattern, (GPP)7, for in silico detection of CSMs. The pattern was applied to 58 bacterial proteomes to detect such sequences. In addition, BLAST searches against 137 eubacterial genomes, 15 archaebacterial genomes, 30 genomes of lower eukarya, and against viral genomes with the CSMs obtained in the proteome screen, identified additional proteins containing CSMs. In total, 103 CSMs were identified. The results, which are summarized in Fig. 1A, show that proteins with a CSM are present in a minority of genomes analyzed. No CSMs were detected in the 15 archaebacterial genomes, whereas 53 CSMs were detected in the 137 eubacterial genomes analyzed. In eubacteria, CSMs are mostly found in the firmicute group (including mycobacteria and Gram-positive bacteria), and on several occasions a single genome encodes more than one such protein. The viral CSMs detected were mostly from bacteriophages (37/43), and some of them represent allelic variants. A given viral or bacteriophage genome only encodes one CSM. In the lower eukarya group, CSMs were only detected in proteins from three different strains of Plasmodium falciparum. These CSMs were omitted from further analysis because they were too few to allow any statistical analysis.
* This work was supported by grants from the Swedish Research Council (projects 7480 and 14379), the Medical Faculty, Lund University, the Foundations of Kock and Österlund, and Hansa Medical AB. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
Characteristics of Bacterial and Viral CSMs-The CSMs were generally dissimilar in primary structure, except for the periodicity of the glycines and a relatively high proline content. The length of the CSM varied from 7 (the detection limit of the pattern) to 745 continuous GXY repeats, with a mean number of 76 GXY repeats. In 18 of the 103 CSM-containing proteins identified, more than one CSM was identified. Both COOH- and NH2-terminal to the CSM, other regions of varying length and sequence were found in all proteins. A position-specific analysis of the amino acid content of the identified CSMs revealed several differences compared with human collagens (Fig. 1B). Because no statistically significant differences were noted between bacteriophage CSMs and CSMs from other viruses, these were pooled for the purpose of subsequent analyses. Perhaps most striking is the difference in proline content and the preference for proline in position X in the CSMs from bacteria and viruses. The proline content is significantly lower in these CSMs as compared with human collagens ($p = 3.7 \times 10^{-147}$ for the bacterial CSMs and $p = 2.8 \times 10^{-16}$ for the viral CSMs). In bacterial and viral CSMs the fraction of prolines found in the Y position was significantly lower than the corresponding fraction in human collagens ($p = 3.2 \times 10^{-77}$ for the bacterial and $p = 1.2 \times 10^{-59}$ for the viral CSMs). Prolines in position Y are generally hydroxylated in human collagen; thus, the relative absence of prolines in this position among CSMs from bacteria and viruses probably reflects their inability to synthesize hydroxyproline. In this context it should be mentioned that bacterial homologues of eukaryotic prolyl hydroxylases could not be identified despite extensive similarity searches.
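A position-specific analysis of this kind can be sketched as follows. The helper names `position_composition` and `fraction` are hypothetical, and the sketch assumes that each CSM is supplied as an uninterrupted run of GXY triplets starting at a glycine; any incomplete trailing triplet is ignored.

```python
from collections import Counter


def position_composition(csm):
    """Count residues in the X and Y positions of a sequence read as GXY triplets."""
    x_counts, y_counts = Counter(), Counter()
    for i in range(0, len(csm) - 2, 3):
        g, x, y = csm[i:i + 3]
        if g != "G":
            raise ValueError(f"expected Gly at position {i}, found {g!r}")
        x_counts[x] += 1
        y_counts[y] += 1
    return x_counts, y_counts


def fraction(counts, residue):
    """Fraction of positions occupied by the given residue (0.0 if empty)."""
    total = sum(counts.values())
    return counts[residue] / total if total else 0.0


if __name__ == "__main__":
    # Toy sequence, not one of the CSMs identified in this study.
    x, y = position_composition("GPTGPTGAT")
    print(fraction(x, "P"), fraction(y, "T"))
```

The resulting per-position counts are the kind of input that a 2 x 2 comparison against human collagens would require.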
The CSMs identified contained a significantly higher proportion of threonine ($p = 7.6 \times 10^{-206}$ for bacterial CSMs and $p = 1.9 \times 10^{-36}$ for viral CSMs) and glutamine ($p = 2.1 \times 10^{-12}$ for bacteria and $p = 4.1 \times 10^{-14}$ for viruses) in the Y position as compared with human collagen. A minority of the bacterial CSMs (17 of 53) had more than 50% of Y positions occupied by threonine (mean 94%). In these threonine-rich CSMs, the X position is typically occupied by prolines, alanines, or serines, whereas charged residues are less common than in the other bacterial CSMs and in human collagen. The proteins containing threonine-rich CSMs are clustered in five bacterial species: the spore-forming human pathogens Clostridium difficile, Bacillus anthracis, and Bacillus cereus, the nitrogen-fixing Mesorhizobium loti, and the sulfur-metabolizing Desulfitobacterium hafniense. To determine whether threonine can influence stability of a collagen-like triple-helix, we built homology models of six representative proline-rich CSMs on a trimeric ((GPP)10)3 template constructed from a 1.3-Å x-ray structure (11). The resulting models show that threonine in the Y position is able to form direct interchain hydrogen bonds to backbone carbonyls in its energetically most favored conformation (Fig. 2). When the amino acids in the X and Y positions of one threonine-rich CSM were switched, the threonines were not able to form interchain hydrogen bonds (data not shown). Interestingly, there are indications that threonine in the Y position can also stabilize a triple-helical structure through indirect hydrogen bonding or through glycosylations (8, 13, 14).
Most of the remaining 83 CSMs could instead be classified as rich in prolines (>30% in X and Y position) or charged (>45% charged residues in X and Y). Eight of the identified CSMs did not meet any of the criteria, whereas five CSMs fell into more than one class (Fig. 3A). Not only the prolines, but also the charged residues, were found to be unevenly distributed between the X and Y positions (Fig. 3B). Most often, negatively charged residues are found in position X, whereas positively charged residues (arginine and lysine) are found in the Y position. This may be related to triple-helix stability, because arginine in the Y position in particular stabilizes trimers (9). Negative charges in the Y position are generally destabilizing, with the exception of GR/KD triplets (15). Among the charged CSMs with a net negative charge in the Y position, aspartate is more common than glutamate, and in 50% of the cases the aspartate is preceded by a positively charged residue. In addition to stabilization, charges in collagen are important for the interactions between collagen and other macromolecules (16).
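The three classification criteria can be expressed compactly in code. The sketch below is illustrative only: the function name `classify_csm` is hypothetical, the thresholds are those stated in the text and in the legend to Fig. 3, and charged residues are assumed to be Asp, Glu, Lys, and Arg (histidine is not counted).

```python
CHARGED = set("DEKR")  # assumption: His is not treated as charged


def classify_csm(csm):
    """Return the sorted list of classes a GXY-repeat sequence falls into."""
    n = len(csm) // 3
    if n == 0:
        return []
    xs = csm[1:3 * n:3]  # residues in the X positions
    ys = csm[2:3 * n:3]  # residues in the Y positions
    xy = xs + ys
    classes = []
    if ys.count("T") / len(ys) > 0.50:
        classes.append("threonine-rich")
    if xy.count("P") / len(xy) > 0.30:
        classes.append("proline-rich")
    if sum(r in CHARGED for r in xy) / len(xy) > 0.45:
        classes.append("charged")
    return sorted(classes)


if __name__ == "__main__":
    # Toy sequences, not CSMs from this study.
    for seq in ("GPT" * 10, "GEK" * 10, "GAS" * 10):
        print(seq[:9], classify_csm(seq))
```

Because the criteria are evaluated independently, a sequence can fall into more than one class or into none, which mirrors the overlap reported in Fig. 3A.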
Function of the CSM-The function of CSMs in bacteria and viruses is not well understood. The few CSMs studied have been shown to mediate trimerization or elongation (2, 10, 17, 18) of proteins found at bacterial or bacteriophage surfaces (2, 18-25). Furthermore, the CSM of one protein from Streptococcus sanguis has been shown to mediate aggregation of platelets (26), although it is less clear whether the CSMs from two Streptococcus pyogenes proteins contribute to the adhesive properties of these proteins (21, 24). The viral proteins containing CSMs identified in this study were mainly encoded by bacteriophages and had to a large extent been annotated or could easily be annotated by similarity searches. The majority of these proteins were tail-fiber or host-specificity proteins, involved in the interactions between bacteriophages and bacteria (see e.g. Ref. 27). Only a minority of the bacterial proteins containing a CSM could be annotated. Three proteins had previously been described as surface or spore-associated (20-25); from pattern and similarity searches, an additional 13 proteins could be classified as cell wall-attached (28) or transmembrane proteins. The non-collagen-related parts of the proteins showed surprisingly little sequence similarity to other proteins, making a functional classification of the bacterial proteins difficult. The presence of CSMs in proteins with so little sequence similarity indicates that the CSM is, indeed, a structural motif that elongates or trimerizes a variety of different, mainly extracellular, proteins. The possibility remains, however, that some CSMs may mediate binding of bacterial or bacteriophage proteins to other macromolecules.
Evolutionary Aspects-Because so few bacterial and unicellular eukaryotic genomes encode CSMs and because archaebacteria completely lack such sequences, it seems unlikely that CSMs evolved before the diversification of archaea, bacteria, and eukaryotes. It appears more plausible that genes encoding CSMs have moved horizontally or arisen on several occasions during evolution than that they have been selectively lost among archaea, bacteria, and lower eukaryotes. Horizontal gene transfer between bacteria and vertebrates has been proposed to be relatively common (29, 30), although this view has also been questioned (31). If horizontal gene transfers of CSMs have occurred, we believe that the direction of transfer has been from eukaryotes to bacteria. After horizontal transfer of sequences encoding a CSM, bacteriophages may have promoted further horizontal transfer within the bacterial kingdom, leading to the "patchy" distribution seen for CSMs among bacteria. We have no evidence for a role of viruses or bacteriophages in the putative horizontal transfer between eukaryotes and bacteria, but this cannot be excluded. After the putative horizontal transfer, extensive rearrangements seem to have occurred in the bacterial CSMs, because there are limited sequence similarities except for the spacing of glycines in the bacterial CSMs. The limited sequence similarities also made phylogenetic analyses meaningless. Interestingly, the threonine-rich CSMs are very dissimilar in amino acid composition from human collagens and from the other bacterial CSMs. Instead, the threonine-rich CSMs have their closest homologues in hydrothermal vent worm cuticle collagens (14). This suggests that several events of horizontal gene transfer of the genetic material encoding CSMs could have occurred during evolution. An alternative evolutionary explanation for the threonine-rich CSMs is that these have arisen de novo in bacteria, followed by horizontal spread to a few other bacterial species.
The threonine-rich CSMs are highly repetitive and could have evolved from duplications of a short DNA segment.
The presence of CSMs in bacteria and viruses underlines the importance of collagen as a structural motif in nature. This work also suggests that alternative means for triple-helix stabilization probably operate in bacteria and viruses, and future elucidation of the structure and function of bacterial and viral CSMs represents an interesting scientific task.

FIG. 3. Amino acid composition of bacterial and viral CSMs.
A, the CSMs identified in bacterial and viral proteins (53 and 47, respectively) were categorized into three groups. The percentage of CSMs falling into each group is given. The threonine-rich CSMs have more than 50% of Y positions occupied by threonine and the proline-rich have more than 30% of X and Y positions occupied with proline, whereas charged CSMs have more than 45% charged residues in the X and Y positions. Some CSMs cannot be classified into one of these groups, whereas some meet more than one criterion. B, the skew in charge distribution between the X and Y position of GXY triplets from charged CSMs is shown. For every charged CSM, the net charge (K/R = +1, D/E = -1) of residues in the Y position minus the net charge of residues in the X position was divided by the total number of charged residues. Thus, a CSM with all positively charged residues in the Y position and all negatively charged residues in the X position will obtain a value of +1, whereas a CSM with all positively charged residues in the X position and all negatively charged residues in the Y position will obtain a value of -1. Each dot in the diagram represents one CSM.
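The charge-skew value defined in panel B can be written out as a short function. This is a sketch of the stated definition only; the function name `charge_skew` is hypothetical, and the input is assumed to be an uninterrupted run of GXY triplets containing at least one charged residue.

```python
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}  # K/R = +1, D/E = -1


def charge_skew(csm):
    """(net charge in Y - net charge in X) / total number of charged residues."""
    n = len(csm) // 3
    xs = csm[1:3 * n:3]  # residues in the X positions
    ys = csm[2:3 * n:3]  # residues in the Y positions
    net_x = sum(CHARGE.get(r, 0) for r in xs)
    net_y = sum(CHARGE.get(r, 0) for r in ys)
    total_charged = sum(r in CHARGE for r in xs + ys)
    if total_charged == 0:
        raise ValueError("sequence contains no charged residues in X or Y")
    return (net_y - net_x) / total_charged


if __name__ == "__main__":
    # Toy sequences, not CSMs from this study.
    print(charge_skew("GEK" * 5), charge_skew("GKE" * 5))
```

A sequence with acidic residues in X and basic residues in Y gives +1, and the reverse arrangement gives -1, in agreement with the limits described in the legend.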