Molecular Recognition in the Assembly of Collagens: Terminal Noncollagenous Domains Are Key Recognition Modules in the Formation of Triple Helical Protomers*

The α-chains of the collagen superfamily are encoded with information that specifies self-assembly into fibrils, microfibrils, and networks that have diverse functions in the extracellular matrix. A key self-organizing step, common to all collagen types, is trimerization that selects, binds, and registers cognate α-chains for assembly of triple helical protomers that subsequently oligomerize into specific suprastructures. In this article, we review recent findings on the mechanism of chain selection and infer that terminal noncollagenous domains function as recognition modules in trimerization and are therefore key determinants of specificity in the assembly of suprastructures. This mechanism is also illustrated with computer-generated animations.

The assembly of all collagen suprastructures begins with the association of three type-specific α-chains (trimerization) that subsequently intertwine to form triple helical protomers, the building blocks of larger assemblies (1-5). Protomers are homotrimers or heterotrimers, composed of up to three different α-chains. They have in common at least one triple helical collagenous domain of varying length and two noncollagenous (NC) domains of variable sequence, size, and shape that are positioned at the N and C termini, designated herein as N-NC and C-NC domains, respectively. The terminal NC domains are excised, modified, or incorporated directly into the final suprastructure, depending on protomer type and function. Subsequently, specific protomers oligomerize into distinct suprastructures involving interactions that form end-to-end connections, lateral associations, and supercoiling of helices. Thus, protomer formation and oligomerization involve pivotal recognition steps that target specific α-chains to assemble into a particular type of suprastructure.
How the α-chains selectively recognize each other is a fundamental question in matrix biology that remains largely unanswered. Early studies on collagens I and III suggested that the C-NC domains play a critical role in the trimerization step that involves selection, binding, and registration of three α-chains (6-9). In this article, we review recent findings on the trimerization of collagens, with an emphasis on collagens I and IV, which are the best understood.

Fibril-forming Collagens
Fibril-forming collagens include the widely distributed types I, III, and V; types II and XI, prominent in cartilage and eye; and the newly discovered types XXIV and XXVII (10-12). All of the 12 fibril-forming α-chains share a long uninterrupted collagenous domain flanked by N- and C-terminal noncollagenous propeptides. The α-chains assemble into at least 12 type-specific protomers, characterized as homo- and heterotrimers. The terminal propeptides are further cleaved by specific proteases, which promote oligomerization and fibril formation.
Type I collagen is composed of α1- and α2-chains, which preferentially form an α1·α1·α2 heterotrimer (Fig. 1). However, if only the triple helical domains are considered, the (α1)₃ homotrimer would be the preferred form (13,14). The pro-α1 homotrimer is formed as the default structure in the absence of pro-α2-chain, whereas the opposite is not true (15). However, if the α2 C-NC domain is replaced with an artificial trimeric NC domain, an α2 homotrimer can form (16,17). Thus the α2 C-NC domain is the key domain required for heterotrimer assembly, although the α1 C-NC domain contains sufficient information for directing homotrimer formation. Disulfide bonding in the α2 C-NC domain is crucial for heterotrimer formation (18), particularly within the most C-terminal sequence of the α2 C-NC domain (19,20).
The putative conformation and energy-minimized structures of the human α1 C-NC and α2 C-NC domains (21,22) reveal that each contains five subdomains (Fig. 1A) (8,23). Domain I folds into subdomains Ia and Ib without α-helical or β-sheet conformations, which is inconsistent with suggestions that subdomain Ia participates in trimerization by forming α-helical coiled-coils (23-25). Subdomain Ib contains the interchain disulfide bonds in the assembled C-NC trimer. Domains II and IV fold into globular regions G₁ and G₂, respectively. These are linked by an antiparallel β-sheet assembled from domains III and V (Fig. 1).
Energy minimization and molecular dynamics simulations have revealed key steps in the folding, docking, and assembly of two α1 C-NC domains (α1₁ and α1₂) and one α2 C-NC domain into a heterotrimer (26). Folding is initiated at the C terminus by bimolecular association of the α2 and α1₂ domains, near the junction of domain III and domain IV-G₂, and proceeds toward the N terminus (Fig. 1B). The α2 domain V appears to have a dominant effect on heterotrimer formation (19,20). Trimerization proceeds via a second interaction between the α2-α1₂ dimer and the α1₁ domain at domain II-G₁, after which the interchain disulfide bonds become established in domain Ib. Folding of domain Ia is the last and slowest folding step but eventually drives the folding through the C-telo-Ia junction (26,27). Overall, the α2 C-NC domain appears to provide the driving force for heterotrimer formation. Computer modeling of collagen I C-propeptide (26) also suggests that the homotrimer could resemble a cruciform as seen in the collagen III C-propeptide (28).
Putative trimerization control sequences have been located within the C-NC domains (17,29). Replacing the α2(I) C-NC domain of the α2(I)-chain with the homotrimer-directing α1(III) C-NC domain permitted α2·α2·α2 homotrimerization (17,29), demonstrating that the α1 C-NC domain is sufficient to direct homotrimer assembly of α2-chains with aligned triple helices. Exchanging specific sequences from the α1 C-NC domain of collagen III with the corresponding sequences of the α2 C-NC domain of collagen I revealed the location of a putative recognition site, a discontinuous sequence of 15 amino acids, located at the C terminus of subdomain III (Fig. 1). This α1 sequence, when transferred to the sequence-equivalent region of the α2 C-NC domain, enabled the α2-chains to homotrimerize (17,29). Thus, this recognition site is both necessary and sufficient to ensure that collagen chains discriminate between each other to form type-specific protomers (Fig. 1, A and C).

Network-forming Collagens
The network-forming collagens include types IV, VIII and X. In contrast to the fibril-forming collagens, their C-NC domains are retained in their suprastructures. Importantly, crystal structures of their C-NC domains are known, which provide insight into the mechanisms of chain selection.
Collagen IV-Type IV collagen is the major constituent of basement membranes. It is composed of a family of six homologous α-chains (α1-α6), each characterized by a long collagenous domain of ~1400 residues, interrupted by about 20 short noncollagenous sequences, which is flanked by a short N-NC sequence of 25 residues and a globular C-NC domain of 230 residues. The six α-chains assemble into three heterotrimeric protomers (α1·α1·α2, α3·α4·α5, and α5·α5·α6), which further assemble into three distinct networks (30).
Collagen IV has provided a unique opportunity to gain insight into chain recognition mechanisms. First, the six α-chains assemble into 3 specific protomers out of 76 possible combinations, reflecting a remarkable specificity for chain selection in vivo (30,31). Second, the extraordinary capacity of monomeric C-NC domains to recognize each other and reassemble in vitro into their original trimeric and hexameric compositions (32), together with their critical role in trimerization in vivo (33), indicates a recognition role in assembly and provides the means to characterize specificity and binding parameters (34). Third, the crystal structure of the C-NC hexamer is known (31,35), which reveals the nature of intermolecular interactions and their role in chain selectivity. Fourth, the known primary structure for all six C-NC domains from many species provides information needed to deduce candidate recognition sites from sequence variation (34). Finally, a study of sequence variation, coupled with structural and kinetic analysis, provides the requisite information needed to decipher the location and three-dimensional features of putative recognition sites (34).
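The combinatorial scale of this selection problem can be made concrete with a short enumeration. The sketch below is illustrative only and rests on an assumption about how the candidates are counted: ignoring chain order gives 56 distinct chain compositions for three chains drawn from six, whereas treating arrangements that differ only by rotation around the helix axis as equivalent gives 76, which matches the figure quoted above. The chain labels and function names are hypothetical.

```python
from itertools import combinations_with_replacement, product

# Labels for the six collagen IV alpha-chains (alpha1 to alpha6).
CHAINS = ["a1", "a2", "a3", "a4", "a5", "a6"]

# The three protomer compositions reported in the text.
OBSERVED = {("a1", "a1", "a2"), ("a3", "a4", "a5"), ("a5", "a5", "a6")}


def compositions(chains):
    """Unordered triples: a protomer is identified by chain content only."""
    return set(combinations_with_replacement(chains, 3))


def cyclic_arrangements(chains):
    """Ordered triples, with rotations of the same triple counted once."""
    seen = set()
    for a, b, c in product(chains, repeat=3):
        # Use the lexicographically smallest rotation as the representative.
        seen.add(min((a, b, c), (b, c, a), (c, a, b)))
    return seen


n_compositions = len(compositions(CHAINS))
n_cyclic = len(cyclic_arrangements(CHAINS))

print(f"{n_compositions} compositions, {n_cyclic} cyclic arrangements")
print(f"{len(OBSERVED)} protomers observed in vivo")
```

Under either counting, only a small fraction of the candidates is realized, which is the point the specificity argument depends on.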
Two putative recognition sites were identified in the C-NC domains as prominent mediators in the mechanism of chain recognition (34). These sites (Fig. 2D) constitute a 13-residue-long β-hairpin motif (site 1) involved in the domain swapping mechanism and a 15-residue-long variable region (site 2, VR3, docking site) with sequence hypervariability across all the chains of collagen IV (34). Accordingly, we have proposed that the combination of sequence variations at these two sites and at the neighboring C-NC domains with which they interact constitutes the code that directs chain recognition in the trimerization of collagen IV (34). The relatively higher affinity of the α2 C-NC domain for dimer formation with an α1 C-NC domain is the driving force in directing stoichiometry in the assembly of α1·α1·α2 heterotrimers, as determined by kinetic studies using surface plasmon resonance (34).
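As a rough illustration of how a hypervariable region might be located from aligned sequences, the sketch below scores each alignment column by Shannon entropy and reports the most variable window of a given width. This is not the procedure used in the cited work; the entropy measure, the function names, and the toy alignment (placeholder strings, not collagen IV sequences) are all assumptions made for the example.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def most_variable_window(aligned, width):
    """Return (start index, per-column entropies) for the window of the
    given width with the highest summed entropy."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    entropies = [
        column_entropy([seq[i] for seq in aligned]) for i in range(length)
    ]
    start = max(
        range(length - width + 1),
        key=lambda i: sum(entropies[i:i + width]),
    )
    return start, entropies


# Hypothetical toy alignment: columns 4-7 vary, all others are conserved.
TOY_ALIGNMENT = [
    "GPPGAEKTFKLM",
    "GPPGSDRVFKLM",
    "GPPGTQHIFKLM",
    "GPPGNWYLFKLM",
]

start, entropies = most_variable_window(TOY_ALIGNMENT, width=4)
print(f"most variable window starts at column {start}")
```

With real data, the window width would be set to the length of the region of interest (15 residues for site 2), and the alignment would include the C-NC domains of all six chains across species.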
Collagen VIII-Type VIII collagen is a prominent component of Descemet membrane and of the subendothelium of vascular walls (36). It is composed of highly conserved α1(VIII)- and α2(VIII)-chains, which contain a short collagenous domain of 454 residues flanked by an N-NC domain of 117 residues and a C-NC domain of 173 residues (37). The chains assemble into two distinct homotrimers that assemble into hexagonal lattices (38,39). The crystal structure of the α1 C-NC homotrimer is known (40), but the basis for the preferred homotrimer structure remains unclear.
Collagen X-Type X collagen is expressed in the endochondral growth plate. The single α1-chain contains a short collagenous domain of 154 residues, flanked by an N-NC domain of 37 residues and a C-NC domain of 161 residues (41). Its protomer is a homotrimer. The C-NC domains function as nucleation sites for trimer and multimer formation, based on experiments using recombinant C-NC domains (42). A comparison of the x-ray crystal structures of the C-NC trimers of collagens X and VIII revealed them to be highly homologous, but they differ in intermolecular contact residues that govern trimer stability. Such differences may also confer specificity (40,43).

FACIT Collagens
The FACIT collagens include types IX, XII, XIV, XVI, XIX, XX, XXI, and XXII. Collagen IX is composed of three α-chains and all others of one α-chain, each of which is characterized by short collagenous domains interrupted by several NC domains (1,2). The protomer of collagen IX is a heterotrimer, and all others are homotrimers, accounting for at least 9 distinct protomers. Unlike the fibril-forming collagens, the FACITs have significantly shorter C-NC domains: 75 residues for collagen XII and fewer than 30 residues for collagen IX, whereas those of fibrillar collagen are about 260 residues. The FACITs share a remarkable sequence homology at their COL1/C-NC1 junctions by having two strictly conserved cysteine residues separated by four residues at the NC domain. Several lines of evidence suggest that the COL1 domain at the COL1/C-NC junction (which contains the first cysteine residue) and the C-NC domain (containing the second conserved cysteine) are involved in the mechanism of chain selection in the assembly of collagens XII and XIV (44-46). Independent studies by Mazzorana et al. (44,45) and Lesage et al. (46) strongly suggest that both the collagenous portion and the NC domain are involved in formation and stabilization of protomers in collagens XII and XIV (47).
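The conserved cysteine spacing described above can be expressed as a simple sequence pattern: a cysteine, four residues of any kind, then a second cysteine. The following sketch scans a one-letter amino acid sequence for that pattern. The function name and the test sequence are hypothetical placeholders, not FACIT sequences, and a real search would be restricted to the COL1/C-NC junction.

```python
import re

# Cysteine, any four residues, cysteine. The lookahead lets the search
# report overlapping occurrences as well.
CYS_PAIR = re.compile(r"(?=(C[A-Z]{4}C))")


def find_cys_pairs(sequence):
    """Return (0-based start, matched segment) for every C-x4-C motif."""
    return [
        (m.start(), m.group(1)) for m in CYS_PAIR.finditer(sequence.upper())
    ]


# Hypothetical placeholder: Gly-Pro-Pro repeats followed by a C-x4-C motif.
junction = "GPPGPPGCAAAACKLM"
print(find_cys_pairs(junction))
```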

Beaded Filament and Anchoring Fibril Collagens
Collagen VI-Type VI collagen is ubiquitously expressed in connective tissues. It is composed of three genetically distinct α-chains, α1, α2, and α3, that form extensive beaded filaments (48-50). Each chain contains a relatively short collagenous domain of ~335 residues flanked by N-NC and C-NC globular domains. The α1- and α2-chains are similar in size and contain one N-NC domain and a C-NC domain with two subdomains, C1 and C2. In contrast, the α3-chain is much longer, with an N-NC domain that contains ten subdomains, N1-N10, and a C-NC domain that contains five subdomains, C1-C5 (48). The protomer is an α1·α2·α3 heterotrimer. Studies conducted on cell lines lacking or expressing different combinations of recombinant α-chains have shown that the α3-chain contains sequences necessary for chain association and protomer formation (48,50). Deletion studies have demonstrated that whereas the C5 subdomain of the α3-chain is required for the extracellular microfibril formation, the C1 subdomain, illustrated in Fig. 3, in all chains is sufficient for chain recognition and protomer assembly (48,50).
FIGURE 2. Protomer assembly of collagen IV. Assembly is initiated by specific interactions of the C-NC domains (A), after which folding of the triple helical domain proceeds toward the amino terminus, forming a heterotrimeric protomer. Dimerization of protomers occurs via the equatorial faces (A and B) of the NC trimers, yielding NC hexamers that are stabilized by local and extensive hydrophobic and hydrophilic forces. Within each NC trimer, the monomers recognize one another through a domain-swapping mechanism in which a β-hairpin motif (Site 1) of one monomer is swapped into a docking site (Site 2) of its swapping partner (C and D). Dimers also oligomerize at their N termini forming network suprastructures. Assembly is also illustrated by a computer-generated animation available at www.mc.vanderbilt.edu/cmb/collagen.
Collagen VII-Type VII collagen is a major component of anchoring fibrils that maintain the epidermal-dermal adherence of skin. It is composed of three identical α-chains, each consisting of a collagenous domain of 1530 residues with 19 interruptions, which is flanked by N-NC and C-NC domains (51). Protomers form anti-parallel dimers that are stabilized by disulfide bonding via the C-NC domains of adjacent protomers. Following proteolytic modification of the C-NC domain, dimers of protomers oligomerize laterally to form anchoring fibrils with large globular N-NC domains at both ends of the microfibril (52). Recent studies, using recombinant constructs and site-directed mutagenesis, have revealed that the C-NC domain mediates protomer formation and oligomerization of protomers (52).

Transmembrane Collagens
Transmembrane collagens include types XIII, XVII, XXIII, and XXV and other collagen-related proteins, such as the macrophage receptor MARCO (53). They function as cell surface receptors and matrix components (54). The α-chain of each type contains (a) an N-terminal NC domain, which comprises three subdomains (an intracellular, a single transmembrane, and an extracellular juxtamembrane linker subdomain), and (b) a large extracellular domain, which is composed of multiple collagenous domains interrupted by NC domains. The protomer of each α-chain is a homotrimer. The extracellular linker subdomain contains an α-helical coiled-coil, the conformation of which is thought to prompt trimerization and subsequent zipper-like folding of the triple helical domain. Studies using deletion constructs have shown that the N-NC domains of collagens XIII and XVII are necessary for triple helix formation, suggesting that nucleation of the triple helix occurs at the N-terminal region and proceeds in an N- to C-terminal direction, which is opposite to that of all other classes of collagens (55,56). For collagen XIII, a short sequence of 21 residues located at the juxtamembrane linker subdomain has been identified as a putative recognition site for trimerization (Fig. 3).

Other Collagens
Little is known about the assembly of collagens XV (57) and XVIII (58) and the newly discovered XXVI (59) and XXVIII (4). Each type is composed of a single α-chain that contains a collagenous domain, with frequent interruptions, flanked by N-NC and C-NC domains. The C-NC domain of collagen XVIII forms trimers in vitro, suggesting a role in protomer assembly (60). The crystal structures of fragments of the C-NC domains of collagens XV and XVIII are highly similar but distinct (61,62), and they differ greatly from the three-dimensional structures of the C-NC domains of collagens VIII, X, and IV (see above); such distinctions are consistent with a special role of C-NC domains in chain selection.

Conclusions and Future Perspectives
Over the last decade, findings have emerged on several collagen types that provide new insights into the chain selection mechanism. Based on the body of evidence reviewed here, we infer that the terminal NC domains function as recognition modules that select, bind, and register three cognate α-chains for self-assembly of triple helical protomers. The C-NC domains govern chain selection for all classes except for the transmembrane collagens, which are governed by the N-NC domains (Fig. 3). As recognition modules, they contain sites that are characterized by positive determinants of shape complementarity, electrostatic charge distribution and magnitude, and hydrophobicity for selection of cognate α-chains and negative determinants for the exclusion of all other unrelated α-chains.
Following trimerization, coupled with triple helix formation, new recognition sites are encoded, a consequence of forming quaternary structures that commit protomers to oligomerize with cognate protomers into a particular type of suprastructure such as a fibril or network. The quaternary recognition sites specify such features as enzymatic cleavage sites, end-to-end connections, lateral associations, supercoiling of helices, and cross-link formation, as well as interactions with other macromolecular complexes such as proteoglycans (63) and integrin receptors (64). Therefore, by directing formation of triple helical protomers with distinct chain compositions and quaternary structures, the NC recognition modules are key determinants of specificity that target a type of α-chain to assemble into distinct suprastructures.
Many important features remain to be elucidated about the NC recognition mechanisms. The C-NC modules of heterotrimers composed of three different chains (collagens IV, V, VI, and IX) are the most promising models for study because their trimeric forms display quaternary structural features that can be differentially analyzed with regard to the identity of recognition motifs and the cyclic order of chains in the triple helix. Such mechanistic information provides opportunities for engineering new suprastructures for biomaterials and development of therapeutic strategies that facilitate or interfere with collagen assembly in various disorders.