Candidate Cell and Matrix Interaction Domains on the Collagen Fibril, the Predominant Protein of Vertebrates

Type I collagen, the predominant protein of vertebrates, polymerizes with type III and V collagens and non-collagenous molecules into large cable-like fibrils, yet how the fibril interacts with cells and other binding partners remains poorly understood. To help reveal insights into the collagen structure-function relationship, a data base was assembled including hundreds of type I collagen ligand binding sites and mutations on a two-dimensional model of the fibril. Visual examination of the distribution of functional sites, and statistical analysis of mutation distributions on the fibril suggest it is organized into two domains. The “cell interaction domain” is proposed to regulate dynamic aspects of collagen biology, including integrin-mediated cell interactions and fibril remodeling. The “matrix interaction domain” may assume a structural role, mediating collagen cross-linking, proteoglycan interactions, and tissue mineralization. Molecular modeling was used to superimpose the positions of functional sites and mutations from the two-dimensional fibril map onto a three-dimensional x-ray diffraction structure of the collagen microfibril in situ, indicating the existence of domains in the native fibril. Sequence searches revealed that major fibril domain elements are conserved in type I collagens through evolution and in the type II/XI collagen fibril predominant in cartilage. Moreover, the fibril domain model provides potential insights into the genotype-phenotype relationship for several classes of human connective tissue diseases, mechanisms of integrin clustering by fibrils, the polarity of fibril assembly, heterotypic fibril function, and connective tissue pathology in diabetes and aging.


Type I collagen is the most abundant protein in humans and other vertebrates, comprising much of the fibrous extracellular matrix scaffold of bones, tendons, skin, and many other tissues (1-4). In general, type I collagen and its binding partners are proposed to provide mechanical strength and form to tissues. Collagenous scaffolds are laid down and remodeled by cells and are also a predominant substrate for cell interactions, migration, and differentiation. Consequently, various debilitating human diseases are associated with type I collagen mutations, including osteogenesis imperfecta (OI, brittle bone disease), Ehlers-Danlos syndrome, vascular disorders, and others (3,5). Type I collagen is also employed in human medicine as hemostatic sponges and implants to repair wounds and in tissue engineering applications as scaffolds (6).
Type I collagen is synthesized in the endoplasmic reticulum as α1 and α2 procollagen chains, each encoded by separate genes that are translated into proteins somewhat longer than 1000 amino acid residues (3,7). Nucleation domains on the C-terminal propeptide promote the polymerization of two α1 and one α2 chains into the procollagen triple helical monomer (Fig. 1, A and B). The triple helical domain of procollagen is composed of contiguous glycine-X-Y tripeptide repeats, with the obligate glycine in the first position as its side chain is the only one small enough to fit within the coiled-coil of the triple helix. Extracellularly, N- and C-proteinases remove the globular termini of procollagen, and every ~67 nm along the fiber axis, five monomers assemble in a quarter-staggered fashion to form part of the supramolecular "helix," the microfibril. Each microfibril, the proposed subunit of the collagen fibril (Fig. 1, C-E), and its immediate microfibrillar neighbors are connected by N- and C-terminal intermolecular cross-links. The basic repeating morphological structure of the fibril is the D-period, ~67 nm long, and composed of one overlap and one gap zone. Each D-period contains the complete monomer sequence derived from overlapping consecutive elements of five monomers (Fig. 1C, box). Other collagens, proteoglycans (PGs), and matrix macromolecules may further assemble with the fibril to impart tissue-specific properties to the heterotypic polymer (2,4,8). However, it is unclear how cells, matrix molecules, and other factors interact with the heterotypic collagen fibril. Therefore, a comprehensive structure-function model of the protein has been lacking.
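The glycine-X-Y rule described above lends itself to a simple check. The following is a minimal sketch in Python that flags first-position residues that are not glycine in a one-letter sequence. The function name and the example fragments are illustrative assumptions, not part of the original analysis, and the reading frame is assumed to begin on a first-position glycine.

```python
def first_position_violations(seq: str) -> list[int]:
    """Return 1-based positions of first-position residues that are not glycine.

    Assumes `seq` is a one-letter amino acid string whose first residue is the
    first position of a Gly-X-Y triplet (hypothetical helper for illustration).
    """
    return [i + 1 for i in range(0, len(seq), 3) if seq[i] != "G"]


# Illustrative fragment containing the GFPGER motif discussed in the text;
# hydroxyproline is written as P, following the convention of the collagen map.
wild_type = "GFPGERGVQGPP"
# Hypothetical glycine-to-alanine substitution at the third triplet.
mutant = "GFPGERAVQGPP"

print(first_position_violations(wild_type))
print(first_position_violations(mutant))
```

A substitution at any position returned by this check would correspond to a first-position glycine site, the class of residue at which the OI-associated mutations discussed later in the text are analyzed.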
To help elucidate type I collagen structure-function relationships, a data base of functional domains, ligand binding sites, and human disease-associated mutations mapping to type I collagen was previously assembled and presented as a two-dimensional map of the collagen fibril (9). It was reported that ligand-binding hot spots appear on collagen, and the positions of certain classes of mutations may correlate with disease phenotype. Subsequently the molecular, microfibrillar, and fibrillar structures of type I collagen have been determined through x-ray diffraction (10).
This study is a theoretical analysis of the interrelationships between hundreds of new functional domains, ligand-binding sites, and mutations on the collagen map. Moreover, the map is analyzed alongside a three-dimensional model of the collagen microfibrillar structure that was recently determined (11). This analysis reveals novel insights into how the type I collagen fibril, and perhaps collagen fibrils in general, are organized to fulfill their crucial role as scaffolds for cell interactions and as the predominant structural elements of vertebrate tissues.

EXPERIMENTAL PROCEDURES
Ligand Binding Sites and Functional Domains-Positions of binding sites and functional domains were obtained from the literature and are indicated by labeled boxes placed next to the relevant sequences. Primary literature references for sites indicated on an earlier version of the map appear in the supplemental materials. Recently reported sites are referenced in legend to Fig. 2. Zones of interactions between ligands and broad regions of collagen fibrils, as observed by electron microscopy, are indicated by colored overlays. Many binding site locations were approximated based on low resolution approaches, such as electron microscopy of macromolecular complexes. On the other hand, the use of collagen model triple helical peptides (THPs) (12,13), in some cases, including novel peptide Toolkits composed of overlapping, 27-residue segments of collagen sequences (14), has allowed high resolution mapping of the positions of various ligand binding sites and functional domains on the collagen monomer.
In vivo, type I collagen-ligand interactions may depend upon the tissue source. Moreover, collagen fibrils may be heterotypic, i.e. contain other collagens such as types III and V (15), whose arrangements in the fibril are incompletely understood, and that could impact fibril-ligand binding. However, despite the degree of uncertainty implicit in the map, and that some discrepancies exist in the data base (9), the approach followed here is to identify fundamental features of collagen structure and function based on numerous lines of evidence and by focusing on the best characterized functional elements of collagen.
FIGURE 1. Assembly and structure of the type I collagen fibril. A fragment of a single type I collagen triple helix (monomer) is depicted (A). Type I collagen is secreted as a procollagen monomer ≈300 nm long; extracellularly, the propeptides (N and C) are cleaved by proteinases (dashed vertical lines, B) and the tropocollagen monomers assemble in a staggered fashion into collagen fibrils (C), where one D-period repeat (expanded two-dimensional view of 67 nm segment of microfibril, box) contains the complete collagen sequence from elements of the five monomers and includes an overlap and gap zone. The subunit structure of the fibril can be considered to be the D-periodic 5-mer referred to as the microfibril (D). Collagen fibrils appear as periodic banded structures by electron microscopy; arrow, left border of overlap zone; particles are heparin-albumin gold conjugates used to map the heparin-binding site (E). The collagen map in Fig. 2 represents an expanded view of one D-period. Modified from a figure reproduced from J. Cell Biol. by copyright permission of The Rockefeller University Press (53).
Identifying Interrelationships between Sites-Relevant relationships between binding sites and functional domains were identified in four ways. First, as sequences on the collagen monomer to which more than one ligand has been shown to bind, or that are near neighbors; second, as sequences that fall within the borders of a fibril region shown to bind a particular ligand; and third, as neighboring binding sites on adjacent monomers within the D-periodic packing scheme. In this last case it is proposed that interactions between ligands on adjacent monomers may be possible if their binding sites align vertically within the D-period, and if their two-dimensional structure allows them to simultaneously bind more than one triple helix and reach another ligand or ligand-binding site on neighboring monomer(s). Here, sites on two or more monomers are considered as aligned vertically if they overlap with or fall close to any axis drawn perpendicular to the long axes of the monomers, and joining monomers 1 and 5. Fourth, select data from the two-dimensional map were correlated with the known three-dimensional packing structure of collagen within the fibrillar context (10,16).
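The vertical-alignment criterion described under "Identifying Interrelationships between Sites" can be illustrated with a short sketch. It is not the procedure used to build the map. It assumes a D-period of 234 residues (a commonly cited value corresponding to ~67 nm, not stated in the text), a residue numbering that starts at 1 at the N-terminal end of the triple helix, and sites that do not straddle a D-period boundary. The function names, the `tolerance` parameter, and the residue range 268-273 are hypothetical.

```python
D_RESIDUES = 234  # assumed residues per D-period (~67 nm); not given in the text


def segment(residue: int) -> int:
    """Monomer segment (1-5) of the D-periodic map that contains the residue."""
    return (residue - 1) // D_RESIDUES + 1


def d_offset(residue: int) -> int:
    """Zero-based horizontal position of the residue within one D-period."""
    return (residue - 1) % D_RESIDUES


def d_range(site: tuple[int, int]) -> tuple[int, int]:
    """Offsets of a (start, end) residue range; the range must stay in one segment."""
    lo, hi = d_offset(site[0]), d_offset(site[1])
    if lo > hi:
        raise ValueError("site straddles a D-period boundary; split it first")
    return lo, hi


def aligned_vertically(site_a, site_b, tolerance: int = 0) -> bool:
    """True if two sites overlap, or lie within `tolerance` residues, along the D-period.

    Simplification: closeness across the D-period boundary (offset 0 versus
    offset 233) is not considered.
    """
    a_lo, a_hi = d_range(site_a)
    b_lo, b_hi = d_range(site_b)
    return a_lo <= b_hi + tolerance and b_lo <= a_hi + tolerance


gfpger = (502, 507)        # integrin-binding sequence named in the text
glpger = (127, 132)        # second candidate integrin site named in the text
hypothetical = (268, 273)  # made-up range exactly one D-period before 502-507

print(segment(502), d_offset(502))
print(aligned_vertically(gfpger, hypothetical))
print(aligned_vertically(gfpger, glpger))
```

The actual map may use a different numbering origin or D-period registration, so the offsets printed here should be read as an illustration of the bookkeeping and not as positions on Fig. 2.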
Correlating Two-dimensional Collagen Map Data with a Three-dimensional Collagen Microfibril Model-The collagen microfibril model used in this study was composed from the packing structure of rat tendon type I collagen molecules observed in situ as described (10). Fiber diffraction data collected from native and derivatized (heavy atom labeled) rat tail tendons were used to solve the in situ structure of the collagen molecules and microfibril using multiple isomorphous replacement (16). A molecular model was constructed based on the primary sequences of the α1 and α2 chains of rat type I collagen, and the superhelical parameters were determined from collagen-like peptide crystallographic structure determinations. The electron density map representing the experimentally determined microfibril has a resolution of 0.516 nm in the direction of the fiber (collagen molecule) axis and 1.11 nm in the direction perpendicular to this. To determine the position of functional sites from the two-dimensional collagen map on the three-dimensional microfibril model, solvent-accessible surface calculation and rendering were performed using SPOCK (17) with the default probe size of 0.14 nm. Distances between surface sites were determined by line-of-sight measurement of atoms central to each site and located at the microfibril surface, assembled by using the instructions and coordinates provided by RCSB structure file 1Y0F (10). To construct images from these analyses, the Cα "worm" traces of relevant portions of individual triple helices were marked using a semi-transparent surface rendering. The semi-transparent surface was then rendered in the appropriate colors to represent the positions of relevant functional sequences and binding sites. That rat and human collagen protein sequences are highly homologous justifies the approach of localizing functional domains of human type I collagen on the rat type I collagen microfibril.
Three-dimensional Modeling of Integrin-Collagen Interactions-Molecular modeling of α2β1 integrin I-domain binding to the collagen triple helix was performed on a Silicon Graphics (Octane) computer system using the SYBYL software package, version 7.2 (Tripos Inc.). A model of the collagen I triple helix was created (18) with a 36-amino acid peptide spanning the α1-glycine 475 to α1-asparagine 510 region and a corresponding α2-glycine 475 to α2-alanine 510 region of human collagen IA, in which a fructopyranose residue was linked to α2-lysine 479. To minimize end group effects, the amino terminus was substituted with an N-acetyl group and the C terminus with NHCH3. The model was energy-minimized using a conjugate gradient method and subjected to repeating cycles of molecular dynamics using Kollman force fields and united atoms (19). To illustrate the interaction of the α1(I) GFPGER 502-507 sequence with the integrin I-domain, a crystal structure of a complex between the I-domain and a triple helical collagen peptide was used (Protein Data Bank, ID code 1dzi). The intermolecular energy of the interaction was analyzed with the SYBYL/Dock module to identify possible binding conformations. Surface calculations of the molecules were analyzed using the SYBYL/Molcad module.
Human Mutations-Mutations were obtained from the Database of Human Type I and Type III Collagen Mutations (www.le.ac.uk/genetics/collagen/), the OI consortium mutation data base (5,20), or published elsewhere (20-27). Unpublished mutations were detected by DNA sequencing (28) and were identified in patients referred to a clinical DNA diagnostics laboratory (Reading, PA) for mutation analysis of the COL1A1 and COL1A2 genes. Clinical diagnosis of OI was made by the referring medical personnel.
Collagen Sequence Analysis-Sequence homologies between human fibrillar collagens were calculated using the mature collagen primary sequences and the ClustalW function of MacVector 9.0 (Accelrys). In homology determinations between two proteins the number of homologous residues divided by the number of residues in the larger protein yielded the percent homology value. The find feature of the program was used to search for functional domain sequences of interest.
Statistics-Statistical analyses of mutation distributions on the collagen fibril appear under "Results" and in the supplemental materials.
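The percent homology definition given under "Collagen Sequence Analysis" (homologous residues divided by the length of the larger protein) can be written out as a small sketch. This is not MacVector's implementation. It assumes the two sequences have already been aligned, that gaps are written as "-", and that "homologous" is taken to mean identical residues at an aligned position, which is a simplifying assumption; the function name and example strings are hypothetical.

```python
def percent_homology(aligned_a: str, aligned_b: str) -> float:
    """Percent homology of two pre-aligned sequences.

    Identical, non-gap positions are counted and divided by the number of
    residues (gaps excluded) in the longer of the two proteins.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    longer = max(len(aligned_a.replace("-", "")), len(aligned_b.replace("-", "")))
    if longer == 0:
        raise ValueError("sequences contain no residues")
    return 100.0 * matches / longer


# Two integrin-binding motifs named in the text: 5 of 6 positions are identical.
print(round(percent_homology("GFPGER", "GLPGER"), 1))
# A shorter sequence aligned against a longer one: the denominator is the longer length.
print(percent_homology("GFPGER--", "GFPGERGV"))
```

Dividing by the longer sequence means that a short protein fully contained in a long one still scores well below 100%, which is the conservative behavior the stated definition implies.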

RESULTS AND DISCUSSION
The collagen map (Fig. 2) consists of the protein sequences of the human α1 and α2 chains of type I collagen, arranged as they are proposed to occur in the fibril, upon which positions of structural landmarks, ligand binding sites, and disease-associated mutations are superimposed.
Ligand Binding Site Distribution-When the distribution of ligand-binding sites was examined on the map, more were found on the C-terminal half of the collagen monomer (Fig. 2), as noted previously (9). Moreover, three concentrations or "hot spots" of ligand binding are again evident, designated as major ligand binding regions (MLBRs) (dashed boxes 1-3, Fig. 2). The potential functional implications of some of the overlapping ligand binding sites were discussed before (9). A previously unobserved and less conspicuous hot spot appears for the first time on the revised map on and around sequence glycine-phenylalanine-hydroxyproline-glycine-glutamic acid-arginine (GFPGER) 502-507 (Fig. 2, orange box), previously discovered as the predominant α1β1/α2β1/α11β1 integrin and cell binding sequence of type I collagen (29-31). Integrins are heterodimeric cell surface receptors that bind extracellular matrix molecules and are proposed to play roles in tissue morphogenesis, matrix assembly, and cell signaling (32,33). Studies using THPs and molecular modeling approaches indicate that regions of the triple helix with high frequencies of ligand binding sites and functions do not have obvious correlations with local (at the scale of few triplets) characteristics in the conformation or stabilities of the triple helix (34). On the other hand, analysis of regional variations in the triple helix stability based on collagens with OI mutations identified two large flexible triple helical regions that aligned with zones proposed important for fibrillogenesis and ligand binding, which overlap with MLBR2 and perhaps MLBR1 (35). Putative binding sites for many of the major ligands exist on multiple monomers across the two-dimensional fibril map (Fig. 2); however, for the native collagen fibril it is not yet clear which of these sites are available for cell and ligand interactions.
FIGURE 2. Ligand binding site and mutation map of the human type I collagen fibril. Protein sequences of the triple helix are shown (GenBank™, α1(I) accession #NP_000079.2 and α2(I) NP_000080.2); proline and hydroxyproline are designated as P, the latter being the third position in glycine-X-Y sequences. Note that Ala-459, α2(I) results from a single nucleotide polymorphism observed in some patient samples in the non-redundant protein data base, whereas α2(I) Pro-459 appears in the NP_000080.2 sequence. Yet, the association of Ala to Pro-459 substitution mutations with intracranial aneurysms (26) justifies its inclusion here. Ligand binding regions are indicated by rectangular boxes adjacent to relevant sequences; ligand binding to type I procollagen or tropocollagen (gray), or to either α1(I) or α2(I) chains (unshaded). Ligand binding hot spots 1, 2, and 3 are enclosed by dashed-lined black boxes. Highlighted sequences include: the α1β1/α2β1 integrin binding site GFPGER 502-507 (maroon rectangle); MMP-1 cleavage site (green arrow); N- and C-terminal intermolecular cross-links (purple rectangles). Studies responsible for mapping many of the functional domains on collagen are cited in the first map publication (9) and in the supplemental materials. New sites on the revised map include endo180 (57)
Mutation Distribution-The phenotypic consequences of the most prevalent class of mutations on the map, those associated with OI, are in general proposed to arise from their disruption of collagen monomer folding or stability, not ligand-fibril interactions. OI is generally divided into four clinical types: type I (mild), II (lethal), III (severe), and IV (moderately severe) (5). The gradient model suggests that, because the helix folds in the C- to N-terminal direction, mutations near the C terminus affect collagen modification and assembly more significantly, overall resulting in a more severe phenotype (36). However, several exceptions to this pattern were observed (5); most relevant to the present study was the discovery that high concentrations of lethal OI mutations co-localize with cell and ligand-binding sites. For example, on the α2(I) chain clusters of lethal OI mutations tend to coincide with putative zones of PG-fibril interactions (Fig. 2, red brackets) (5). New observations from the collagen map extend these findings. Thus, the N-terminal site for intermolecular cross-linking is associated with several lethal OI mutations but is bracketed by mild mutations, and sites for α1β1/α2β1 ligation at GFPGER 502-507 and MMP-1 cleavage co-localize with mutation "silent zones" that may reflect embryonic lethality (see below) (Fig. 2). Finally, non-OI mutations also fail to follow an N- to C-terminal gradient of severity, yet cluster to several zones of the fibril D-period (supplemental Fig. S1).
Domain Model of the Collagen Fibril-Gross examination of the collagen map brings home a point previously reported, that much of the fibril is covered by PGs (2,8), structural macromolecules considered to be full-time binding partners of collagen fibrils in many tissues (Fig. 2, yellow, pink, and purple overlays). PGs are composed of core proteins to which are covalently linked one or more high molecular weight, anionic glycosaminoglycan chains. PGs bound to the fibril may thus be expected to strongly influence fibril ligation of other factors competing for overlapping or neighboring binding sites. On the other hand, GFPGER 502-507, potentially the most critical cell interaction sequence of the fibril (see below), along with the matrix metalloproteinase (MMP) 1, 2, and 13 cleavage sequence and several other key ligand binding sites, localizes to the "b1/b2" bands region where PGs are absent (Fig. 2, wide vertical nonshaded region). These initial observations suggested the collagen fibril to be organized into two domains, one mediating dynamic cell-collagen interactions, and the other carrying out structural duties. Analysis of the map and microfibril structure model provided further support for this hypothesis, as detailed below.
Cell Interaction Domain-Cell-associated molecules proposed to bind type I collagen include the integrins, PGs, discoidin domain receptors, and molecules bridging cell surfaces and the extracellular matrix, such as fibronectin and SPARC (secreted protein, acidic, and rich in cysteine) (Fig. 2). However, evidence suggests integrin binding to be the most crucial determinant in cell-collagen interactions. Integrins that bind type I collagen include the α1β1, α2β1, α11β1, and αVβ3 receptors. Because the αVβ3 integrin is thought to only interact with degraded collagen, it will not be further considered. Candidate α1β1/α2β1/α11β1 integrin binding sequences on type I collagen have been mapped to residues 127-132, 502-507, and 811-816 (Fig. 2).
GFPGER 502-507: Ground Zero of Type I Collagen-Analysis of published data in the context of the map suggests a critical role for the central integrin binding site, GFPGER 502-507 (Figs. 2 (orange box) and 3A), according to four lines of evidence. First, this sequence binds α1β1/α2β1/α11β1 integrins and supports angiogenesis, endothelial activation, osteoblast differentiation, and cell adhesion. Second, it resides in the center of the largest "PG-clear zone" (Figs. 2 and 3A, wide vertical nonshaded region), ensuring its availability, and consistent with its dedicated function. Third, GFPGER 502-507 lacks reported mutations, and also coincides with zones containing contiguous stretches of three or more first position glycines that are silent for mutations on the α1 and α2 chains (Figs. 2 and 3A, blue boxes), and in the fibril on M1 and M5 (Fig. 3B, zones 2 and 7). Such zones may be so critical to collagen function that mutations occurring there are embryonically lethal (37). These data imply that, during embryogenesis, cell-collagen interactions using GFPGER 502-507 are crucial for fibril assembly, remodeling, or morphogenic processes such as angiogenesis. Moreover, considering the mutation density on the map, statistical analyses suggest that, although there is a high probability that the seven mutation silent zones may exist by random chance, it is very improbable they would align within such a narrow region of the D-period (see "Statistics" below), implying its special functional significance. Fourth, GFPGER 502-507 is also a near neighbor of MLBR2, the highest concentration of ligand-binding sites of the fibril, including the site for cleavage by MMP-1, -2, and -13 required for collagen remodeling (Fig. 4A) (10). This constellation of sites also falls largely within the PG-clear zone. Analysis of the x-ray diffraction structure of the microfibril (16) confirmed these key sequences to be near neighbors, only ~3-11 nm apart on the 67 nm wide D-period (Fig. 4B). Thus, on the collagen microfibril, elements of M3 and M4 within the b1/b2 bands region, including GFPGER 502-507, the MMP cleavage site, and MLBR2 are proposed to comprise a cell interaction domain where cell-collagen interactions and fibril remodeling are controlled by cell surfaces (Figs. 2 (orange box) and 4).
Two assumptions regarding the importance of GFPGER 502-507 to fibril function deserve further comment. First, GLPGER 127-132 and GASGER 811-816 in type I procollagen and in THP form also bind α1β1/α2β1 integrins (38). At the level of the fibril, these sequences may overlap with PG binding zones, and at the monomer and fibril levels do not coincide with large mutation silent zones on both the α1 and α2 chains, distinguishing them from GFPGER 502-507. Yet, the relative functions of GFPGER 502-507, GLPGER 127-132, and GASGER 811-816 in cell-collagen interactions must be resolved through further experimental investigation. A key consideration is whether these sites prove to be available for cell interactions in the native fibril. Second, a murine knockout of the α2β1 integrin receptor yielded mice with a largely normal phenotype (39), which seems at odds with the proposal made here that GFPGER 502-507 is crucial for cell-fibril interactions and embryogenesis. However, in the integrin null mouse, compensation for α2β1 function could potentially arise from the expression of other collagen-interactive receptors, including the α1β1 or α11β1 integrins.
Statistics-Statistical analysis was used to determine whether the localization of the seven mutation silent zones to a relatively narrow fibril region (Figs. 2, 3A, and 3B) is due to random chance. Thus, several properties related to the spatial distribution of first-position glycine (G1 site) mutations in collagen were investigated. Computing the p values of interest required the derivation of some novel combinatorial quantities and algorithms. A complete presentation of these analyses appears in the supplemental materials; the following is a summary of the findings. G1 site mutations do not occur in a spatially uniform way along the collagen monomer (p values 3.6 × 10⁻²⁷ and 4.3 × 10⁻¹⁷) under two reasonable models of spatial homogeneity. The adjacency patterns of mutation-free G1 sites, however, are quite consistent with the hypothesis that they have no tendency to cluster together or remain apart (p values all ~0.5).
On the other hand, several long sequences of adjacent mutation-free G1 sites are arranged within a narrow vertical region of the D-period. The probability of their falling in this region, under a uniform random placement model, is 4.6 × 10⁻⁴. The probability of their falling in any such region throughout the entire D-period is no more than 0.0025. Thus, the apparent clustering of mutation-free zones on the fibril likely did not arise randomly.
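To give a feel for the two questions posed above (falling in one particular narrow region versus falling in any such region), the sketch below works through a deliberately simplified stand-in for a uniform random placement model. It is not the combinatorial analysis used in this study, which appears in the supplemental materials, and it is not expected to reproduce the reported p values. Each zone is reduced to a single point placed uniformly and independently on a D-period of unit length; the window width `w`, the zone count `n`, and all function names are hypothetical inputs chosen for illustration.

```python
import random


def p_fixed_window(n: int, w: float) -> float:
    """Probability that n uniform points all land in one pre-specified window of width w."""
    return w ** n


def p_any_window(n: int, w: float) -> float:
    """Probability that n uniform points on [0, 1] span a range of at most w.

    Uses the standard result for the range of n uniform points:
    P(range <= w) = n * w**(n - 1) - (n - 1) * w**n, for 0 <= w <= 1.
    """
    return n * w ** (n - 1) - (n - 1) * w ** n


def simulate_any_window(n: int, w: float, trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of p_any_window, as a cross-check on the closed form."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        points = [rng.random() for _ in range(n)]
        if max(points) - min(points) <= w:
            hits += 1
    return hits / trials


n_zones, width = 7, 0.2  # hypothetical: seven zones, window of one fifth of a D-period
print(p_fixed_window(n_zones, width))
print(p_any_window(n_zones, width))
print(simulate_any_window(n_zones, width))
```

The "any window" probability is always at least as large as the "fixed window" probability, mirroring the relationship between the two probabilities reported in the text, because it allows the narrow region to sit anywhere along the D-period.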
Collagen Glycation and Ligand-Collagen Interactions-Glycation is the non-enzymatic addition of sugar to protein that occurs during aging and in an accelerated fashion in diabetes (40,41). Thus, glucose reacts with the ε-amino group of a free lysine residue to generate a Schiff-base intermediate, followed by its rearrangement to the more stable Amadori product. Through subsequent reactions and modifications, fructosyl-lysines may then form cross-links with other proteins, forming advanced glycation end products. Collagen thus modified may become less flexible and exhibit altered cell- and ligand-collagen interactions and fibrillar structure (41-45). Glycation occurs on numerous collagen residues but preferentially on hydroxylysines α1(I) 434, and α2(I) 453, 479, and 924 (Fig. 2), and the "c1" band fibril region (Figs. 2 and 3A, blue stripe). Because glycated collagen is a poor substrate for cell interactions, it is notable that lysine 479 and GFPGER 502-507 are near neighbors on the two-dimensional fibril map (Fig. 2). To examine this relationship, a three-dimensional model was built of the triple helix spanning residues 475-510, with GFPGER 502-507 ligated to the α2β1 integrin-I-domain (Fig. 3C). The lysine 479 fructosyl adduct was found to project outwards from the triple helix, poised to affect collagen-ligand interactions. Further, whereas glycation would likely not directly interfere with integrin-I-domain binding, interactions of the integrin heterodimer may potentially be affected. Examining cell and integrin interactions with native and glycated THPs, including sequences spanning the integrin binding site and lysine 479, could test such hypotheses. Lastly, it was reported that the fibril binding zone for keratan sulfate PGs, including the "c" bands, overlaps with all of the predominant glycation sites, in contrast to that of dermatan sulfate PGs that localize to the "d" and "e" bands (9). Consistent with this observation, it was reported that keratan sulfate PGs, but not dermatan sulfate PGs, exhibit reduced affinity for glycated collagen (46). Together these observations justify further investigations into the consequences of glycation to cell- and ligand-collagen interactions.
FIGURE 3. ... (Figs. 2 and 3B). GFPGER 502-507 neighbors the major region of non-enzymatic glycation of the fibril (blue stripe) and the glycation substrate, lysine 479 (Fig. 3C). B, GFPGER 502-507 and mutation silent zones localize to a narrow region of the fibril surface. GFPGER 502-507 (maroon); mutation silent zones numbered sequentially from N to C termini (blue). C, glycation may influence integrin-collagen interactions. Model of α2 integrin subunit I-domain (purple) binding to GFPGER 502-507 adjacent to α2(I) fructosyl-lysine 479 (sugar). Note that glycation may potentially affect collagen ligation by the integrin heterodimer (gray arrow), but likely not the I-domain. On collagen: α1 GFPGER 502-507 (yellow); mutation silent zone (blue); glycine residues associated with lethal OI mutations (red). Gray arrow, 10 nm is the maximal estimated reach of the integrin heterodimer.
Matrix Interaction Domain-The map suggests that, aside from the cell interaction domain, the remainder of the D-period comprises a substrate for the binding of structural molecules (Figs. 2 and 4), which is supported by various lines of evidence. Thus, dermatan sulfate PGs occupy the e and d bands regions, keratan sulfate PGs fall within the a and c bands regions, and heparin binds the a bands region. Moreover, intermolecular cross-links in type I collagen exist between residues 87 and 930, and between type I collagen at residues 1023-1036 and type V collagen in the a and c band regions. Notably, the N- and C-terminal cross-links closely bracket the cell interaction domain, potentially stabilizing its complement of key sites, or otherwise influencing its function or availability. Next, the gap or "hole zone" and its associated proteins are proposed to contain the nucleation sites upon which hydroxyapatite crystal growth during bone mineralization occurs. Last, several domains important for collagen fibrillogenesis map to the a bands region. Thus, the overlap zone and lateral borders of the gap zone may comprise a "matrix interaction domain" where organic and inorganic matrix molecules bind to connect the fibril with other extracellular matrix elements, and to impart distinct structural properties to collagen fibrils and the tissue stroma in which they reside (Figs. 2 (black brackets) and 4).
Non-overlapping Fibril Domains-Integrins and MMPs (<5 nm wide) may be envisioned to interact with their binding sites in the ~20 nm wide cell interaction domain and thus mainly influence ligands binding adjacent sequences. However, the impact of fibril-bound PGs on cell-collagen interactions may be more considerable. For example, the decorin core protein of ~40 kDa may contain a glycosaminoglycan chain averaging 40 kDa (47). The core protein is proposed to bind collagen monomer(s) in the d/e bands region. However, if its glycosaminoglycan chain is unrestrained, it could reach 80 nm or more, potentially disrupting cell- and ligand-collagen interactions at distant sites. Instead, it is proposed that the anionic glycosaminoglycans are mostly constrained within the matrix interaction domain via their binding to the transverse electropositive bands on the fibril surface, adjacent to where the PG core protein binds (Figs. 1, 2, and 5). Thus, the fibril is viewed to consist of non-overlapping cell and matrix interaction domains (Figs. 4 and 5). Nonetheless, some cross-talk between domains may be likely, e.g. if glycation affects integrin-GFPGER 502-507 ligation, or if matricellular proteins like SPARC with broad fibril-binding footprints influence the function of both domains simultaneously (Figs. 2 and 4).
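The reach of an unrestrained glycosaminoglycan chain can be rationalized with a back-of-the-envelope estimate. The sketch below assumes, purely for illustration, roughly 0.5 kDa per disaccharide repeat and roughly 1 nm of extended length per repeat; neither value is stated in the text, and the function name is hypothetical. The 67 nm D-period and ~20 nm cell interaction domain width are taken from the text.

```python
# Rough contour-length estimate for a fully extended glycosaminoglycan chain.
# ASSUMPTIONS (not from the text): ~0.5 kDa per disaccharide repeat and
# ~1 nm of extended length per repeat. Real chains are flexible and shorter
# end to end, so this is an upper-bound style estimate only.

D_PERIOD_NM = 67.0            # collagen D-period (from the text)
CELL_DOMAIN_WIDTH_NM = 20.0   # approximate cell interaction domain width (from the text)


def gag_contour_length_nm(chain_mass_kda, disaccharide_mass_kda=0.5, rise_nm=1.0):
    """Return the extended length (nm) of a chain of the given mass."""
    n_repeats = chain_mass_kda / disaccharide_mass_kda
    return n_repeats * rise_nm


chain_length_nm = gag_contour_length_nm(40.0)
print(f"extended chain length: {chain_length_nm:.0f} nm")
print(f"D-periods spanned:     {chain_length_nm / D_PERIOD_NM:.2f}")
print(f"cell domains spanned:  {chain_length_nm / CELL_DOMAIN_WIDTH_NM:.1f}")
```

Under these assumptions a 40 kDa chain is longer than one D-period, which is why an unrestrained chain could in principle reach a cell interaction domain from its anchoring site in the d/e bands region.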
That numerous functional sites of the type I collagen molecule fall into either of two fibril domains according to their function and that a preferred glycation substrate, lysine 479, coincides with the major fibril glycation zone support the accuracy of the overlap model of collagen fibril structure (48) upon which the map is based. Lastly, FACIT (fibril-associated collagens with interrupted triple helices) collagens (49) are proposed to interact with type I or II fibrils to influence inter-monomer associations within the fibril or inter-fibril dynamics (50). Whether FACIT collagens modulate the function of the cell and matrix interaction domains proposed here must await further understanding of the physical nature of FACIT-fibril interactions.

lateral edges of the fibronectin binding site (blue); keratan sulfate PG binding sites (yellow); dermatan sulfate PG binding sites (pink); N- and C-terminal intermolecular cross-links (black and blue arrows), respectively. The SPARC binding site (black) was rendered transparent where it overlaps the fibronectin and MMP binding/cleavage sites. The three-dimensional image was derived from an x-ray diffraction structure of the collagen microfibril in situ. The microfibrillar aspect that faces toward the outside of the fibril is at the top of the picture (the C terminus points outside of the fibril, while the N terminus faces inside the fibril). Distances between GFPGER 502-507 and domains for MMP interaction, or the N-terminal end of the fibronectin-binding domain were 11.2 and 5.0 nm, respectively. A view of microfibril function in the context of the collagen fibril appears in Fig. 5.
Domains in Other Fibrillar Collagens-The best characterized elements of the cell and matrix interaction domains of the human fibril as proposed here include sites for integrin binding, GFPGER 502-507; MMP-1 cleavage, GPQGIA; and N- and C-terminal intermolecular cross-linking, GMKGHR and GIKGHR, respectively (Fig. 2). These elements were searched for in other fibrillar collagens (data not shown). The sequences, and their positions in the triple helices relative to the cross-link sites, are identical or highly homologous in type I collagens of other vertebrates and in the human type II/XI collagen heterotypic fibril predominant in cartilage. Thus, it may be speculated that vertebrate fibrillar collagens share the same domain structure.
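A search of this kind can be sketched as an exact-match scan for the four motifs, followed by reporting each hit relative to the N-terminal cross-link site. The sketch below is not the authors' search procedure, which is not described; the function names are hypothetical, exact matching would miss the "highly homologous" variants mentioned above, and the input is a made-up toy string rather than a real collagen sequence.

```python
# Minimal exact-match motif scan. The motif strings come from the text;
# everything else (function names, toy sequence) is illustrative only.

MOTIFS = {
    "integrin binding": "GFPGER",
    "MMP-1 cleavage": "GPQGIA",
    "N-terminal cross-link": "GMKGHR",
    "C-terminal cross-link": "GIKGHR",
}


def find_motifs(seq, motifs=MOTIFS):
    """Return {label: [1-based start positions]} for every motif in seq."""
    hits = {}
    for label, motif in motifs.items():
        positions = []
        start = seq.find(motif)
        while start != -1:
            positions.append(start + 1)          # convert to 1-based numbering
            start = seq.find(motif, start + 1)   # allow overlapping hits
        hits[label] = positions
    return hits


def offsets_from(hits, reference_label):
    """Express every hit as an offset from the first hit of a reference motif."""
    ref = hits[reference_label][0]
    return {label: [p - ref for p in pos] for label, pos in hits.items()}


# Toy sequence, NOT a real collagen: motifs separated by Gly-Pro-Pro repeats.
toy = "GMKGHR" + "GPP" * 4 + "GFPGER" + "GPP" * 3 + "GPQGIA" + "GPP" * 2 + "GIKGHR"

hits = find_motifs(toy)
print(hits)
print(offsets_from(hits, "N-terminal cross-link"))
```

Comparing the offset dictionaries obtained from two aligned sequences would show whether the motifs keep the same spacing relative to the cross-link sites, which is the property the text reports as conserved.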
Domains in Heterotypic Fibrillar Collagens-The type I collagen heterotypic fibril often contains type III and V collagens, in which some of the domain elements are altered in ways predicted to be functionally significant (51). It thus is proposed that the physical and cell-interactive properties of heterotypic fibrils may be modulated according to their contents of minor collagens carrying variants of one or more key sequences. For example, type V collagen, which lacks both an active integrin binding site at the appropriate position and an MMP-1 cleavage site, when incorporated in the type I fibril, may hinder cell interactions and fibril remodeling in tissues where low rates of collagen metabolism are desirable, such as the cornea.
Domains and the Polarity of Fibril Assembly-That type I and type II collagens and their fibrillar collagen-binding partners may share the same domain structure implies they assemble in parallel into the heterotypic fibril, where key functional domains fall into register. That atypical mutations in type I and III collagens appear at disparate regions of their triple helical domains, yet cluster to common zones of the fibril, supports this contention (supplemental Fig. S1). However, although collagen monomer assembly into fibrils in vivo occurs most commonly in a head-to-tail orientation, head-to-head or tail-to-tail assembly is also seen and plays a role in tissue morphogenesis and repair (52). It is thus notable that GFPGER 502-507, proposed here to be the most critical sequence of type I collagen, is located midway between the N- and C-terminal cross-links. Therefore, whether monomers assemble in a unipolar or bipolar direction into fibrils, GFPGER 502-507 remains accessible for cell interactions.
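The "midway" claim can be checked with simple arithmetic on the residue numbers given in the text (cross-links at residues 87 and 930, GFPGER at 502-507). The sketch below treats residue numbers as positions on a straight line, which is a simplifying assumption that ignores the three-dimensional packing of the microfibril; the variable names are illustrative.

```python
# Where does GFPGER 502-507 sit between the two cross-link residues?
# Residue numbers are from the text; treating them as linear coordinates
# is an ASSUMPTION made only for this illustration.

N_CROSSLINK = 87
C_CROSSLINK = 930
GFPGER = (502, 507)

midpoint = (N_CROSSLINK + C_CROSSLINK) / 2    # centre of the cross-link span
motif_center = sum(GFPGER) / 2                # centre of the integrin site
span = C_CROSSLINK - N_CROSSLINK

dist_to_n = motif_center - N_CROSSLINK        # residues from the N-terminal cross-link
dist_to_c = C_CROSSLINK - motif_center        # residues from the C-terminal cross-link

print(f"midpoint of cross-link span: {midpoint}")
print(f"centre of GFPGER:            {motif_center}")
print(f"offset from midpoint:        {motif_center - midpoint} residues "
      f"({abs(motif_center - midpoint) / span:.1%} of the span)")
print(f"distance to N / C cross-link: {dist_to_n} / {dist_to_c} residues")
```

Because the two distances differ by only a few residues, reversing the polarity of a monomer would leave the integrin site at nearly the same position relative to the nearest cross-link, consistent with the argument above.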
FIGURE 5. Domain regulation of collagen fibril function. The collagen fibril is proposed to be composed of cell interaction domains (unshaded) and matrix interaction domains (yellow and pink shading). In the cell interaction domain integrin binding sites promote cell adhesion and integrin receptor ligation, clustering, and activation. The confluence of crucial cell and ligand-binding sites on the fibril surface implies their coordinate regulation. Thus, fibril-bound integrins may modulate other ligand-collagen interactions, and MMP cleavage of one monomer may enhance or disrupt ligand-fibril interactions elsewhere. In the matrix interaction domain, PGs and other matrix molecules bind to impart key structural properties to the fibril. Note: the fibril shown in this schematic includes ten microfibrils and spans two and a half, 67 nm wide collagen D-periods; however, the average collagen fibril in vivo contains many more microfibrils than depicted here and may reach hundreds of microns in length (52).

Regulation of Cell-Fibril Interactions-The domain structure of the collagen microfibril was next considered in the context of the multivalent collagen fibril (Fig. 1). Although the fine structure of the collagen fibril surface remains incompletely understood, our following speculations assume that, at least in some cases, ligand-binding sites may align between microfibrils, as has been observed for some collagen-binding ligands (8,53), although there are exceptions (54). Thus, in the fibril it is proposed that cell interaction domains appear every 5-6 nm across the fibril's width, and at 67 nm intervals along its length (Fig. 5). Assuming integrin heterodimers to be ≤10 nm in diameter (55,56) and collagen microfibril diameters ~5-6 nm (10), GFPGER 502-507-bound integrins are proposed to cluster in stripes perpendicular to the long axis of the fibril, with one integrin bound per one or two microfibrils.
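The proposed stripe geometry can be illustrated with a toy lattice. The sketch below uses the dimensions of the Fig. 5 schematic (ten microfibrils, two and a half 67 nm D-periods) and the size ranges quoted above; the function names, the choice of 5.5 nm as a representative microfibril diameter, and the placement of the first stripe at the start of the fibril are assumptions made only for illustration.

```python
import math

# Toy lattice of integrin-binding stripes on a fibril surface.
# Dimensions come from the text; the packing rule (an integrin blocks as many
# adjacent microfibrils as its diameter covers) is a simplifying ASSUMPTION.

D_PERIOD_NM = 67.0


def integrins_per_stripe(n_microfibrils, microfibril_nm, integrin_nm):
    """Lateral centre positions (nm) of integrins packed along one stripe."""
    step = max(1, math.ceil(integrin_nm / microfibril_nm))  # microfibrils per integrin
    return [(i + 0.5) * microfibril_nm for i in range(0, n_microfibrils, step)]


def stripe_positions(fibril_length_nm, d_period_nm=D_PERIOD_NM):
    """Axial positions (nm) of stripes, one per D-period, first stripe at 0."""
    n_stripes = int(fibril_length_nm // d_period_nm) + 1
    return [k * d_period_nm for k in range(n_stripes)]


lateral = integrins_per_stripe(n_microfibrils=10, microfibril_nm=5.5, integrin_nm=10.0)
axial = stripe_positions(fibril_length_nm=2.5 * D_PERIOD_NM)

print(f"integrins per stripe: {len(lateral)} at {lateral} nm")
print(f"stripes along fibril: {len(axial)} at {axial} nm")
print(f"total integrins:      {len(lateral) * len(axial)}")
```

With a 10 nm integrin the rule yields one integrin per two microfibrils; shrinking `integrin_nm` to 5 nm or less yields one per microfibril, matching the "one or two microfibrils" range in the text.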
According to this model, the fibril is an ideal substrate for integrin clustering, considered a key component of receptor activation and signaling (51,56). Other collagen-binding ligands of ~5-10 nm in diameter may potentially interact with one or more cell interaction domains, each occupying areas of ~5 × 20 nm on the fibril. It follows that the cell interaction domain may accommodate one, or at most several, ligands simultaneously, implying that cell- and ligand-binding interactions may be coordinately regulated. Therefore, it is speculated that integrins may positively or negatively modulate the interactions of other ligands with the fibril, or vice versa. Furthermore, MMP cleavage of one monomer may destabilize the fibril, displacing ligands, including integrins, from neighboring monomers, or, conversely, could facilitate cell-fibril interactions (11) (not shown).
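The statement that a domain holds "one, or at most several" ligands follows from the quoted sizes. The sketch below is a crude footprint count, assuming circular ligands laid side by side along the 20 nm axis and assuming that a ligand wider than one 5 nm domain is shared among the neighboring domains it overlaps; the function name and this sharing rule are illustrative, not part of the model in the text.

```python
import math

# Crude occupancy estimate for one ~5 x 20 nm cell interaction domain.
# Domain and ligand sizes are from the text; the counting rule is an ASSUMPTION.


def ligands_per_domain(ligand_nm, domain_width_nm=5.0, domain_length_nm=20.0):
    """Average number of ligands of a given diameter held by one domain."""
    along = math.floor(domain_length_nm / ligand_nm)   # ligands fitting end to end
    across = math.ceil(ligand_nm / domain_width_nm)    # neighbouring domains one ligand covers
    return along / across


for diameter in (5.0, 7.5, 10.0):
    print(f"{diameter:>4} nm ligand: {ligands_per_domain(diameter):.1f} per domain")
```

Under this rule the count falls from four to one as the ligand diameter grows from 5 to 10 nm, so bound ligands would necessarily compete for or share the same small surface.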
In summary, analysis of the distribution of functional sites and mutations on a two-dimensional model of the type I collagen fibril, and on an x-ray diffraction structure of the microfibril in situ, has identified candidate cell and matrix interaction domains. Moreover, such fibril domains may exist in heterotypic type I fibrils and other vertebrate fibrillar collagens. Defining the fine structure of the fibril domains and their functional inter-relationships may open new avenues for the therapeutic modulation of collagen function and metabolism in diseases where the protein plays a prominent role, including fibrosis, atherosclerosis, and OI, and for the engineering of fibrillar collagens and synthetic polymers for numerous applications in human medicine.