The Human COL11A2 Gene Structure Indicates that the Gene Has Not Evolved with the Genes for the Major Fibrillar Collagens*

The human COL11A2 gene was analyzed from two overlapping cosmid clones that were previously isolated in the course of searching the human major histocompatibility region (Janatipour, M., Naumov, Y., Ando, A., Sugimura, K., Okamoto, N., Tsuji, K., Abe, K., and Inoko, H. (1992) Immunogenetics 35, 272–278). Nucleotide sequencing defined over 28,000 base pairs of the gene. It was shown to contain 66 exons. As with most genes for fibrillar collagens, the first intron was among the largest, and the introns at the 5′-end of the gene were in general larger than the introns at the 3′-end. Analysis of the exons coding for the major triple helical domain indicated that the gene structure had not evolved with the genes for the major fibrillar collagens in that there were marked differences in the number of exons, the exon sizes, and codon usage. The gene was located close to the gene for the retinoic X receptor β in a head-to-tail arrangement similar to that previously seen with the two mouse genes (P. Vandenberg and D. J. Prockop, submitted for publication). Also, there was marked interspecies homology in the intergenic sequences. The amino acid sequences and the pattern of charged amino acids in the major triple helix of the α2(XI) chain suggested that the chain can be incorporated into the same molecule as α1(XI) and α1(V) chains but

Over 19 types of collagens are known, each with an apparently unique biological function (1-3). A major subclass is the fibrillar collagens that form ordered extracellular fibrils and that include type I, type II, type III, type V, and type XI collagens. Type I and type III collagens are found in most non-cartilaginous tissues. Type II is found primarily in cartilage where it is the most abundant protein, but it is also present in the vitreous humor and several other tissues in early embryonic development. Type XI collagen was originally recognized as a minor fibrillar collagen in cartilage that was similar to type II collagen. The protein was considered to consist of three α chains referred to as α1(XI), α2(XI), and α3(XI) (4-6). The α3(XI) chain was subsequently shown to be derived from the same gene as the α1(II) chain of type II collagen that, by an unknown mechanism, was assembled with the α1(XI) and α2(XI) chain to form a unique procollagen molecule (4-9). In further analyses, type XI collagen was found to be closely related in structure to type V collagen, and both type V and type XI collagens were found in small amounts in a variety of cartilaginous and non-cartilaginous connective tissues (10,11). Amino acid sequencing of fragments of collagen fibrils from mammalian vitreous humor demonstrated that fibrils were assembled from molecules containing α1(XI) and α2(V) chains (12). Also, during the maturation of articular cartilage, isolated fractions rich in type XI collagen were found to contain an increasing proportion of the α1(V) chain and a decreasing proportion of α1(XI) chains (7). These observations and others (13-17) led to the suggestion (2) that type V and type XI collagens are a heterogeneous class of collagens comprised of five or six different α chains, i.e. the α1(V), α2(V), α3(V), α1(XI), and α2(XI) together with the α3(XI) chain, which apparently has the same primary structure as the α1(II) chain.
Here, we report the complete structure of the human gene for the α2(XI) chain of type XI collagen. The results demonstrate that (a) the gene structure did not evolve with the genes for the major fibrillar collagens in that there were marked differences in the number of exons, the exon sizes, and the codon usage (1, 3); (b) the gene was located close to the gene for the retinoic X receptor β in a head-to-tail arrangement similar to that seen previously with the two mouse genes, and there was marked interspecies homology in the intergenic sequences;¹ (c) the amino acid sequences and the pattern of charged amino acids in the major triple helix of both the α1(XI) chain (11) and α2(XI) chain (18,19) as well as the α1(V) chain (20) differed from the charged amino acid pattern seen with the α3(XI)/α1(II) chain (21), an observation suggesting that the α3(XI)/α1(II) chain is not incorporated into the same molecule; and (d) the structure of the C-propeptide² was similar to the C-propeptides of the proα1(XI) chain and proα chains of other fibrillar collagens, but it was shorter because of internal deletions of about 30 amino acids.

Characterization of Cosmid Clones for the Human COL11A2 Gene-The human COL11A2 gene was obtained from two overlapping cosmid clones, 505-1 and 515, that were isolated in the course of searching the human major histocompatibility complex region (22). The presence of the COL11A2 gene in the clones was confirmed by Southern blot hybridization with mouse and human probes for the same gene (18,19).¹
Nucleotide Sequencing of the Cosmid Clones-Nucleotide sequencing was carried out either directly with the cosmid clones or with the subclones in a plasmid vector (pBluescript, Stratagene) using the dideoxynucleotide sequencing method (23) and modified T7 DNA polymerase (Sequenase 2.0, U. S. Biochemical Corp.). Some of the nucleotide sequencing was carried out by cycle sequencing of PCR products (dsDNA Cycle Sequencing System, Life Technologies, Inc.). For cycle sequencing, the 5′-primer was labeled with T4 polynucleotide kinase (U. S. Biochemical Corp.) and [γ-³³P]ATP (DuPont NEN). To isolate cosmids or plasmid subclones for sequencing, alkaline lysis and CsCl centrifugation were used (24). Alternatively, the cosmids or plasmids were purified by alkaline lysis and adsorption to a commercial column of coated silica gel (Qiagen Plasmid Maxi kit, Qiagen Inc.). The RNaseI treatment step was omitted.
To prepare templates for sequencing, 3-5 μg of a cosmid clone or plasmid subclone in 10 mM Tris-HCl buffer (pH 8) and 1 mM EDTA was denatured with 0.2 M NaOH for 5 min at room temperature in a total volume of 20 μl. The sample was precipitated by adding 2 μl of 5 M NaCl and 55 μl of 100% ethanol and then incubating the sample on dry ice for 10 min. The sample was centrifuged for 10 min, and the pellet was washed with 200 μl of 70% ethanol, dried, and dissolved in 7 μl of distilled water. For annealing, 25 ng of a 17- or 18-mer primer was used in a reaction volume of 10 μl. The samples were annealed at 37°C for over 30 min. For the initial DNA sequencing, primers were designed based on published sequences of the COL11A2 gene by Kimura et al. (18) and Zhidkova et al. (19). Primers for the retinoic X receptor β gene were designed from the published sequences by Fleischhauer et al. (25) and Epplen and Epplen (26). Additional primers were developed based on the sequences derived during the progress of the work. Nucleotide sequence analysis was carried out using the Wisconsin Sequence Analysis Package (GCG) Version 8.0-UNIX (Genetics Computer Group).
3′-RACE Analysis-To define the 3′-end of the gene, a reverse transcriptase-PCR reaction was carried out. About 0.4 μg of total RNA from human fetal cartilage (21) was reverse transcribed (GeneAmp RNA PCR kit, Perkin Elmer Corp.) using an oligo(dT) primer linked to a random sequence, 5′-GACTGATCAGCGAATTCTACGTCGC(T)20. The single-stranded cDNA was amplified with three sequential PCRs with nested forward primers designed to hybridize to the 3′-end of the cDNA, i.e. TP-1 (5′-GGTGTGGTCCAGCTCACCTTC), TP-2 (5′-GAGCTGAGCCCGGAGACTA), and TP-3 (5′-GAATTCAGAGATGGCTGCCAG). The reverse primer for each PCR was identical with the random sequence at the 5′-end of the oligo(dT) primer. The first PCR was in a volume of 25 μl with 10 pmol of reverse primer and 10 pmol of TP-1 at 94.5°C for 55 s, 62°C for 45 s, and 72°C for 90 s for 40 cycles. About 2 μl of the 25 μl were taken for a second PCR with 20 pmol of reverse primer and 20 pmol of TP-2 in a final volume of 50 μl. The conditions were 94.5°C for 55 s, 58°C for 45 s, and 72°C for 90 s for 35 cycles. The products from the second PCR were separated by agarose gel electrophoresis, and a gel plug was used for a third PCR with 20 pmol of reverse primer and 20 pmol of TP-3 in a reaction volume of 50 μl. The conditions were 94.5°C for 55 s, 62°C for 45 s, and 72°C for 90 s for 30 cycles. The products of the third PCR were separated by agarose gel electrophoresis and extracted with a commercial resin (Qiaex Extraction Kit; Qiagen, Inc.). The extracted DNA was cloned into a plasmid vector (pT7 Blue T-Vector kit, Novagen). DNA from positive clones was sequenced using primers designed to the vector sequences (T7 Promoter Primer and −40 Primer, U. S. Biochemical Corp.).
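The three nested forward primers can be compared with a short calculation. The sketch below is illustrative only and is not part of the original analysis: it reports the length, G+C fraction, and a rule-of-thumb melting temperature for each primer. The function names are hypothetical, the Wallace rule (2°C per A or T, 4°C per G or C) is a rough approximation swapped in here for illustration, and TP-2 is taken as a continuous 19-mer on the assumption that any hyphen within it is a line-break artifact.

```python
# Nested forward primers as quoted in the text (5'->3').
# Assumption: TP-2 is a continuous 19-mer (no internal hyphen).
PRIMERS = {
    "TP-1": "GGTGTGGTCCAGCTCACCTTC",
    "TP-2": "GAGCTGAGCCCGGAGACTA",
    "TP-3": "GAATTCAGAGATGGCTGCCAG",
}


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the primer."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq: str) -> int:
    """Rule-of-thumb Tm in degrees C: 2*(A+T) + 4*(G+C).

    This is a rough estimate for short oligonucleotides, used here
    as an assumption; the paper does not state how annealing
    temperatures were chosen.
    """
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def primer_summary(primers: dict) -> dict:
    """Map primer name -> (length, G+C fraction, Wallace Tm)."""
    return {
        name: (len(seq), round(gc_fraction(seq), 2), wallace_tm(seq))
        for name, seq in primers.items()
    }


if __name__ == "__main__":
    for name, (length, gc, tm) in primer_summary(PRIMERS).items():
        print(f"{name}: {length} nt, GC={gc:.2f}, Tm~{tm} C")
```

Estimates of this kind are useful only for comparing primers with each other; salt-adjusted or nearest-neighbor models would give different absolute values.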

Characterization of Cosmid Clones-Two cosmid clones previously shown to contain the human COL11A2 gene (22) were characterized here by restriction mapping. Sites for selected restriction enzymes are shown in Fig. 1. Nucleotide sequencing of the human COL11A2 gene was carried out in part by direct sequencing of the cosmid clones and in part by sequencing of subclones in plasmids. A total of over 28,000 base pairs were defined. The gene was shown to contain 66 exons (Figs. 1 and 2). The data provided about 200 codons not previously obtained for the human gene (18,19), including the structures of exons 1, 2, 6, 8, and 66. As with most genes for fibrillar collagens (1,3), the first intron was among the largest. However, intron 8 was also large. Exons 6, 7, and 8 were recently shown to be alternatively spliced in mouse (27).
The 5′-End of the COL11A2 Gene-One of the cosmid clones (505-1) extended beyond the 5′-end of the COL11A2 gene. Therefore, it was searched for the presence of the retinoic X receptor β gene (25,26) that was found 5′ to the mouse COL11A2 gene.¹ The same head-to-tail arrangement of the two genes found in the mouse genome was also found in the human genome. Also, a large degree of identity was found between the mouse and the human intergenic sequences (Fig. 3). The distance from the potential poly(A) addition site of the retinoic X receptor β gene to the start of translation of the COL11A2 gene was 1,362 bp. The separation of the two genes was, therefore, essentially the same in the human and mouse genomes. Four MAZ sequences (28) were found at the 3′-end of the retinoic X receptor β gene. The sequences were largely conserved in the mouse.
The first 11 exons of the mouse COL11A2 gene were previously defined.¹ Comparison of the mouse gene with the human gene here established that the sizes of the first 11 exons were identical (Table I). Also, the sizes of the first 10 introns were similar.
The Major Triple Helical Domain-The junction exon separating the N-telopeptide from the major triple helical domain was exon 14, and the junction exon separating the 3′-end of the major triple helical domain from the C-telopeptide was exon 63 (Fig. 4). Therefore, the triple helical domain was encoded by 48 exons and part of the two junction exons. The exons for the major triple helical domain all began with a complete codon for glycine and, therefore, were similar to the exons for other fibrillar collagens (1, 3).
As indicated in Fig. 4, the coding region for the major triple helical domain of the α2(XI) chain had a large number of 54-bp exons. The remaining exons appeared to maintain a 54-bp motif in that seven exons were twice 54 bp (or 108 bp). Also, seven exons were 45 bp or the equivalent of 54 bp with a 9-bp deletion. These exons were similar in size, therefore, to many of the exons found in the major fibrillar collagens (1, 3). However, there was no exon of 162 bp as is found in each of the major fibrillar collagens. Also, there were no exons of 99 bp, the size of five exons in the COL2A1 gene. In addition, there were two exons of unusual size in that one exon was 90 bp (exon 40) and the other was 36 bp (exon 61). As indicated in Fig. 4, the number of exons and the pattern of exon sizes in the COL11A2 gene were different from those in the COL2A1 gene (29). Despite these differences, the number of amino acids for the major triple helical domain was 1,014, the same number as in major fibrillar collagens.
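The 54-bp motif described above can be expressed as a simple classification rule. The following sketch is an illustration under stated assumptions, not the procedure used in the paper: an exon is called "motif" if its size is a multiple of 54 bp, "motif with deletion" if it equals a multiple of 54 bp minus 9 bp, and "atypical" otherwise. The function names and category labels are hypothetical. Only the exon sizes mentioned in the text are used as inputs.

```python
def classify_exon(size_bp: int, unit: int = 54, deletion: int = 9) -> str:
    """Classify an exon size against the 54-bp motif.

    Assumed rule (for illustration): multiples of `unit` fit the
    motif; sizes equal to a multiple of `unit` minus `deletion`
    fit the motif with one 9-bp (single Gly-X-Y triplet) deletion.
    """
    if size_bp % unit == 0:
        return f"motif ({size_bp // unit} x {unit} bp)"
    if (size_bp + deletion) % unit == 0:
        copies = (size_bp + deletion) // unit
        return f"motif with deletion ({copies} x {unit} bp - {deletion} bp)"
    return "atypical"


def whole_triplets(size_bp: int) -> bool:
    """True if the exon encodes a whole number of Gly-X-Y triplets (9 bp each)."""
    return size_bp % 9 == 0


if __name__ == "__main__":
    # Exon sizes named in the text: 54, 108, and 45 bp (common in COL11A2),
    # 90 and 36 bp (exons 40 and 61), and 162 and 99 bp (absent in COL11A2).
    for size in (54, 108, 45, 90, 36, 162, 99):
        print(size, classify_exon(size), whole_triplets(size))
    # 1,014 amino acids correspond to 1,014 * 3 coding base pairs.
    print(1014 * 3)
```

Under this rule the 90-bp and 36-bp exons fall outside the motif even though both still encode whole Gly-X-Y triplets, which is consistent with the observation that every exon begins with a complete glycine codon.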
The differences between the COL11A2 gene and other genes for fibrillar collagens were also emphasized by comparison of the third base used in codons for glycine, proline, and alanine (Table II). The four bases were more uniformly used for glycine α2(XI), α1(XI), and α1(V) chains (Fig. 5). In contrast, there were marked differences between the amino acid sequences of these three chains and the sequences of the α3(XI)/α1(II) chain. In addition, it was apparent that the distribution of positively charged amino acids was essentially the same in the α2(XI), α1(XI), and α1(V) chains, but the pattern of positively charged amino acids in all three of these chains differed markedly from the pattern of the α1(II) chain (Fig. 6). Similar results were obtained when the patterns of negatively charged residues were compared (not shown).
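A comparison of third-base usage of the kind summarized in Table II can be sketched as a tally over an in-frame coding sequence. The code below is a minimal illustration, not the analysis software used in the paper (which relied on the GCG package). It uses the fact that glycine (GGN), proline (CCN), and alanine (GCN) are each encoded by a four-codon family defined by the first two bases. The function name and the short test sequence are hypothetical.

```python
from collections import Counter

# First two bases of the four-codon families for each amino acid.
CODON_PREFIXES = {"Gly": "GG", "Pro": "CC", "Ala": "GC"}


def third_base_usage(cds: str) -> dict:
    """Tally the third base of Gly, Pro, and Ala codons.

    `cds` is assumed to be an in-frame coding sequence starting at
    the first base of a codon; trailing partial codons are ignored.
    """
    cds = cds.upper()
    usage = {aa: Counter() for aa in CODON_PREFIXES}
    usable = len(cds) - len(cds) % 3
    for i in range(0, usable, 3):
        codon = cds[i:i + 3]
        for aa, prefix in CODON_PREFIXES.items():
            if codon.startswith(prefix):
                usage[aa][codon[2]] += 1
    return usage


if __name__ == "__main__":
    # Hypothetical toy sequence (not from COL11A2): Gly-Pro-Ala repeats.
    toy = "GGTCCTGCTGGACCCGCAGGCCCA"
    for aa, counts in third_base_usage(toy).items():
        print(aa, dict(counts))
```

Applied separately to the triple helical coding regions of different collagen genes, tallies of this form would allow the uniformity of base usage at the third position to be compared between genes.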
quences were found in all eight independent plasmid clones of PCR products that were analyzed. The 283 bp of 3′-nontranslated region included a potential poly(A) addition signal sequence of ATTTAA. The ATTTAA is an unusual poly(A) addition signal sequence but is homologous to signal sequences seen in other genes, e.g. the AATAAA, AATTAA, and ATTAAA found in the mouse opsin gene (33). Also, the ATTTAA sequence found here was 24 bp upstream of a carboxyl-terminal nucleotide at the 5′-end of the poly(A) tail, a typical arrangement of a poly(A) signal and an RNA cleavage site (34). The 3′-end of the cosmid clone 515 included some of the sequences of the 3′-nontranslated region. Therefore, a reverse primer to sequences in the vector was used to sequence the 3′-end of the cosmid insert. The results provided the coding sequences of the last exon (exon 66) and intron 65.
The Structure of the C-propeptide-The sequences encoding the C-propeptide indicated that the structure in the proα2(XI) chain was about 30 amino acids shorter than the C-propeptides for proα1(I), proα1(II), proα1(V), proα2(V), and proα1(XI) chains because of internal deletions (Fig. 8). One major internal deletion consisted of 23-27 amino acids compared to the other five chains. Another deletion consisted of 9 amino acids that were present in the C-propeptides of the other five chains. The deletion of 23-27 amino acids included a region that was homologous in the other C-propeptides. As previously noted (18), the carbohydrate attachment sequence of Asn-X-(Ser/Thr) found in the other fibrillar procollagens was not present in the C-propeptide for α2(XI) at the same site, but an additional potential carbohydrate site of Asn-Tyr-Thr was found more amino-terminal in the sequence of the α2(XI), α1(XI), and the α1(V) chains. Cysteine residues were conserved among the six C-propeptides except that the α2(V) chain lacked one of the eight cysteines and the α1(XI) chain lacked another.
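Potential carbohydrate attachment sites of the form Asn-X-(Ser/Thr) can be located in a peptide sequence with a short pattern search. The sketch below is illustrative and is not the method used in the paper. It assumes one-letter amino acid codes and, following a common convention that is not stated in the text, excludes proline at the X position. The function name and the test peptide are hypothetical.

```python
import re

# Asn-X-(Ser/Thr) in one-letter code. Excluding Pro at X is an assumption
# based on common convention. The lookahead allows overlapping sites
# to be reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_sequons(peptide: str) -> list:
    """Return (1-based position of Asn, tripeptide) for each potential site."""
    return [(m.start() + 1, m.group(1)) for m in SEQUON.finditer(peptide.upper())]


if __name__ == "__main__":
    # Hypothetical peptide (not a collagen sequence) containing
    # Asn-Tyr-Thr, Asn-Pro-Ser (excluded), and Asn-Gly-Ser.
    print(find_sequons("AGNYTKPNPSLNGSA"))
```

A site such as the Asn-Tyr-Thr described above would be reported as the tripeptide NYT together with the position of its asparagine, which makes it straightforward to compare site positions across aligned C-propeptides.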

DISCUSSION
Initial analyses of the genes for the major fibrillar collagens revealed a striking pattern in the exons encoding the major triple helical domains of the proteins (see Refs. 1 and 3). All the exons began with a complete codon for glycine. Also, the sizes had a 54-bp motif in that most exons were 54 bp or simple multiples thereof. In addition, the bases used for the third position in codons for glycine, proline, and alanine were similar among the genes. Therefore, the results suggested that the genes for the fibrillar collagens evolved from a 54-bp exon that was duplicated during evolution. The suggestion was supported by the further observation that these features of the exons were conserved among man, rodents, and chick. Also, with one exception, the exon sizes were conserved among the four genes for type I collagen, type II collagen, and type III collagen. Analysis of the genes for type IV collagen demonstrated a different pattern in that many of the exons began with split codons for glycine, and the exon sizes did not show a consistent 54-bp motif. The genes for other non-fibrillar collagens also varied in exon structures, but it was generally assumed that all fibrillar collagens maintained the same gene structure through long periods of evolution. The results here present the first complete structure of a gene for a minor fibrillar collagen. The exon structure does not fit the pattern found in the genes for the major fibrillar collagens. Also, the codon usage differs. Therefore, the results demonstrate that the COL11A2 gene has not evolved with the genes for the major fibrillar collagens. Previous reports demonstrated that the codon usage for the α1(V) chain of type V collagen differed from that of other fibrillar collagens (20). Also, the distribution of charged amino acids in the α2(V) chain appeared to differ from the distribution in the major fibrillar collagens (38). Therefore, the genes for α1(V) and α2(V) chains of type V collagen may also have evolved differently from the genes for the major fibrillar collagens.
Considerable evidence has suggested that the α1(XI) and α2(XI) chains are incorporated into the same procollagen molecules as the α3(XI)/α1(II) chain and the α1(V) chain (4-17). If this conclusion is correct, there should be a similarity in the amino acid sequences and in the distribution of charged amino acids among the α chains (see Ref. 32). The results here demonstrated that a high degree of homology in the sequences and in the pattern of charged amino acids was in fact found between the α2(XI) chain and the α1(XI) and α1(V) chains. Therefore, the results were consistent with the chemical data indicating that heterotrimeric molecules can be comprised of varying combinations of α1(XI), α2(XI), and type V collagen chains. However, there was relatively little conservation of sequence and of distribution of charged amino acids with the α3(XI)/α1(II) chain. Therefore, if the α3(XI)/α1(II) chain is incorporated into the same molecule as α1(XI) and α2(XI) chains, the resulting triple helical molecule must be far more heterogeneous in the distribution of amino acid side chains that are on the surface of the molecule and that direct fibril assembly than any better characterized molecule of a fibrillar collagen (see Ref. 32). Because of the differences among the α chains observed here, it appears that further documentation of the chemical structure of the molecules containing α3(XI)/α1(II) chains is now warranted.
The C-propeptide of the proα2(XI) chain showed a relatively high degree of homology with the C-propeptides of other fibrillar collagens (39). However, there was a series of internal deletions that made the chain shorter by about 30 amino acids. Apparently, the deleted 30 or so amino acids are not critical for directing chain association and chain selection.
Recent analyses on the mouse COL11A2 gene demonstrated that the gene was located at the 3′-end of the gene for the retinoic X receptor β and that the two genes were close together.¹ The results here demonstrate that a similar arrangement of the two genes is maintained in the human genome. In addition, there is a high degree of conservation of the intergenic sequences. Therefore, the results suggest that there may be some functional consequences of this unusual arrangement of genes. The presence of four MAZ sequences at the 3′-end of the retinoic X receptor β gene apparently prevents continuous transcription of the two genes, much as has been found in at least three other pairs of genes that are in a similar head-to-tail arrangement (see Ref. 28).