Isolation and structure of the COL4A6 gene encoding the human alpha 6(IV) collagen chain and comparison with other type IV collagen genes.

The genes COL4A5 and COL4A6, coding for the basement membrane collagen chains, α5(IV) and α6(IV), respectively, are located head-to-head in close proximity on human chromosome Xq22, and COL4A6 is transcribed from two alternative promoters in a tissue-specific fashion (Sugimoto, M., Oohashi, T., and Ninomiya, Y. (1994) Proc. Natl. Acad. Sci. U. S. A. 91, 11679-11683). Immunofluorescence studies using α chain-specific antibodies demonstrated that the two genes are expressed in a tissue-specific manner (Ninomiya, Y., Kagawa, M., Iyama, K., Naito, I., Kishiro, Y., Seyer, J. M., Sugimoto, M., Oohashi, T., and Sado, Y. (1995) J. Cell Biol. 130, 1219-1229). We report here for the first time the isolation and the structural organization of the human COL4A6 gene. The entire gene presumably exceeds 200 kilobase pairs and contains 46 exons. Exons 1′ and 1 encode the two different 5′-UTRs and the two amino-terminal parts of the signal peptide. The carboxyl part of the signal peptide and the 7 S domain are coded for by the following six exons, 2-7, whereas exons 7-42 encode the central COL 1 domain, which contains the Gly-X-Y repeats. The last three exons, 43-45, encode the carboxyl-terminal NC1 domain. More than half of the exons of the gene are the same size as those of Col4a2, but the exon sizes are quite different from those of COL4A5. Within the COL4A6 gene we found three CA repeat markers that can be used for allele detection. The detailed structure of the COL4A6 gene and the high-heterozygosity microsatellite markers located within the gene will be useful for linkage analysis and familial diagnosis of diseases caused by mutations of this gene.

Epithelial cells present a sheet-like extracellular structure, basement membrane (BM),¹ at their basal surfaces. This structure appears as a ~100-nm-thick layer, consisting of type IV collagen, laminin, heparan sulfate proteoglycan, and some other glycoproteins (1). Since BMs are attached to the underlying extracellular matrix by anchoring fibrils formed by type VII collagen and microfibrils, there may be functional interactions between matrix molecules. On the other hand, BMs also play a crucial role in adhesion of epithelial and other types of cells, differentiation of a variety of cells, and tissue repair (2). BMs are known to be affected in certain disease states. For example, the epitope for the circulating antibody for Goodpasture syndrome that causes glomerulonephritis and pulmonary hemorrhage has been shown to be on the carboxyl-terminal part of the NC1 domain of the α3(IV) collagen chain (3). Also, hereditary Alport syndrome, characterized by sensorineural hearing loss, progressive glomerulonephritis, and, occasionally, ocular defects, has been shown to be caused by mutations of the gene COL4A5, coding for the α5(IV) collagen chain (4).
Type IV collagen was initially thought to be a heterotrimeric molecule of two α1(IV) chains and one α2(IV) chain. More recently, biochemical investigation and molecular biological approaches have made possible the identification of four additional α(IV) chains, α3, α4, α5, and α6 (5-7). However, almost nothing is known about how these chains are involved in the assembly of the individual molecules. The newest chain, α6(IV), was found by characterizing the region upstream of the neighboring gene, COL4A5 (8), and by cross-hybridization with the homologous gene, COL4A4 (7). Of interest is not only that the two most recently discovered genes, COL4A5 and COL4A6, are arranged in a head-to-head fashion on chromosome Xq22 but also that the other set of "new" genes for collagen IV, COL4A3 and COL4A4, are located on chromosome 2 (9, 10), presumably in a similar arrangement (11). Furthermore, α1(IV) and α2(IV) chains are known to exist in all BMs, and α3(IV) and α4(IV) chains appear to co-localize in certain BMs but not all, whereas the α5(IV) and α6(IV) chains are not necessarily distributed together in some tissues (12). Differential expression of the latter two genes could be explained by transcription from the two alternative promoters in the COL4A6 gene (13). Precise characterization of the COL4A6 gene is essential for studies on the pathogenesis of certain diseases such as Alport's syndrome associated with diffuse esophageal leiomyomatosis (14). In the present study, we describe the isolation and characterization of the COL4A6 gene. The gene harbors 46 exons and spans at least 200 kb. We also determined the sequences of the exon/intron junctions of the entire gene.

MATERIALS AND METHODS
Isolation of Genomic Clones-Total EMBL3 genomic library (HL1111J) from human female leukocytes was purchased from Clontech Laboratories Inc. The library was screened with the previously isolated cDNAs encoding the human α6(IV) collagen chain (7). The screening of the library was performed by standard procedures (15). Thirteen purified genomic clones were characterized by genomic mapping, and necessary restriction fragments were subcloned into pBluescript II vectors for further characterization. Genomic DNA preparation was performed by the standard procedures (15).
DNA Sequencing-DNA sequencing was carried out by the dideoxynucleotide sequencing method (16, 17) using double-stranded DNAs as template. We used fluorescent dye-labeled terminators and the cycle sequencing method (Perkin Elmer) with a 373S automatic sequencer (Perkin Elmer). Universal primers were used for the sequencing reaction. In some cases we synthesized specific primers for sequencing. Specific oligonucleotides were synthesized in an Applied Biosystems DNA synthesizer, 394 (Central Research Laboratory, Okayama University Medical School).
* This work was supported in part by Monbusho International Science Research Program (Joint Research) 07044268 (to Y. N.); a Grant-in-aid for Encouragement of Young Scientists 06770869 (to T. O.), and a Grant-in-aid for scientific research 07671251 (to T. O.) from the Ministry of Education, Science, and Culture of Japan; Grant EY07334 from the National Eye Institute; a grant for the promotion of science from the CIBA-Geigy Foundation (Japan); and a grant from The Nakatomi Foundation. The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
The nucleotide sequence(s) reported in this paper has been submitted to the GenBank™/EMBL Data Bank with accession number(s) D63525 to D63568.
¶ To whom correspondence should be addressed. Tel.: 81-86-223-7151 (ext. 2390); Fax: 81-86-222-7768; E-mail: yoshinin@ceews2.cc.okayama-u.ac.jp.
¹ The abbreviations used are: BM, basement membrane; SSC, 0.15 M
Repetitive Sequences and Allele Detection-To find CA repeats within the COL4A6 gene, we screened five genomic clones, already isolated for the human α6(IV) chain, AF4, 17, 18, 30, and YU2, with a 32P-labeled (CA)9 probe (18). Restriction fragments containing CA repeats were subcloned into pBluescript II and sequenced by the dye-terminator method. Polymerase chain reaction primers were designed at the flanking regions of the CA repeats (Table I). Amplification conditions were as follows: incubation for 2 min at 94°C followed by 29 cycles of 1 min at 94°C, 2 min at 55°C, and 3 min at 72°C, with a final elongation step of 72°C for 7 min. 1-μl aliquots of samples were denatured at 95°C, electrophoresed on a 6% polyacrylamide sequencing gel, and autoradiographed overnight at −70°C. Allele sizes were determined by measuring the migration distance of the coelectrophoresed size markers.
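The amplification program described above can be written down as a small data structure, which makes the programmed hold time easy to check. This is only an illustrative sketch: the names `Step`, `pcr_program`, and `total_minutes` are hypothetical and not part of the original protocol, and ramp times between temperatures are ignored.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    """One hold of the thermal cycler (hypothetical helper type)."""
    temp_c: int
    minutes: int


def pcr_program(cycles: int = 29) -> List[Step]:
    """Build the program from the text: 2 min at 94 C, then `cycles`
    rounds of 94 C/1 min, 55 C/2 min, 72 C/3 min, then 72 C/7 min."""
    initial = [Step(94, 2)]
    one_cycle = [Step(94, 1), Step(55, 2), Step(72, 3)]
    final = [Step(72, 7)]
    return initial + one_cycle * cycles + final


def total_minutes(program: List[Step]) -> int:
    """Sum of programmed hold times only (no ramping, an assumption)."""
    return sum(step.minutes for step in program)


if __name__ == "__main__":
    program = pcr_program()
    print(len(program), "holds,", total_minutes(program), "min of hold time")
```

With the default of 29 cycles the hold times add up to 2 + 29 × 6 + 7 = 183 min.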

RESULTS AND DISCUSSION
Isolation of Genomic Clones-We have previously described (13) that COL4A5 and COL4A6 have a unique arrangement in that they are co-localized on chromosome Xq22 in a head-to-head fashion and appear to share a common bidirectional promoter. In addition, we reported a novel observation that the COL4A6 gene is transcribed from two alternative promoters in a tissue-specific manner (13). In the present study, we analyzed the structure of the COL4A6 gene by isolation and characterization of phage clones. We used previously isolated cDNAs, TM51, 46, 30, and 29, to screen the genomic library. We isolated 50 clones altogether, but only the 13 that cover exons, AF2, 4, 17, 18, 27, 30, 31, 35, 39, 40, 47, 48, and YU2 (Fig. 1), were characterized further.

Structure of the COL4A6 Gene-The COL4A6 gene contains 46 exons including exon 1′ at the 5′ end. Since exons 1 and 1′ are alternatively transcribed, presumably from alternative promoters (13), we numbered the following exons 2-45. Fig. 1 shows the relative location of the 46 exons. Since the five 5′ genomic clones did not overlap each other, there were four intronic gaps between the clones; however, the fragments for the entire coding region were all isolated. The 13 clones span approximately 150 kb of the structural gene. Several intron regions (introns 2, 3, 5, and 7) are not covered by these 13 clones. Considering the previous data on introns 2 and 3 characterized by pulsed-field gel electrophoresis (14) together with the results presented here, we estimate the total length of the COL4A6 gene to be not shorter than 200 kb. The neighboring gene, COL4A5, is also a long gene, at least 140 kb (20). The type IV collagen genes (COL4A1, 100 kb (21); COL4A2, 100 kb (22); COL4A5, >140 kb (20)) seem in general to be larger than fibrillar collagen genes, i.e. COL1A1, 39 kb; COL1A2, 38 kb (23); COL2A1, 35 kb; and COL3A1, 35 kb (24). Some other nonfibrillar collagen genes have also been demonstrated to be large, i.e. the chicken α1(IX) gene, >100 kb (25), and human COL13A1, 140 kb (26).
Exon Sizes and Exon/Intron Boundaries-Exon 1′ encodes the 5′-untranslated region (UTR) of the gene and the 5′ part of the signal peptide (5 amino acid residues). Similarly, exon 1 codes for another 5′-untranslated region and another 5′ part of the signal peptide (4 residues), resulting in two different signal peptides (13). The following exons, 2-7, code for the 7 S domain, and exons 7-43 code for the central COL 1 domain. The carboxyl NC1 domain is encoded by the last three exons, 43, 44, and 45. We did not determine whether there are more exons farther in the 3′ direction. The exons are distributed sparsely in the 5′-half of the gene but rather densely in the 3′-half, as seen in Fig. 1.
The exon sizes vary between 36 and >980 bp, but if the 5′- and 3′-untranslated regions and the NC1 domain are excluded, they vary from 36 to 222 bp (Fig. 2). Exon sizes of the COL4A6 gene are summarized and compared with those of other type IV collagen genes, COL4A5 (20), Col4a2 (27), COL4A2 (28), and COL4A4 (29) (Fig. 3). Exon sizes of the gene are rather small; most of them are smaller than 150 bp. Exons coding only for Gly-X-Y repeats, characteristic of collagen, are especially small. Few exons are 54 or 45 bp in size, in contrast to what is seen in fibrillar collagen genes (23).
Exon sizes of the COL4A6 gene were compared with those of the other type IV collagen genes. Exons 4-42 all encode Gly-X-Y repeats, and many of them code for imperfections in the Gly-X-Y repeats. As shown in Fig. 3, sizes of many exons coding for the COL 1 domain in the COL4A6 gene are the same as those of the Col4a2, COL4A2, and COL4A4 genes (numbers are shaded); however, none of the exons encoding the corresponding domains in COL4A5 were the same sizes as in the COL4A6 gene. This indicates that the COL4A6 gene is more closely related to COL4A2 and COL4A4 than to the genes coding for the odd-numbered α(IV) chains, α1, α3, and α5 (29). The NC1 domain of the α6(IV) chain is encoded by three different exons, 43, 44, and 45. This pattern is common to the even-numbered genes isolated and characterized so far, COL4A2, Col4a2, and COL4A4, whereas COL4A1 and COL4A5 encode the NC1 domain by means of five different exons.
In the previous study on the cDNAs we did not notice a very small EcoRI fragment. However, when we compared the nucleotide sequences of the cDNA and genomic DNA, we identified a 36-bp EcoRI fragment located in exon 39 (see Fig. 1). It contains AATTCCTGGACCTAAAGGGCCTAAGGGAGACCAAGGG, which should lie between G4124 and A4125 of the cDNA sequence in the previous paper (12). This adds the amino acid sequence (G)IPGPKGPKGDQG(I) (amino acid residues in parentheses indicate that the 36 bp contained a part of the coding nucleotides).
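The correspondence between the quoted fragment and the peptide can be checked with a few lines of code. The sketch below uses the standard genetic code and makes one assumption that is not stated explicitly in the text: the first base of the quoted string (which is 37 characters long as printed) is taken to be the third position of a glycine codon that begins upstream (GG|A), so translation starts at the second base. The function and variable names are hypothetical.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate complete codons of `dna`; trailing bases are ignored."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Fragment as quoted in the text.
FRAGMENT = "AATTCCTGGACCTAAAGGGCCTAAGGGAGACCAAGGG"

# Assumption: base 1 completes an upstream glycine codon (GG + A = GGA).
upstream_codon = "GG" + FRAGMENT[0]
peptide = translate(FRAGMENT[1:])

if __name__ == "__main__":
    print(translate(upstream_codon))  # G
    print(peptide)                    # IPGPKGPKGDQG
    print(peptide[2::3])              # every third residue: GGGG
```

Under this reading frame the fragment encodes four uninterrupted Gly-X-Y triplets, in agreement with the sequence (G)IPGPKGPKGDQG given in the text.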
All of the exon/intron boundary sequences are shown in Fig. 2. As highlighted, the dinucleotides at the beginning (gt) and end (ag) of the introns are all conserved. The first 15 exons, 3-17, of the COL4A6 gene begin with an intact glycine codon, whereas all of the last 28 exons start with a two-thirds intact glycine codon except for three exons (22, 23, and 32), which start with an intact glycine codon. Split glycine codons were also found in exons of other collagen genes (30, 31). It is intriguing that these genes all code for collagens with imperfections in the Gly-X-Y repeats and that the glycine codon is almost always split after the first G. This type of split codon is also found in noncollagenous genes that harbor Gly-X-Y repeat sequences with imperfections, such as those encoding mannose-binding protein (32), lung surfactant apoprotein (33), acetylcholinesterase (34), and complement C1q B chain (35).
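The two boundary features discussed above, the conserved gt...ag intron ends and a glycine codon that is either intact or split after its first G, can be expressed as simple checks. This is a minimal sketch on made-up toy sequences, not on the sequences of Fig. 2; the function names and the `phase` argument (the number of codon bases contributed by the upstream exon) are hypothetical conventions introduced here.

```python
def has_canonical_ends(intron: str) -> bool:
    """True if the intron starts with gt and ends with ag."""
    s = intron.lower()
    return len(s) >= 4 and s.startswith("gt") and s.endswith("ag")


def junction_codon(upstream_exon: str, downstream_exon: str, phase: int) -> str:
    """Rebuild the first codon of the downstream exon.

    phase = 0: the codon lies entirely in the downstream exon (intact).
    phase = 1: the upstream exon supplies the first base (split after G).
    phase = 2: the upstream exon supplies the first two bases.
    """
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1, or 2")
    head = upstream_exon[len(upstream_exon) - phase:]  # empty when phase is 0
    return (head + downstream_exon[:3 - phase]).upper()


def is_glycine(codon: str) -> bool:
    """Glycine codons are GGN in the standard genetic code."""
    return len(codon) == 3 and codon.upper().startswith("GG")


if __name__ == "__main__":
    # Toy sequences, invented purely for illustration.
    print(has_canonical_ends("gtaagtctttcttgcag"))
    print(junction_codon("CCTG", "GACCT", phase=1))   # split after first G
    print(junction_codon("CCTAAA", "GGTCCT", phase=0))  # intact codon
```

In the phase 1 case the downstream exon itself begins with only the last two bases of the glycine codon, which is the "two-thirds intact" situation described in the text.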
Repetitive Sequences in COL4A6-To find repetitive CA sequences, we hybridized the five phage clones, AF4, 18, 17, and 30, and YU2, that cover 50 kb from the 3′ end of COL4A6 (Fig. 1) with a (CA)9 probe. Three clones, AF18, 17, and 30, were positive. Further analysis revealed three repetitive sequences, A6YU1, A6YU2, and A6YU3 (Table I), which are located in introns 25, 31, and 45, respectively. To determine polymorphic variations of the three microsatellites in the general population, we analyzed genomic DNAs isolated from 50 unrelated Japanese females, since the COL4A6 gene is located on chromosome X. Polymerase chain reaction products from the flanking primers were analyzed on sequencing gels. As indicated in Table II, one of the three repeats, A6YU2, displayed five alleles with 0.76 heterozygosity and 0.26 PIC (polymorphism information content), whereas the other two, A6YU1 and A6YU3, showed two alleles with 0.34 heterozygosity and 0.65 PIC, and four alleles with 0.42 heterozygosity and 0.36 PIC, respectively. The combined heterozygosity was 0.82. These results indicate that the three markers are useful for distinguishing alleles of the COL4A6 gene.
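For readers unfamiliar with the two marker statistics, the sketch below computes expected heterozygosity, H = 1 − Σp_i², and the polymorphism information content in its usual definition, PIC = 1 − Σp_i² − Σ_{i<j} 2p_i²p_j², from allele counts. The text does not state which estimator was used for Table II, so this is an assumption, and the allele counts in the example are hypothetical and are not the data of this study.

```python
from itertools import combinations
from typing import List, Sequence


def allele_frequencies(counts: Sequence[int]) -> List[float]:
    """Convert allele counts (one entry per allele) to frequencies."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("need at least one observed allele")
    return [c / total for c in counts]


def expected_heterozygosity(freqs: Sequence[float]) -> float:
    """H = 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs: Sequence[float]) -> float:
    """PIC = 1 - sum(p_i^2) - sum_{i<j} 2 * p_i^2 * p_j^2."""
    pair_term = sum(2.0 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return expected_heterozygosity(freqs) - pair_term


if __name__ == "__main__":
    # Hypothetical counts among 100 X chromosomes (50 females).
    for counts in ([50, 50], [25, 25, 25, 25], [70, 20, 10]):
        f = allele_frequencies(counts)
        print(counts, round(expected_heterozygosity(f), 3), round(pic(f), 3))
```

Under these definitions PIC never exceeds H for the same set of allele frequencies, because the pairwise term subtracted from H is non-negative.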
More than 60 mutations in the COL4A5 gene have been identified in X-linked Alport syndrome patients (36). Recently, a new report showed mutations in the COL4A3 and COL4A4 genes in patients with the autosomal recessive type (37). Intriguingly, deletions of the 5′ part of both the COL4A5 and COL4A6 genes have also been revealed in seven patients with Alport's syndrome associated with diffuse leiomyomatosis (14); however, how the COL4A6 gene is involved in the pathogenesis of diffuse leiomyomatosis has not yet been clarified. Therefore, the high-heterozygosity microsatellite markers located within the COL4A6 gene will be a useful tool for linkage analysis and familial diagnosis of the diseases caused by COL4A6 gene mutations.
In conclusion, we have reported the isolation and structure of the COL4A6 gene, which is aligned together with COL4A5 in a head-to-head fashion on chromosome X. The detailed structure of the COL4A6 gene determined in this study will be important for finding mutations in patients with X-linked Alport's syndrome and/or leiomyomatosis and in those with other diseases caused by the mutated gene.