Human mucin gene MUC5B, the 10.7-kb large central exon encodes various alternate subdomains resulting in a super-repeat. Structural evidence for a 11p15.5 gene family.

Human mucin gene MUC5B is mapped clustered with MUC6, MUC2, and MUC5AC on chromosome 11p15.5. We report here the isolation of three overlapping genomic clones of human MUC5B spanning approximately 40 kilobases. We have determined their partial restriction maps and the intron-exon boundaries of the central region encoding a single open reading frame. This coding region has been completely sequenced. Its length is 10,713 base pairs, and it encodes a 3570-amino acid peptide. Nineteen subdomains have been individualized. Some subdomains show similarity to each other, creating larger composite repeat units that we have called super-repeats. Four super-repeats of 528 amino acid residues are thus observed within the central exon. Each comprises (i) a subdomain composed of 11 repeats of the irregular repeat of 29 amino acid residues, (ii) a unique conserved subdomain with no typical repeat, and (iii) a cysteine-rich subdomain. This latter subdomain has high sequence similarity to the cysteine-rich domains described in MUC2 and MUC5AC. Sequence data of these three genes, together with their clustered organization, lead us to suggest that they may be a part of a multigene family. The super-repeat present in MUC5B is the largest ever determined in mucin genes and the central exon of this gene is, by far, the largest reported for a vertebrate gene.

Mammalian respiratory, gastrointestinal, and reproductive tracts are protected by mucus secretions, of which the major components are the mucins. The mucins form a heterogeneous group of high molecular mass, polydisperse, highly glycosylated macromolecules. They are synthesized and secreted by specialized cells in the epithelium.
Considerable advances have been made over the past years toward our understanding of the structure and function of mucin glycoproteins. The isolation of mucin cDNA clones introduced a new approach to the structure of the mucins. Until now, at least eight human mucin genes have been identified (see Ref. 1 for review). The chromosomal localization of these genes has been established: four of them, MUC2, MUC5AC, MUC5B, and MUC6, are clustered on 11p15.5 between the HRAS and IGF2 genes, MUC1 is on 1q21-24, MUC3 on 7q22, MUC4 on 3q29, and MUC7 on chromosome 4q13-21. Recently, a cDNA called pAM1 has been cloned from a human tracheal library and localized on chromosome 12 (2). Three novel cDNAs have been reported: NP3a from a human nasal polyp library (3), L31 from a HT29-MTX cell line library (4), and HGM-1 from a human stomach library (5). Their sequences show that they correspond to some parts of the MUC5AC gene. More recently, a novel cDNA (pSM2-1) from human sublingual gland has been described (6). MUC1, which is developmentally regulated and aberrantly expressed by carcinomas, encodes a membrane-associated mucin-like glycoprotein. In contrast, the other described genes code for secreted mucins (1). Mucins present extended arrays of tandemly repeated sequences, producing a protein core rich in potential O-glycosylation sites and having a high content of serine, threonine, proline, glycine, and alanine. The tandem repeat units vary in length from as few as 24 bp in MUC5AC (7) to 507 bp in MUC6 (8). The tandem repeat domain is flanked on either side by nonrepeat regions. In the MUC2 gene product, these tandem repeats are flanked by cysteine-rich subdomains of approximately 845 residues upstream and 700 residues downstream. Both cysteine-rich subdomains have sequences similar to the D-domains of human pro-von Willebrand factor (9,10). Some parts of these D-domains are also found in NP3a (3) and in HGM-1 (5).
Moreover, the MUC2 gene has upstream to the tandem repeat a region of imperfectly conserved repeats flanked by another type of cysteine-rich domain (11). This latter domain has also been described in HGM-1 (5), twice in MUC5AC (12), and also in its related cDNAs (3,4).
Mucins are essential for the protective properties of the mucus (13,14). In addition, it is becoming apparent that an abnormal expression of the mucin genes occurs in various disease states and in conditions associated with a high risk of adenoma or carcinoma (15-17). Therefore, the study of the factors responsible for the regulation of mucin expression is of great interest. With this goal, it is necessary to acquire a detailed knowledge of the genomic structure of the mucin genes. The complete sequences of MUC1 (18), MUC2 (11,9,19), and MUC7 (20) cDNAs have been described. However, only partial cDNA sequences are published for the other mucin genes. The whole genomic structure is known only for MUC1 (21), although the genomic organization of MUC7 has been partially determined.
* This work was supported by "La Ligue contre le Cancer," "l'Association de Recherche contre le Cancer," and "l'Association Française de Lutte contre la Mucoviscidose." The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
The nucleotide sequence(s) reported in this paper has been submitted to the GenBank™/EBI Data Bank with accession number(s) Z72496 (HSMUC5BEX).
§ The first two authors contributed equally to this work and should therefore be considered as equal first authors.
We have previously described four human tracheobronchial cDNA clones with degenerate 87-bp tandem repeats belonging to the human MUC5B gene (23). Our laboratory has focused considerable attention on this gene, which is expressed in mucous glands of tracheobronchial tissue, submaxillary glands, gall bladder, and endocervix (24-26).
In this paper, we present the partial restriction map of the three overlapping genomic clones of the MUC5B gene spanning more than 40 kb and the complete sequence of its central exon, which encompasses 10,713 bp and codes for a 3570-amino acid peptide. Nineteen subdomains are individualized. Some subdomains show similarity to each other, creating larger composite repeat units that we call super-repeats. In addition to the tandem repeat of 29 amino acid residues as described previously (23), which we now define as being irregular or imperfectly conserved, four repeats of 528 amino acid residues are observed within the central exon. Each comprises 11 repeats of the irregular repeat of 29 amino acid residues, a unique conserved domain of 111 amino acid residues with no typical repeat but rich in threonine, serine, and alanine, and a cysteine-rich region of 108 amino acid residues. This latter subdomain has sequence similarity with cysteine-rich domains of MUC2 and MUC5AC or its related cDNAs. This super-repeat is the largest ever determined in mucin genes.

EXPERIMENTAL PROCEDURES
Southern Blot Analysis-DNA from 20 healthy unrelated volunteers was prepared from leukocytes. It was digested with the following restriction endonucleases: BamHI, BglII, EcoRI, HindIII, KpnI, PstI, XbaI, and XhoI. Fragments were separated by electrophoresis in phosphate buffer through 1% agarose gel, transferred to a nylon membrane, and hybridized as described previously (23).
Library Screening-JER57, the longest cDNA of MUC5B isolated previously (23), was first used as a probe to screen a genomic EMBL4 phage library (12). One positive clone CEL5 was isolated and studied. Other screenings only gave the same clone.
To isolate larger genomic clones of the MUC5B gene, we screened a human placenta genomic DNA library in pWE15 cosmids provided by Stratagene using the JER57 probe. Two positive clones BEN1 and BEN2 were isolated and studied.
Restriction Mapping of Cosmids-The restriction mapping strategy of Wahl et al. (27) was slightly modified as follows. Cosmids were digested to completion with the restriction enzyme NotI. For each of the other restriction enzymes used, one part of this NotI-digested cosmid was digested to completion and a second part partially digested with the enzyme in order to generate a set of fragments that began at the T7 or T3 promoters and ended at the site of cleavage of the chosen enzyme. These NotI-terminated digestion products were fractionated on an agarose gel (0.6%) and blotted to Hybond™-N+ membrane (Amersham Corp.) by capillary blotting overnight. The fragments were then mapped relative to the T7 or T3 promoters by hybridizing the blot with end-labeled oligonucleotide-sequencing primers specific for these promoters.
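The logic of this end-label mapping can be illustrated with a minimal sketch. It assumes a simplified model in which every partial-digest fragment detected by the end-specific probe runs from the labeled end of the insert to one recognition site, so that the ladder of fragment lengths gives the site positions directly. The function name `end_label_ladder` and the toy insert are hypothetical and are not part of the original protocol; the recognition sequences are the standard ones for these enzymes.

```python
# Standard recognition sequences for some enzymes named in the text.
RECOGNITION_SITES = {
    "BamHI": "GGATCC",
    "EcoRI": "GAATTC",
    "HindIII": "AAGCTT",
    "KpnI": "GGTACC",
}


def end_label_ladder(insert_seq, enzyme):
    """Return approximate lengths of partial-digest fragments that keep
    the labeled end (position 0) of the insert.

    Each fragment runs from the labeled end to one recognition site, so its
    length equals the 0-based start of that site. The exact cut offset within
    the site is ignored in this sketch.
    """
    site = RECOGNITION_SITES[enzyme]
    lengths = []
    start = insert_seq.find(site)
    while start != -1:
        lengths.append(start)
        start = insert_seq.find(site, start + 1)
    return lengths


# Hypothetical toy insert (not MUC5B sequence) carrying two BamHI sites.
toy_insert = "ATATAT" + "GGATCC" + "ACGTACGTAC" + "GGATCC" + "TTTT"
ladder = end_label_ladder(toy_insert, "BamHI")
print(ladder)  # [6, 22]
```

On a real gel the lengths would be read from band mobilities rather than computed from sequence; the sketch only shows why the band sizes translate into a map.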
5′ RACE-The 5′ AmpliFINDER RACE kit (Clontech, Inc., Palo Alto, CA) was used to synthesize first strand cDNA from 2 µg of human tracheal poly(A)+ RNA obtained from Clontech with NAU58 as first primer (5′-TTGTAGCACATCTTGAAGACGCCC-3′, antisense nt 776-799) followed by the ligation of the 5′ anchor adapter. The RACE-PCR was performed in 50-µl reaction volumes containing 5 µl of 10× buffer, 5 µl of 10 mM deoxynucleoside triphosphates, 2.5 µl of reverse-transcribed target cDNA, 10 pmol of each primer (NAU57, 5′-ACGGATCCCCTGCACACCAGGCCGAAGTG-3′, antisense nt 741-762 with nucleotides added at the 5′ end to generate a BamHI restriction site, and the 5′ anchor primer), and 1.5 units of Taq DNA polymerase (Boehringer Mannheim). After overlaying with 50 µl of mineral oil (Sigma), the mixture was denatured at 94°C for 3 min followed by 30 cycles at 94°C for 1 min, 71°C for 1 min, 72°C for 2 min. The elongation step was extended for an additional 10-min period. Secondary amplification was performed using 2.5 µl of the primary amplification product. The thermal cycling protocol used was the same as for the primary RACE amplification.
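The primer coordinates and the cycling program quoted above can be cross-checked with a short sketch. NAU57 is taken here as a continuous 29-nt sequence, on the assumption that any hyphen in the printed sequence is a line-break artifact. The helper names `span_length` and `cycling_minutes` are hypothetical, and the cycling time counts hold times only, ignoring ramping between temperatures.

```python
NAU58 = "TTGTAGCACATCTTGAAGACGCCC"        # antisense nt 776-799
NAU57 = "ACGGATCCCCTGCACACCAGGCCGAAGTG"   # antisense nt 741-762 plus a 5' tail
BAMHI_SITE = "GGATCC"                     # standard BamHI recognition sequence


def span_length(first_nt, last_nt):
    """Number of nucleotides in an inclusive coordinate range."""
    return last_nt - first_nt + 1


def cycling_minutes(initial, cycles, step_minutes, final_extension):
    """Total hold time of a PCR program, in minutes (ramp times ignored)."""
    return initial + cycles * sum(step_minutes) + final_extension


# NAU58 should be exactly as long as the range it is said to cover.
nau58_matches_range = len(NAU58) == span_length(776, 799)

# NAU57 is longer than its target range; the excess is the added 5' tail.
nau57_tail_length = len(NAU57) - span_length(741, 762)
nau57_has_bamhi = BAMHI_SITE in NAU57

# 94 C for 3 min, then 30 x (1 + 1 + 2 min), then 10 min extra elongation.
total_minutes = cycling_minutes(3, 30, (1, 1, 2), 10)

print(nau58_matches_range, nau57_tail_length, nau57_has_bamhi, total_minutes)
```

Under these assumptions the 24-nt NAU58 matches its stated range, NAU57 carries a 7-nt tail that completes a GGATCC site, and each amplification round holds for 133 min.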
RNA Extraction-Total RNA was extracted from a human gall bladder using the guanidine isothiocyanate/CsCl method (28,29).
Reverse Transcription and Amplification-A sample (0.5 µg) of human tracheal poly(A)+ RNA (Clontech) and a sample (1 µg) of total RNA extracted from human gall bladder were reverse-transcribed with the 1st-STRAND™ cDNA synthesis kit (Clontech) using random primers according to the manufacturer's instructions. The thermal cycling protocol used was the same as the one described above except that the annealing temperature was 62°C. PCR experiments were performed using a Perkin-Elmer apparatus.
Cloning of Amplification Products-RACE-PCR products were separated by electrophoresis. Parts of the gel containing bands of interest were excised and the DNA was purified using Glassmilk (BIO 101, Inc.) and cloned into pGEMT vector (Promega). PCR products were purified using Preps DNA purification resin (Promega) and cloned into pMOSblue vector (Amersham).
Cloning in pKS-The fragments of interest from phage or cosmid clones were subcloned into the pBluescript KS(+) vector from Stratagene.
Plasmid DNA Purification-We used the Wizard™ minipreps DNA purification system (Promega).
DNA Sequencing and Sequence Analyses-The clones were sequenced on both strands by the dideoxy chain termination method using α-35S-dATP with Sequenase version 2.0 (U. S. Biochemical Corp.), Sequitherm (TEBU), or the T7 Sequencing™ kit (Pharmacia Biotech Inc.). They were sequenced using synthetic oligonucleotides corresponding to the T7 and T3 primers of the pKS plasmid, to the T7 and −40 primers of the pGEMT or pMOSblue vector. Part of the sequence was determined by primer walking using primers specific to the MUC5B gene. Analyses of nucleic acid and protein sequence data were performed using PC/GENE Software.
To perform DNA sequencing directly on cosmids (2 µg), we had to anneal at 37°C for 30 min and to use 5 pmol of primers instead of the 0.5 pmol used when sequencing DNA in plasmids with Sequenase version 2.0. The nucleotide sequence reported in this paper has been submitted to the EMBL Data Bank with accession number Z72496.

RESULTS
Southern Analysis-Human genomic DNA from leukocytes of 20 healthy unrelated volunteers was digested with BamHI, BglII, EcoRI, HindIII, KpnI, PstI, XbaI, and XhoI. The sizes of the fragments obtained and hybridized with the JER57 probe are indicated in Table I. These results indicated the fragments of interest recognized with the JER57 probe. We isolated all these fragments from a phage or a cosmid genomic library and sequenced them to obtain the complete sequence recognized with the JER57 probe (see below).
Isolation and Restriction Mapping of MUC5B Genomic DNA Clones-The EMBL4 human genomic library has been screened with the JER57 probe, and one positive clone with an insert of approximately 12 kb called CEL5 was obtained. The JER57 probe hybridized with one BamHI-BamHI fragment of 1.1 kb (indicated in Figs. 1 and 2A) situated in the 5′ part of this clone. This fragment was completely sequenced. Other screenings of the EMBL4 genomic library only gave the same clone. This led us to screen a human placenta genomic DNA library in cosmid vector pWE15 using JER57 as probe. Two cosmid clones containing inserts of approximately 40 kb were obtained and called BEN1 and BEN2. The partial restriction map of the three clones (CEL5, BEN1, and BEN2) is indicated in Fig. 1. BEN1 contains a CpG island, since restriction sites such as BssHII and NotI are close together (30,31). This island is located at 24 kb from the 5′-end of the insert and corresponds to the I8 CpG island on the macrocartography performed in the 11p15.5 region (32). BEN2 overlaps the 3′ region of BEN1 over 16 kb and completely overlaps the CEL5 clone. We have found on these clones all the restriction fragments corresponding to the fragments recognized with the JER57 probe and observed on Southern blots of genomic DNA (Fig. 1 and Table I).
Strategy for the Sequencing of the Cosmid Genomic Fragments Hybridizing with the JER57 Probe-We prepared, subcloned into pBluescript KS(+), and sequenced all the fragments indicated in Fig. 2A at the top. Various restriction fragments derived from these clones were also subcloned to determine the entire sequence. We distinguished the fragments obtained from BEN2 only (with an asterisk). One BamHI-BamHI fragment of 1.1 kb was obtained from BEN2 and from CEL5 (noted with an asterisk and ++++). Since some fragments had the same lengths but were at various positions, we had to carefully isolate them starting from larger fragments unambiguously positioned, obtained for example only from BEN2, or for others only from BEN1. Due to an extremely high proportion of GC residues, many sequence problems were encountered. Subclones were thus sequenced several times before a reliable sequence was obtained.
Determination of Intron-Exon Boundaries on Each Side of the Central Exon-To obtain cDNAs upstream of the central tandem repeat region we used 5′ RACE-PCR with the primers NAU58 and NAU57 chosen in a nonrepeat region in 5′ of the central region. The amplification products were analyzed by agarose gel electrophoresis. Four ethidium bromide-stained bands ranging from 0.4 to 1 kb were obtained (data not shown) and cloned into pGEMT vector. Several transformants were isolated and sequenced. The nucleotide sequence showed that the 3′-ends of all the clones sequenced overlap with the genomic sequence. The comparison of the sequence of the longest cDNA (985 bp) with the sequence of the BamHI-KpnI fragment of 1.35 kb (noted with a vertical arrow on the left in Fig. 2A) from the cosmid BEN1 indicates an intron of 468 bp (nucleotide sequence not shown), thus identifying the 5′-end of the central exon. Splice acceptor and donor sequences agree with the "GT-AG" rule (33).
To determine the 3′-end of the central exon we performed RT-PCR using two primers (NAU128 and NAU71, their positions are shown on Fig. 2A) on human RNAs from tracheobronchial tissue and from gall bladder. Then we compared the sequences obtained after subcloning into pMOSblue vector with those determined on fragments of the cosmid located 3′ of the central exon: the BamHI-SacII fragment of ~1.7 kb and the PstI-SacII fragment of ~1.3 kb (noted with vertical arrows on the right in Fig. 2A). The presence of an intron 27 bp downstream from the PstI site was evidenced by comparing the sequences of the two cDNAs obtained with the genomic sequence. Splice acceptor and donor sequences agree with the GT-AG rule (33).
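The comparison of cDNA with genomic sequence described above can be sketched as follows. This is a minimal illustration of the GT-AG rule only, using hypothetical toy sequences rather than MUC5B data; the function names `follows_gt_ag` and `splice_out` are assumptions introduced for the example.

```python
def follows_gt_ag(intron):
    """True if the intron starts with GT (donor) and ends with AG (acceptor)."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def splice_out(genomic, intron_start, intron_end):
    """Join the exons flanking genomic[intron_start:intron_end]."""
    return genomic[:intron_start] + genomic[intron_end:]


# Hypothetical toy gene: two exon pieces separated by one short intron.
exon_a, intron, exon_b = "ATGGCC", "GTAAGTCCCTTTCAG", "ACCACC"
genomic = exon_a + intron + exon_b

# The spliced product is what a cDNA derived from the mRNA would contain.
cdna = splice_out(genomic, len(exon_a), len(exon_a) + len(intron))

print(follows_gt_ag(intron), cdna)
```

A stretch present in the genomic sequence, absent from the cDNA, and bounded by GT and AG is the signature that was used to place the intron-exon boundaries.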
In order to confirm that the sequence was in the correct order, we performed restriction maps of overlapping fragments such as the 8-kb KpnI-KpnI fragment from BEN2 or the 5.5-kb HindIII-NotI 3′-end fragment from BEN1 (Fig. 1).
To confirm that this sequence forms only one exon, additional PCR products were produced using various pairs of primers (see Fig. 2A). We compared the lengths and the sequences of the fragments obtained first by RT-PCR on gall bladder or tracheobronchial RNA and by PCR on genomic DNA (cosmids BEN1 or BEN2). Using NAU81 and NAU112 we obtained fragments of ~2500 bp, using NAU113 and NAU112 fragments of ~1400 bp, using NAU82 and NAU136 fragments of ~2100 bp. In each case, we obtained the same lengths and the same sequences with either cDNAs or genomic DNA as templates. No intron was detected in this way. Thus the sequenced region consists of a single long open reading frame which extends over 10,713 bp (submitted to the EMBL Data Bank with accession number Z72496) coding for 3570 amino acid residues and leading to a polypeptide core with a calculated Mr of 370,000.
Comparison with the cDNAs Described Previously-The nucleotide sequences of the previously isolated cDNAs JUL10, JER28, and JUL7 (23) have been positioned within the central exon. JUL10 is at position 5272-6258. JUL7 overlaps JER28. They are respectively positioned at 8510-10143 and at 9172-9732. Some small differences in sequence were observed between the genomic sequence determined here and these cDNAs as determined previously by us (23). As we did not succeed in positioning the JER57 sequence as reported by us (23), we redetermined its nucleotide sequence (accession number X74955 with correction submitted to the EMBL Data Bank). The reason for this discrepancy between this newly determined sequence of JER57, which is at position 8361-10222, and our previous work is explained under "Discussion." These cDNAs were obtained by screening a human tracheobronchial tissue cDNA expression library using antibodies (23).
The cysteine-rich subdomains are shown in Fig. 3. Their average amino acid composition is given in Table II (column Cys). These subdomains are rich in cysteine residues (9.3%). Fig. 3 also shows the similarity between these cysteine-rich subdomains and homologous domains found in other mucins such as human MUC5AC or related cDNAs (3,5), mouse Muc5ac (34), pig gastric mucin (35), human MUC2 (9), and rat MUC2 homologue (36). In addition to the remarkable conservation of the cysteine residues, the conservation of numerous amino acid residues (bold boxes or boxes in Fig. 3) such as tryptophan, proline, arginine, and glycine is to be noted, especially between MUC5B and MUC5AC and its related cDNAs. There is one conserved potential O-glycosylation site in each of these cysteine-rich subdomains (boxed together with a tryptophan residue in the upper part of Fig. 3). No potential N-glycosylation site exists. The Cys4 to Cys7 subdomains of MUC5B mucin present a perfect sequence similarity except in three positions.
The average amino acid composition of the seven cysteine-rich subdomains (see column Cys in Table II) is very similar to that of domain III in HGM-1 (5) (Fig. 3). It is noticeable that all but 2 of the cysteine residues of the central exon are in the cysteine-rich subdomains. Thirty-one of the 32 tyrosine residues present in the central exon are in the cysteine-rich subdomains.
The first two subdomains between Cys1-Cys2 and Cys2-Cys3, subdomains R01 (64 amino acid residues) and R02 (174 amino acid residues), respectively, are enriched with threonine, serine, proline, and alanine (Table II), but no typical repeats or even imperfect repeats can be discerned. Searching of the GenBank™ data base indicated that these sequences were not identical with any registered sequence. A certain similarity with other mucins, especially MUC1 or MUC2, is due to the typical amino acid composition of mucins.
Five subdomains (RI to RV) composed of various numbers of 87-bp imperfect tandem repeats are encountered downstream of the regions coding for the cysteine-rich subdomains Cys3 to Cys7. The alignment of the corresponding nucleotide sequences is shown in Fig. 4A. Domains RI, RII, and RIV are very similar. Each contains 11 imperfectly conserved repeats of 87 bp. The comparison is more obvious when looking at the deduced amino acid sequences (Fig. 4B). The subdomain RIII contains the same 11 irregular repeats of 29 amino acid residues from RIII-6 to RIII-17 (except RIII-7). Moreover, RIII-1 to RIII-5 are very similar to the first five repeats of subdomains RI, RII, and RIV.
The subdomain RV is composed of 22 irregular repeats. RV-1 to RV-5 are also very similar to the first five repeats of the subdomains RI to RIV. RV-6 to RV-17 (except RV-7 and RV-9) are similar to the 11 imperfectly conserved repeats of the other R-subdomains except for the insertion of the pentapeptide PTTTT in RV-14. Beneath the alignment is indicated the consensus sequence of this imperfectly conserved repeat and the percentages of the indicated amino acid residue(s). One serine residue has 100% conservation and 6 amino acid residues have more than 80% conservation. As far as all the five R-subdomains are concerned, 44  residues. The amino acid composition (Table II) shows an extremely high threonine content (37%) and high serine, proline, and alanine contents. There are numerous potential O-glycosylation sites. The sequence TXXP, which has been implicated as a major site for GalNAc addition, is found in most of the 72 repeats (5,37). These regions are likely to be heavily glycosylated. Moreover, one potential N-glycosylation site exists in each R-subdomain, and it is noticeable that the surrounding sequences are exactly the same for the potential N-glycosylation sites situated in RI, RII, RIII, and RIV (TPPVP N TTATT).
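The repeat counts and glycosylation motifs discussed for the R-subdomains can be checked with a short sketch. It assumes that RIII holds 17 repeats (inferred from the RIII-1 to RIII-17 numbering) and uses the usual N-X-S/T sequon definition with X not proline, which the text does not spell out. The function names and the toy peptide are hypothetical; only the peptide TPPVPNTTATT comes from the text.

```python
import re

# Repeats per R-subdomain; the RIII value is inferred from its repeat numbering.
repeats_per_subdomain = {"RI": 11, "RII": 11, "RIII": 17, "RIV": 11, "RV": 22}
total_repeats = sum(repeats_per_subdomain.values())

# An 87-bp repeat unit encodes 29 amino acid residues.
residues_per_repeat = 87 // 3


def txxp_sites(peptide):
    """0-based starts of every T-X-X-P motif, overlapping matches included."""
    return [m.start() for m in re.finditer(r"(?=T..P)", peptide)]


def n_glycosylation_sequons(peptide):
    """0-based starts of N-X-S/T sequons, where X is any residue but proline."""
    return [m.start() for m in re.finditer(r"(?=N[^P][ST])", peptide)]


conserved_context = "TPPVPNTTATT"   # context of the N-glycosylation site (from the text)
toy_peptide = "TSAPTTTPATTSP"       # hypothetical threonine-rich peptide

print(total_repeats, residues_per_repeat)
print(n_glycosylation_sequons(conserved_context))  # [5]
print(txxp_sites(toy_peptide))                     # [0, 4, 9]
```

The lookahead form of the pattern is used so that overlapping motifs, which are common in threonine-rich stretches, are all reported.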
Concerning the sequences called RI-end, RII-end, RIII-end, and RIV-end, the alignment of their nucleotide sequences (Fig. 5A) shows a striking similarity. A consensus sequence is indicated. The amino acid sequences are aligned in Fig. 5B. The 111 amino acid residues present a perfect sequence similarity to each other except in eight positions. Searching of the GenBank™ data base indicated that these sequences are not identical with any registered sequence. A certain sequence similarity with other mucin genes is only due to the high percentage of threonine, serine, proline, and alanine (Table II) encountered in these sequences. There are numerous potential O-glycosylation sites and five TXXP sequences are present in each R-end subdomain.
The central exon ends with a 75-amino acid peptide that we have called R03, since it consists of a sequence different from those of the other subdomains of the MUC5B central exon. It is enriched in threonine, serine, proline, phenylalanine, and valine (Table II), but no typical repeat exists. It is different from the sequences found at the 3′-ends of the tandem repeats of MUC5AC (3,4) and of MUC2 (9).
Close examination revealed the existence of a super-repeat found four times within the MUC5B central exon; we have called the four copies UpA, UpB, UpC, and UpD (Fig. 2B). Restriction mapping and sequence determination have shown that the BamHI digestion sites approximately flank the super-repeats (Fig. 2A). The BamHI digestion sites are quite regularly distributed within the central exon. This super-repeat consists of an R-subdomain with 11 imperfect repeats of 29 amino acid residues followed by an R-end subdomain and a cysteine-rich subdomain. It contains 528 amino acid residues.
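The subdomain bookkeeping can be tallied in a short sketch. The grouping below (seven Cys, R01 to R03, five R-subdomains, four R-end subdomains) is one reading of the text that arrives at the stated total of 19; the nominal lengths assume that every irregular repeat is exactly 29 residues long, which the text does not claim, so the sums are only rough cross-checks against the reported 528 and 3570 residues. All variable names are hypothetical.

```python
# (number of copies, residues per copy) for subdomains of fixed stated length.
fixed_subdomains = {
    "Cys1-Cys7": (7, 108),
    "R01": (1, 64),
    "R02": (1, 174),
    "R03": (1, 75),
    "R-end (RI-end to RIV-end)": (4, 111),
}
r_subdomain_count = 5        # RI to RV
total_29mer_repeats = 72     # irregular repeats across RI to RV
repeat_length = 29           # nominal length; actual repeats are irregular

# Number of individualized subdomains under this grouping.
subdomain_count = sum(n for n, _ in fixed_subdomains.values()) + r_subdomain_count

# Nominal super-repeat: 11 repeats + one R-end + one cysteine-rich subdomain.
nominal_super_repeat = 11 * repeat_length + 111 + 108

# Nominal length of the whole central-exon peptide.
nominal_total = (
    sum(n * length for n, length in fixed_subdomains.values())
    + total_29mer_repeats * repeat_length
)

print(subdomain_count)                 # 19
print(nominal_super_repeat, 528)       # nominal vs reported super-repeat length
print(nominal_total, 3570)             # nominal vs reported peptide length
```

The nominal figures (538 and 3601 residues) sit within about 2% of the reported ones; the sketch does not attempt to explain the remaining difference.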
The alignment of these four super-repeats has been performed (not shown since the alignments of the various sequences existing in these super-repeats have already been shown in Figs. 3, 4B, and 5B). They present a striking similarity except in the 11 positions noted before (in the R-end and the cysteine-rich subdomains) and in 54 positions in the first part of the super-repeat composed of the 11 imperfectly conserved repeats of 29 amino acid residues.

DISCUSSION
We report here the isolation of three overlapping genomic clones of the human MUC5B mucin, one from a phage library (CEL5) and two from a cosmid library (BEN1 and BEN2) using the longest cDNA of MUC5B that we had previously isolated in the laboratory and called JER57 (23). MUC5B has been mapped clustered with MUC6, MUC2, and MUC5AC to chromosome 11p15.5. The partial restriction map of the three overlapping clones has been determined. We have also determined the intron-exon boundaries of the central region encoding a single open reading frame. This coding region has been completely sequenced. Its length is 10,713 bp, and it codes for 3570 amino acid residues. MUC5B is expressed in mucous glands of tracheobronchial tissue, submaxillary glands, gall bladder, and endocervix (24-26). RT-PCR experiments carried out on total RNA from human gall bladder or on human tracheal poly(A)+ RNA unambiguously demonstrated that this open reading frame consists of a single exon and does not contain any intron. It was not possible to amplify the complete length of this open reading frame by RT-PCR using the two primers, NAU81 and NAU82, that are on both ends of the central exon. However, by performing different amplifications using more closely situated primers, in particular in the MUC5B specific part of the cysteine-rich domains, we were able to produce fragments exhibiting the same lengths and the same sequences as those in the cosmid clones.
No difference was found in the common sequences determined on the genomic clones from the two libraries, even though they came from two unrelated individuals. This gene differs from MUC1 and MUC2, for which variable number of tandem repeat polymorphisms have been demonstrated (38,11). Since the repeat unit of 87 bp is imperfectly conserved, we had to sequence the whole central exon, while in MUC2 the authors were able to extrapolate the sequence repeated up to 100 times (9,11). The central exon is much longer than that of MUC1, which varies from 3.5 to 6.2 kb, depending on the number of repeats (38). MUC7, a non-gel-forming mucin, possesses a central exon of 2.2 kb (22). As far as MUC2 is concerned, the intron/exon distribution is not as yet known. Toribara et al. (11) have indicated that the 5′ region of the GMUC clone isolated in their laboratory forms, together with [...]
[Figure legend fragment] In A, identical nucleotide sequences are shaded, and a consensus sequence is given: R is A or G, Y is T or C, M is A or C, and S is C or G. The residues of serine and threonine are shadowed in B.
Nineteen subdomains have been individualized. Seven subdomains of 108 amino acid residues, called Cys1 to Cys7, contain 10 cysteine residues and show a very similar organization to that observed twice in human (11) and rat (36) MUC2, at least twice in human (12) and mouse (34) MUC5AC [...] and L31 (4). The structure of the cysteine-rich subdomains has been conserved over a long evolutionary time scale, and the evolutionary constraint has likely been maintained because of crucial disulfide bonds. Thus, this typical cysteine-rich domain seems to be a feature of at least three of the four human mucin genes located on 11p15.5; little information is available as to whether there is a similar domain in MUC6 (8). It is more and more tempting to speculate, as have Toribara et al. (8), that at least MUC2, MUC5AC, and MUC5B genes on chromosome 11p15.5 are part of a multigene family whose members code for secreted mucins. The conservation of numerous amino acid residues in addition to the cysteine residues suggests an important role for these domains. It is noticeable that the MUC5B and MUC5AC genes show greater similarity with each other than with the MUC2 genes (from human and rat). Deletions of amino acid residues occur at the same positions in MUC5B and MUC5AC, while some observed deletions are characteristic of the MUC2 type. Moreover, MUC5B and MUC5AC are very close together in the cluster (32). Overall, these observations suggest that MUC5AC and MUC5B may have the same ancestral gene.
It is noteworthy that the mouse mucin gene homologous to human MUC5AC is mapped to a site on mouse chromosome 7 homologous to the location of the human secretory mucin gene cluster on human chromosome 11p15.5 (34). We do not as yet know if mouse possesses both Muc5ac and Muc5b. Perhaps only one gene exists.
The presence of multiple cysteine residues strongly supports the idea that the MUC5B gene codes for a secreted mucin as disulfide bonding is necessary for the formation of a mucus gel. Moreover, we can postulate that interactions may occur via cysteine-rich domains between several mucin molecules producing a complicated network. This can explain the observations made using electron microscopy of bronchial mucins which have been shown to form complex entangled structures (41) or "bush-like" aggregates (42). The cysteine-rich domains may play a highly conserved role such as the packaging of the mucins or in mediating interactions with numerous other proteins such as, for example, the association with bilirubin in the gall bladder matrix (43). Further studies will be required to more adequately clarify the role of these domains.
An examination of the newly determined sequence of JER57 cDNA led us to conclude that the changes in the reading frame emphasized in our previous work (23) were in fact due to errors in sequence determination because of difficulties encountered in determining the sequence of such imperfectly conserved repeat domains. This is a problem that we have now overcome. We will now call the 87-bp repeat "imperfectly conserved" or "irregular" rather than "degenerate" repeat as designated previously. The cDNAs previously isolated (23) essentially consist of parts of the R-subdomain. In fact, the repeat of 29 amino acid residues of the R-subdomains appears to be only one of the components of the super-repeat. This super-repeat is the largest ever determined in mucin genes. The largest described until now was the 507-bp repeat unit of MUC6 (8). The super-repeat in MUC5B is more than three times as long as the MUC6 repeat. Its particularity is that it is made of three different subdomains. A variable number of this super-repeat has not yet been observed. An interesting aspect of the deduced polypeptide sequence is the alternating arrangement of the three types of subdomains. The four R-end subdomains show a striking similarity to each other. Their role is likely to be crucial.
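The size comparison with the MUC6 repeat follows from simple arithmetic, sketched below. It assumes only that the 528-residue super-repeat corresponds to 528 codons of coding sequence; the variable names are hypothetical.

```python
SUPER_REPEAT_RESIDUES = 528   # MUC5B super-repeat, in amino acid residues
MUC6_REPEAT_BP = 507          # MUC6 tandem repeat unit, in base pairs

# Three nucleotides encode one residue.
super_repeat_bp = SUPER_REPEAT_RESIDUES * 3
muc6_repeat_residues = MUC6_REPEAT_BP // 3

# Ratio of the two repeat units, compared at the nucleotide level.
length_ratio = super_repeat_bp / MUC6_REPEAT_BP

print(super_repeat_bp, muc6_repeat_residues, round(length_ratio, 2))
```

The super-repeat therefore spans 1584 bp against 507 bp (169 residues) for MUC6, a ratio of about 3.1.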
The conservation of the potential N-glycosylation sites within the repeat may be important for the proper maturation of MUC5B providing the positions at which N-glycosylation occurs. Recently, Denny et al. (44) reported that molecular cloning of mucin apoproteins showed that the consensus sequence(s) for N-glycosylation is usually found outside of the repeat domain. For MUC5B, five out of the seven potential N-glycosylation sites are within the repeat region. The subdomains R01, R02, and R03 display amino acid compositions typical of mucins but do not contain any repeat.
The MUC5B gene is clustered in 11p15.5 with MUC2, MUC6, and MUC5AC (8,32). Although mucins are characterized by having tandem repeat regions containing significant amounts of threonine, serine, alanine, and proline, the individual repeat units for each of these genes are very different in length and amino acid sequence. Moreover, the number of the cysteine-rich subdomains emphasized here, which are different from the D-domains of human pro-von Willebrand factor (10), differs considerably between MUC2 and MUC5B. The role of these two types of secreted mucins may be highly specialized as suggested by in situ hybridization experiments which show a different expression pattern for each gene (24). In fact, we will now be able to design new oligonucleotides and produce fusion proteins and antibodies in order to reinvestigate the expression of the MUC5B gene using in situ hybridization and immunohistochemistry.
Experiments performed to elucidate the entire genomic organization of MUC5B are now progressing as is the study of the regulatory regions. Whether the regulatory elements of MUC5B function coordinately with the regulatory elements of adjacent mucin genes in the cluster is an important question to answer in the future.