Genomic Organization of the Human Mucin Gene MUC5B
cDNA AND GENOMIC SEQUENCES UPSTREAM OF THE LARGE CENTRAL EXON*

The complete structure of the DNA encoding the polypeptide chain of human mucin MUC5B has been determined. In this paper, we report the full-length cDNA (3886 bp) and genomic (15,143 bp) sequences upstream of the unusually large central exon of the human mucin gene MUC5B. This region, composed of 29 exons, encodes 1283 amino acid residues. Exon sizes vary from 44 to 262 bp, and intron sizes range from 87 to 1703 bp. We determined the 5′-end of MUC5B by rapid amplification of cDNA ends-polymerase chain reaction experiments, which yielded amplified products ending at the same nucleotide, and by primer extension experiments. A putative translation start site was found at nucleotide 37. We compared the amino-terminal region of MUC5B with those of pro-von Willebrand Factor, MUC2 and MUC5AC, and the animal mucins RMuc2, PSM, and FIM-B.1. The primary amino acid sequence, which has a high content of cysteine residues, shows a high degree of similarity with other members of the 11p15 mucin gene family, particularly MUC5AC. The complete genomic organization and both full-length genomic and cDNA sequences of MUC5B have been elucidated. This gene contains 48 exons and encodes 5662 amino acid residues to give a polypeptide with an Mr of approximately 600,000.

Mucins are the major components of mucus, the viscoelastic fluid that protects and lubricates the epithelial surfaces of the respiratory, digestive, and reproductive tracts. Mucins constitute a family of high molecular mass, polydisperse, highly glycosylated proteins. They are synthesized and secreted by specialized epithelial cells. Alterations of the biosynthesis of mucins affecting the protein core and/or the carbohydrate chains linked to the peptide have been observed in numerous pathological situations (1-4). At least eight human mucin genes have been identified (for review, see Gendler and Spicer (5) and Ref. 6). Four of these mucin genes, MUC6, MUC2, MUC5AC, and MUC5B, have been mapped to 11p15.5 on a single band of 400 kilobases (7). Only the genomic structures of the two shortest human mucin genes, MUC1 (8,9) and MUC7 (10,11), have been reported. The central domains of the large apomucins MUC2, RMuc2, MUC5AC, FIM-B.1, and PSM, which contain tandem repeats rich in Ser, Thr, Gly, and Ala, are flanked at both ends by unique cysteine-rich domains (12-21). These domains have significant amino acid sequence similarity to one another and to those in human pro-von Willebrand Factor (pro-vWF) (22).
The MUC5B gene is expressed mainly in bronchus glands and also in submaxillary glands, endocervix, gall bladder, and pancreas (30-34). We have previously subcloned and sequenced the unusually large central exon of 10,713 bp of MUC5B (35) and the 3′ region (28). The limited data available until recently on the putative region upstream of the tandem repeat array of MUC5AC (18) suggested that the amino-terminal region of MUC5AC could also be similar to the region of pro-vWF containing several D-domains, as in MUC2, PSM, and FIM-B.1 (12-14, 20, 21). This has been confirmed very recently (19,36). Hypothesizing that such a similarity also exists in the amino-terminal part of MUC5B, we performed RT-PCR using degenerate primers chosen in sequences that show a high degree of similarity between the known sequences of the MUC genes and the vWF gene. In this study, we present the genomic organization and the complete sequence (15,143 bp) of the region upstream of the central exon of MUC5B and the cDNA sequence encoding the amino-terminal region (1283 aa) of the apomucin. Comparison of the deduced amino acid sequences shows that MUC5B is most similar to MUC5AC.

EXPERIMENTAL PROCEDURES
Synthesis of Oligonucleotide Primers-Oligonucleotide primers used in PCR, RACE-PCR, RT-PCR, and sequencing experiments were synthesized by Eurogentec (Liège, Belgium). Their positions and sequences are indicated in Table I.
5′-End Amplification of cDNAs-The 5′-AmpliFINDER RACE kit (CLONTECH) was used to synthesize first-strand cDNA from human trachea poly(A)+ RNA (1 μg) obtained from CLONTECH using NAU65 or NAU75 as first primer, followed by ligation of the 5′-anchor adapter. The PCR was then performed using the nested primer NAU66 or NAU76, respectively, and the 5′-anchor primer. Nested PCRs (RACE66 and RACE76 in Fig. 1) involving a second or a third round of amplification were carried out with 1 μl of the reaction mixture obtained from each previous round of PCR as template. Another RACE-PCR experiment (RACE367, Fig. 1) was performed using NAU433 and then NAU367 as nested primer. These specific primers were designed from the sequences of progressively amplified products using RT-PCR. The last 5′ RACE-PCR (RACE324, Fig. 1) was performed using the 5′/3′ RACE-PCR kit (Boehringer Mannheim) and the primer NAU180; the cDNA was purified using the High Pure PCR Product Purification Kit (Boehringer Mannheim). Terminal transferase was used to add a homopolymeric A-tail to the 3′-end of the cDNA, and tailed cDNA was then amplified by PCR using NAU324 and the oligo(dT)-anchor primer. The PCR products were subcloned into pGEM T-vector (Promega) or PCR2-1 vector (Invitrogen).
* This work was supported by "le Comité du Nord de la Ligue Nationale contre le Cancer" and "l'Association de Recherche contre le Cancer." The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.
The nucleotide sequence(s) reported in this paper has been submitted to the GenBank™/EBI Data Bank with accession number(s) AJ004862 (5′-genomic sequence).
RT-PCR Amplification-Total RNA (1 μg) of human gall bladder was extracted as described previously (37). Single-stranded cDNA was synthesized using the First-strand cDNA Synthesis Kit (CLONTECH) and random hexamers. The 4-fold degenerate primer NAU105 (Table I)    is Cys-980), and HGM1 (where the first C residue is Cys-28). The 32-fold degenerate primer NAU101 (Table I) was based on seven amino acids, CGLCG(N/D)(F/Y), which occur in pro-vWF domain D1 (where the first C residue is Cys-159), domain D2 (where the first C residue is Cys-521), and MUC2 (where the first C residue is Cys-166). Both were used as sense primers. For the third RT-PCR, we used the primer NAU366 (see Table I), whose sequence was determined from the genomic sequence of the 1.3-kilobase SmaI-SmaI fragment hybridizing with NAU101 (in the left part of Fig. 2A). NAU76, NAU111, and NAU114 were used as antisense primers (Table I) for the PCR. The first-strand cDNA was amplified with the three pairs of primers NAU105/NAU76, NAU101/NAU111, and NAU366/NAU114 (Fig. 1) on a DNA Thermal Cycler 480 (Perkin-Elmer). PCR parameters were 94°C for 2 min, followed by 30 cycles at 94°C for 30 s, 60°C for 1 min, and 72°C for 2 min, followed by a final extension at 72°C for 15 min. After electrophoresis, the amplified products were purified (Wizard PCR Preps DNA Purification System, Promega) and subcloned into the pMOSBlue T-vector (Amersham). We confirmed the sequences of all these cDNA fragments using the Titan™ One Tube RT-PCR kit (Boehringer Mannheim) on human trachea poly(A)+ RNA (CLONTECH) with various pairs of specific primers. The double-stranded plasmid inserts were sequenced manually using [α-35S]dATP (Amersham) and Sequenase 2.0 (U. S. Biochemical Corp.) according to the manufacturer's protocol. Universal primers or a series of specific oligonucleotides (Table I) were used. The clones were sequenced on both strands several times.
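The "fold degeneracy" quoted for the RT-PCR primers is the number of distinct oligonucleotides in the primer mixture. The sketch below shows how such a figure is computed from an IUPAC-coded sequence. The function name and the example strings are illustrative assumptions: they are codon families from the standard genetic code, not the NAU105 or NAU101 sequences, which are listed in Table I and are not reproduced here.

```python
from math import prod

# Number of nucleotides represented by each IUPAC ambiguity code.
IUPAC_CHOICES = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def degeneracy(primer: str) -> int:
    """Return how many distinct sequences an IUPAC-coded primer represents."""
    return prod(IUPAC_CHOICES[base] for base in primer.upper())


# Codon families from the standard genetic code (illustrative only,
# NOT the primer sequences of Table I):
cys = "TGY"          # Cys: TGT or TGC
asn_or_asp = "RAY"   # Asn (AAT/AAC) or Asp (GAT/GAC)
phe_or_tyr = "TWY"   # Phe (TTT/TTC) or Tyr (TAT/TAC)

print(degeneracy(cys))                      # 2
print(degeneracy(asn_or_asp))               # 4
print(degeneracy(asn_or_asp + phe_or_tyr))  # 16
```

Degeneracy multiplies across positions, so covering the two-residue alternatives (N/D)(F/Y) alone already costs a factor of 16. A primer covering the whole heptapeptide at only 32-fold degeneracy must therefore fix some codon positions rather than cover every synonymous codon.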
Sequence analyses were performed with the help of the PC/GENE software.
Primer Extension-Primer extension was performed using two different primers end-labeled with [γ-32P]ATP, NAU346 and NAU324, which correspond to cDNA positions +93 to +116 and +133 to +157, respectively. Total human tracheobronchial RNA and human poly(A)+ salivary mRNA (CLONTECH) were used. The primers were annealed to the RNA and then extended by reverse transcription. The products were analyzed on a 6% polyacrylamide gel (Sequagel-6™, National Diagnostics) alongside a sequencing ladder, i.e. the sequencing reaction performed on M13 vector with the −40 primer using [α-35S]dATP (Amersham) as indicated above.
Restriction Mapping of the Cosmid BEN1-Cosmid BEN1 DNA was analyzed by restriction digestion and Southern blot as described previously (28,35) to identify useful DNA fragments and especially those hybridizing with the primers NAU105 and NAU101.
Characterization of Genomic DNA, Sequencing, and Sequence Analysis-We performed DNA sequencing directly on the cosmid BEN1 as described by Desseyn et al. (35) with the primers used for PCR. Sequencing was also performed as described above after subcloning into pBluescript II KS+ vector (Stratagene). The genomic sequence of the 5′-region of MUC5B reported in this paper has been submitted to the EBI data bank with accession number AJ004862.

RESULTS
Strategy for Cloning the cDNA of the Amino-terminal Region of MUC5B-In previous work, we performed a 5′ RACE-PCR experiment and produced the RACE57 clone (985 bp), which allowed us to identify the 5′-end of the central exon and the intron just upstream of this exon (35). Our strategy is summarized in Fig. 1. Two overlapping clones were obtained using the RACE protocol: RACE66 and RACE76 of 600 and 200 bp, respectively. Then, using RT-PCR with the degenerate oligonucleotides and specific primers, we obtained three overlapping fragments, 105/76 (516 bp), 101/111 (1425 bp), and 366/114 (1156 bp). To extend the unknown nucleotide sequence encoding the amino-terminal portion of MUC5B, two further RACE-PCR experiments were performed, one using NAU433 and then NAU367 as nested primer (RACE367) and the other using NAU180 and then NAU324 as nested primer (RACE324). RT-PCRs using various specific primers were performed to confirm the sequences (91/57, 253/367, 253/286, 366/334, Fig. 1).
Characterization of the Genomic Fragments of the 5′-End Region of MUC5B-We first determined the partial restriction map of the genomic clone BEN1. It is shown in Fig. 2A, together with the overlapping fragments, which were subcloned into pBluescript II KS+ vector to facilitate sequence analysis. The 375-bp HindIII-KpnI fragment (in the right part of Fig. 2A) contains the 5′-end of the central exon (35), indicated with a bold line.
Transcription Initiation Site of MUC5B-Repeated RACE-PCR experiments to generate larger products using total RNA from trachea or gall bladder and the primers NAU324 and NAU346 yielded PCR products with identical sequences that were 157 and 116 bp long, respectively, and ended at the same nucleotide. Primer extension experiments were carried out using two primers, the oligonucleotides NAU346 located in exon 2 and NAU324 located in exon 3. The location of transcription initiation was determined by comparing the migration of the extension product with the sequencing ladder. After electrophoresis, one band for each extended primer was revealed (Fig. 3, for NAU346), supporting the view suggested by the RACE-PCR results that the start site is approximately 95 bp upstream of NAU346. The transcription start site, hereafter designated +1, was thus located 36 nucleotides upstream of the translation initiation codon ATG, which is embedded in a Kozak consensus sequence (38). The 5′-untranslated DNA sequence is CCTGTGGAGCCGAGCTGGGGGAATGCAGGGCACACC(ATG).
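The coordinates quoted in this section can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, assuming that the 5′-untranslated sequence printed above is complete and that each RACE product runs from the downstream end of its primer back to position +1. The variable and function names are ours, chosen for illustration only.

```python
# 5'-untranslated sequence as printed in the text (transcription start = +1).
UTR_5P = "CCTGTGGAGCCGAGCTGGGGGAATGCAGGGCACACC"

utr_length = len(UTR_5P)

# The initiator ATG follows the untranslated sequence directly, so its A is
# the next position. The position is derived from the length, not from a
# text search, because an ATG triplet may also occur by chance upstream.
atg_position = utr_length + 1

# cDNA positions covered by the two primers (Primer Extension section).
primer_span = {"NAU346": (93, 116), "NAU324": (133, 157)}


def product_length(first_position: int, last_position: int) -> int:
    """Length of a fragment covering both positions, inclusive."""
    return last_position - first_position + 1


# A product extending from the primer back to +1 is as long as the most
# downstream coordinate of the primer.
expected_product_bp = {
    name: product_length(1, end) for name, (_start, end) in primer_span.items()
}

print(utr_length)           # 36
print(atg_position)         # 37
print(expected_product_bp)  # {'NAU346': 116, 'NAU324': 157}
```

Under these assumptions the expected lengths match the 116- and 157-bp RACE products, and the ATG falls at +37.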
Sequencing Data and Genomic Organization of MUC5B-The 5′-region of human MUC5B encompasses 15,143 bp from the start site. Comparison between the genomic sequence and the cDNA sequence allowed us to determine the complete genomic organization of the region upstream of the large central exon of MUC5B. The 5′-region of MUC5B, depicted in Fig. 2B, is composed of 29 exons ranging in size from 44 to 262 bp (Table II). The first exon contains an untranslated region of 36 bp, the translation start site, and the first three amino acids. As is typical of secretory proteins, the sequence of MUC5B starts with a hydrophobic sequence, although the sequence analysis software did not identify it as a typical signal sequence. The 5′-region of MUC5B contains 29 introns ranging in size from 87 to 1703 bp. Table II summarizes the length and class of each intron.
The sequences at the exon/intron boundaries are highly conserved with respect to the canonical acceptor/donor sites (ag/gt) (39), except for intron 8, which starts with a gc instead of a gt dinucleotide (see Table II). This intron is flanked on either side by the perfect hexanucleotide sequence CAGGCA (boxed in Table II). The open reading frame encodes a 1283-amino acid peptide (Fig. 4) rich in cysteine (9.3%) and proline (6.5%). This region is relatively poor in serine and threonine (6.9 and 7.4%, respectively). Ten potential N-glycosylation sites (Asn-X-Ser/Thr; X, any amino acid except Pro) were found (shaded and in italic in Fig. 4) in the amino-terminal sequence of MUC5B.
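The sequon rule quoted above (Asn-X-Ser/Thr, X being any residue except Pro) translates directly into a pattern search. The sketch below is one possible way to scan a protein sequence for such sites. The function name and the toy peptide are hypothetical and are not taken from the MUC5B sequence of Fig. 4.

```python
import re
from typing import List

# Asn-X-Ser/Thr with X != Pro. The zero-width lookahead allows overlapping
# sequons (for example N-N-T-S) to be reported individually.
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_sequons(protein: str) -> List[int]:
    """Return the 1-based position of the Asn of each potential site."""
    return [match.start() + 1 for match in SEQUON.finditer(protein.upper())]


# Hypothetical toy peptide, NOT part of MUC5B.
toy_peptide = "MANGSCNPTAANNTS"
print(find_sequons(toy_peptide))  # [3, 12, 13]
```

In the toy peptide, Asn-7 is rejected because it is followed by Pro, whereas the overlapping sites at Asn-12 and Asn-13 are both reported.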
Modules of the Amino-terminal Region of MUC5B-Comparison of the deduced amino-terminal MUC5B peptide with pro-

DISCUSSION
The studies reported here present the amino-terminal sequence of MUC5B and the exon-intron organization of this gene, which is now completely known from the transcription start site to the end of the gene. An ATG codon is located at position +37. The sequence surrounding this ATG matches well the Kozak consensus sequence for predicted eukaryotic translation initiation sites (38). The results obtained using repeated RACE-PCR experiments and primer extension carried out on RNA samples from trachea and gall bladder confirmed that we have cloned the complete 5′-region cDNA of MUC5B.
The 29 introns have an average length of 388 bp, which is in good agreement with previous surveys (40). Of the 5′ splice donor sites, one (in intron 8) contains gc instead of the more commonly used gt. A gc dinucleotide instead of gt has been detected in a number of genes (41,42), and in vitro and in vivo studies of the effects of this alteration on mRNA processing have shown that it allows normal splicing.
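The average intron length follows from the sequence lengths reported in this paper. The short calculation below is a sketch under the assumption that the 3886-bp cDNA accounts for all exonic sequence within the 15,143-bp genomic region, so that the remainder is intronic. The variable names are illustrative.

```python
genomic_bp = 15_143  # genomic sequence of the 5'-region, from the start site
cdna_bp = 3_886      # cDNA sequence upstream of the large central exon
n_introns = 29

# Assumption: everything in the genomic region that is not in the cDNA
# is intronic sequence.
intronic_bp = genomic_bp - cdna_bp
mean_intron_bp = intronic_bp / n_introns

print(intronic_bp)            # 11257
print(round(mean_intron_bp))  # 388
```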
Comparison of the vWF gene with MUC5B shows that their intron positions are highly conserved (Fig. 4). Fourteen introns of the vWF gene (introns 4-7, 9, 14-16, 19, 20, 22, 24, 25, and 27) and 14 introns of MUC5B (introns 3, 5-7, 9, 14, 16, 17, 20, 21, 23-25, and 27) show conserved positions and identical phase classes (Fig. 4). The nucleotide sequences of the introns of the two genes do not share any similarity, showing that intron positions are much more highly conserved than intron lengths and sequences. The sequence overlapping the first four exons of MUC2 has been published recently (GenBank™ accession number U67167) (26). Our comparison between the genomic organizations of MUC2 and MUC5B shows that their intron positions and classes, downstream of the signal peptide, are highly conserved (Fig. 4), but intron lengths and sequences differ. Moreover, the tandemly repeated sequence of 38 bp (GGGTGGGTAGAGGCCCTCAGGCATGGGC(A/T)CGGGC(A/G)GGT) in intron 2 of MUC2 was not found in the corresponding intron of MUC5B (intron 3). The amino-terminal region of MUC5B shares high levels of sequence similarity with the NH2 termini of pro-vWF, MUC2, RMuc2, MUC5AC, PSM, and FIM-B.1. As shown in Fig. 4, nearly all the cysteine residues are conserved in the seven sequences compared. Of the 424 highly conserved amino acid residues, the two most frequent are Cys (106 residues, i.e. 25%), which is important in tertiary structure, and Gly (54 residues, i.e. 13%), which may have a role in secondary structure. Interestingly, MUC5AC and MUC5B are more similar to each other than either is to the other peptides.
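The counts and percentages given in this comparison can be verified mechanically. The sketch below expands the two intron lists and recomputes the shares of Cys and Gly among the 424 highly conserved residues. The helper name `expand` is ours, and the code makes no assumption about which vWF intron corresponds to which MUC5B intron.

```python
from typing import List


def expand(spec: str) -> List[int]:
    """Expand a list such as '4-7, 9' into [4, 5, 6, 7, 9]."""
    numbers: List[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = (int(value) for value in part.split("-"))
            numbers.extend(range(first, last + 1))
        else:
            numbers.append(int(part))
    return numbers


# Intron numbers as listed in the text.
vwf_introns = expand("4-7, 9, 14-16, 19, 20, 22, 24, 25, 27")
muc5b_introns = expand("3, 5-7, 9, 14, 16, 17, 20, 21, 23-25, 27")
print(len(vwf_introns), len(muc5b_introns))  # 14 14

# Share of Cys and Gly among the highly conserved residues.
conserved_total = 424
conserved_counts = {"Cys": 106, "Gly": 54}
share_percent = {
    residue: round(100 * count / conserved_total)
    for residue, count in conserved_counts.items()
}
print(share_percent)  # {'Cys': 25, 'Gly': 13}
```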
These observations are consistent with (i) the evolutionary model we proposed previously (43) for the three human mucin genes MUC2, MUC5B, and MUC5AC, according to which they may have evolved from a common ancestor, and (ii) the fact that the genes MUC2 (and its homologue RMuc2), MUC5AC, and MUC5B define a subclass of mucins characterized by the presence of subdomains called Cys subdomains (35,43), which interrupt their heavily glycosylated regions several times.
Similarities among the amino-terminal regions of the mucins and pro-vWF include short peptides (bold boxed in Fig. 4). The positions given below refer to MUC5B. The hexapeptide TCGLCG (aa 165-170) and the nonapeptide VCGLCGNFD (aa 982-990) contain the vicinal cysteine motif CGXC, which has been proposed to play a role in interchain disulfide bond formation (44). We can anticipate that the D-domains found in the amino-terminal region of MUC5B serve to form disulfide-linked multimers, as shown recently for PSM (45). There are several dibasic sites in MUC5B that could serve as cleavage sites for peptidases. Two of these sites are highly conserved in mammalian mucins (aa 966-967 and aa 1037-1038), and a third is conserved between mucins and pro-vWF (aa 80-81, with asterisks in Fig. 4).
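The CGXC motif in the two conserved peptides can be located with a simple pattern match. The sketch below assumes that the residue ranges quoted above are exact and derives the motif coordinates from them. These derived coordinates are our calculation, not values stated in the paper.

```python
import re

# Vicinal cysteine motif: Cys-Gly-X-Cys, X being any residue.
CGXC = re.compile(r"CG.C")

# Conserved peptides and their MUC5B residue ranges, as given in the text.
peptides = {"TCGLCG": (165, 170), "VCGLCGNFD": (982, 990)}

motif_positions = {}
for peptide, (first_aa, last_aa) in peptides.items():
    # Sanity check: the quoted range must match the peptide length.
    assert last_aa - first_aa + 1 == len(peptide)
    match = CGXC.search(peptide)
    motif_first = first_aa + match.start()
    motif_positions[peptide] = (match.group(), motif_first, motif_first + 3)

print(motif_positions)
# {'TCGLCG': ('CGLC', 166, 169), 'VCGLCGNFD': ('CGLC', 983, 986)}
```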
To conclude, we have cloned and sequenced the region upstream of the central part of MUC5B and defined its exon-intron organization. With the three overlapping clones, ELO9, BEN1, and BEN2 (27,28,35), we have cloned the entire human MUC5B gene. The peptide deduced from the sequence of the 48 exons (Fig. 5) of MUC5B is a 5662-amino acid polypeptide with an Mr of approximately 600,000.
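As a rough plausibility check of the quoted Mr, the polypeptide mass can be estimated from its length. The sketch below uses the common rule-of-thumb value of about 110 Da per amino acid residue. This value is an assumption, it is not taken from this paper, and it ignores both the actual amino acid composition and glycosylation.

```python
n_residues = 5662

# Rule-of-thumb average residue mass in daltons (an assumption, not a value
# from the paper; the true average depends on amino acid composition).
AVERAGE_RESIDUE_DA = 110

approx_mr = n_residues * AVERAGE_RESIDUE_DA
print(approx_mr)  # 622820
```

The estimate is of the same order as the Mr of approximately 600,000 given for the apomucin. A protein rich in small residues such as Ser, Thr, and Pro would be expected to fall somewhat below the rule-of-thumb figure.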