Molecular Cloning and Characterization of Lustrin A, a Matrix Protein from Shell and Pearl Nacre of Haliotis rufescens *

A specialized extracellular matrix of proteins and polysaccharides controls the morphology and packing of calcium carbonate crystals and becomes occluded within the mineralized composite during formation of the molluscan shell and pearl. We have cloned and characterized the cDNA coding for Lustrin A, a newly described matrix protein from the nacreous layer of the shell and pearl produced by the abalone, Haliotis rufescens, a marine gastropod mollusc. The full-length cDNA is 4,439 base pairs (bp) long and contains an open reading frame coding for 1,428 amino acids. The deduced amino acid sequence reveals a highly modular structure with a high proportion of Ser (16%), Pro (14%), Gly (13%), and Cys (9%). The protein contains ten highly conserved cysteine-rich domains interspersed by eight proline-rich domains; a glycine- and serine-rich domain lies between the two cysteine-rich domains nearest the C terminus, and these are followed by a basic domain and a C-terminal domain that is highly similar to known protease inhibitors. The glycine- and serine-rich domain and at least one of the proline-rich domains show sequence similarity to proteins of two extracellular matrix superfamilies (one of which also is involved in the mineralized matrices of bone, dentin, and avian eggshell). The arrangement of alternating cysteine-rich domains and proline-rich domains is strikingly similar to that found in frustulins, the proteins that are integral to the silicified cell wall of diatoms. Its modular structure suggests that Lustrin A is a multifunctional protein, whereas the occurrence of related sequences suggests it is a member of a multiprotein family.

The molluscan shell and pearl are mineralized structured composites of CaCO3 crystals and organic polymers exhibiting exceptional nanoscale regularity and strength (1-14). Nacre, the lustrous material of pearl and the inner "mother of pearl" layers of many shells, exhibits a fracture toughness ~3,000 times greater than that of the mineral alone (15,16). Although the organic components typically constitute only ~1% by weight of the biomineralized composite material (17), they are responsible for its organization and the resulting enhancement of fracture toughness (3-7, 18-21). Proteins represent the majority of extracellular organic polymers controlling biomineralization of the shell (8-10); they comprise at least four functional classes, including (i) a nucleating sheet that participates in control of nucleation of the first layer of oriented calcite in deposition of the abalone shell and flat pearl (13,14,22), (ii) a family of polyanionic proteins that can be extracted by demineralization of the shell (8-10, 17, 23-25) and have been shown in vitro to control the polymorph and atomic lattice orientation by cooperative interaction with the growing crystals (22,26,27), (iii) proteins of an insoluble, highly cross-linked matrix, forming an organizing network of interconnected compartments and fenestrated sheets that control the morphology and higher order packing of the CaCO3 crystals (8, 22, 28-32), and (iv) functional enzymes, such as carbonic anhydrase (33), that apparently contribute to the control of biomineralization.
Nacre consists of layers of interlocking thin tablets of aragonite (one of the polymorphs of CaCO3) typically 400 nm thick and 5-10 μm wide surrounded on all sides by thin sheets of organic matrix approximately 30 nm thick (7,29,30,34). The thickness of the mineral tablets apparently is determined by the matrix sheets (22), while their interdigitation is controlled, in part, by the stochastic location of nanopores in the stencil-like sheets through which the crystals grow from one layer to the next (32, 35-38). Electron microscopy, x-ray, and electron diffraction have shown that each sheet consists of a chitin-like core sandwiched between two layers of protein (12, 28-30). Diffraction data suggest that the protein sheets exhibit a β conformation similar to that of silk fibroin (28), a finding supported by the fact that glycine and alanine are the most abundant amino acids (8,9,29,30). A glycine- and alanine-rich matrix protein recently was cloned from the nacre-producing tissue of the bivalve, Pinctada fucata (39), and a matrix protein rich in glycine, lysine, and valine has been extracted and characterized from the abalone, Haliotis rufescens (40). To understand the molecular mechanisms governing self-assembly of the matrix and biomineralization of nacre, one of us (A. M. B.) has developed a procedure for extraction of other classes of matrix proteins from the nacre of H. rufescens shell and flat pearl with urea and SDS (31). Here we report the complete amino acid sequence of one of these proteins, which we have named Lustrin A. The results reveal a new class of shell matrix protein, with a repeating modular structure containing sequences similar to those of two extracellular matrix protein superfamilies known from other animal tissues.

MATERIALS AND METHODS
RNA Purification-Rapidly growing specimens (75-100 mm) of red abalone (H. rufescens), a gastropod mollusc, were obtained from The Cultured Abalone Inc. (Goleta, CA). Total RNA was isolated from the mantle, muscle, gill, and stomach tissues using TriZOL reagent (Life Technologies, Inc.) and the modified TRI Reagent procedure for RNA isolation (41). The mantle pallial cell layer (responsible for secretion of the organic and inorganic precursors of the shell) was extracted by vigorous shaking with TriZOL reagent; the underlying muscle was discarded. Other tissues were homogenized in TriZOL reagent (stomach contents were removed before the tissue was homogenized). Poly(A)+ RNA from mantle tissue was isolated using the Poly(A)Tract mRNA isolation system IV (Promega Corp.) with the following modifications: annealed oligo(dT)-mRNA hybrids were washed with 0.2× SSC (0.03 M sodium chloride, 0.003 M sodium citrate) instead of 0.1× SSC; the elution of mRNA was performed at 40°C instead of at room temperature.
RT-PCR¹ Amplification-Degenerate oligonucleotide primers for PCR amplification were synthesized to correspond to portions of the N-terminal sequence and an internal sequence determined for one of the purified matrix proteins (Table I) (31). Degenerate primer D1 (GARCCNGGNYTRAAYGT) encoding EPGLNV, a part of the N-terminal peptide, and D2 (CARCANACNCCNGGYTT) corresponding to the antisense strand of the sequence encoding KPGVCC, a part of the internal peptide, were obtained from Cruachem Inc. Mantle mRNA (200 ng) was used for RT primed with a poly(dT)12-18 primer; the reaction was catalyzed by Superscript II reverse transcriptase (Life Technologies, Inc.) in a 20-μl reaction. The resulting cDNA then was diluted to 200 μl with H2O, and 2 μl were used in the PCR reaction, which also included 2 μM of each primer (D1 and D2), 1× AmpliTaq DNA polymerase buffer (Perkin Elmer), 2 mM MgCl2, 200 μM dNTP, 5% (v/v) formamide, and 1 unit of AmpliTaq DNA polymerase (Perkin Elmer). PCR amplification (performed in a Cetus DNA Thermal Cycler from Perkin Elmer) conditions included an initial step at 94°C for 5 min followed by 40 cycles at 94°C for 1 min, 47°C for 1.5 min, and 72°C for 1.5 min. A final extension step of 72°C for 7 min was performed after the cycles. PCR reactions with only one degenerate primer (D1 or D2) were incubated in parallel as negative controls.
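The relationship between the degenerate primers and the peptides they target can be checked with a short script. The following is a minimal sketch, not part of the original methods: it expands the IUPAC codes in D1 and D2, counts the number of distinct oligonucleotides in each primer pool, and translates the complete codons using the standard genetic code. The function names are illustrative assumptions. Each 17-nt primer covers five complete codons plus the first two bases of a sixth (GT for Val in D1, TG for Cys in the reverse complement of D2), so only the first five residues are translated.

```python
from itertools import product

# IUPAC nucleotide codes that occur in primers D1 and D2.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "N": "ACGT"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A",
              "R": "Y", "Y": "R", "N": "N"}

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def degeneracy(primer):
    """Number of distinct sequences represented by a degenerate primer."""
    n = 1
    for base in primer:
        n *= len(IUPAC[base])
    return n


def reverse_complement(primer):
    """Reverse complement that preserves IUPAC ambiguity codes."""
    return "".join(COMPLEMENT[b] for b in reversed(primer))


def encoded_peptides(primer):
    """Peptides obtained by translating the complete codons of every expansion."""
    usable = len(primer) - len(primer) % 3  # drop the trailing partial codon
    peptides = set()
    for combo in product(*(IUPAC[b] for b in primer[:usable])):
        seq = "".join(combo)
        peptides.add("".join(CODON_TABLE[seq[i:i + 3]]
                             for i in range(0, usable, 3)))
    return peptides


D1 = "GARCCNGGNYTRAAYGT"  # sense strand, targets EPGLNV
D2 = "CARCANACNCCNGGYTT"  # antisense strand, targets KPGVCC

print(degeneracy(D1), encoded_peptides(D1))
print(degeneracy(D2), encoded_peptides(reverse_complement(D2)))
```

Under these assumptions every member of each primer pool encodes the same five-residue peptide, which is the property a degenerate primer design needs.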
Construction and Screening of Mantle cDNA Library-Approximately 5 μg of mantle poly(A)+ RNA, isolated from four animals, was used to construct the cDNA using the Stratagene ZAP Express cDNA synthesis protocol. Small cDNA molecules were removed by gel filtration over Sephacryl S-400. The cDNA then was inserted into the lambda ZAP Express vector and packaged with the Gigapack II Gold extract from Stratagene. The primary library contained 1.2 × 10^6 plaque-forming units (PFU) and was subsequently amplified once to yield a titer of 2 × 10^6 PFU/μl with a frequency of nonrecombinant lambda vectors below 0.1%.
The library was first screened by PCR amplification. Two gene-specific primers, GTCGTTGTGGAGTGCGT (G1) corresponding to nucleotides (nt) 65-81 of the 242-bp RT-PCR product (Fig. 1), and ACCTCGAACACACCCAG (G2) matching the antisense sequence of nt 200-216 of the same RT-PCR product, were obtained from Cruachem Inc. PCR screening reactions were carried out under the same conditions as described above with the following changes: no formamide was added to the reactions, the denaturation step was shortened to 30 s, annealing was done at 62°C for 1 min, and extension was performed for 45 s. The appearance of a 142-bp PCR product indicated the existence of a positive clone. Initially, approximately 5 × 10^5 PFU from the mantle cDNA library were plated onto ten 150-mm NZY plates at a density of 50,000 PFU per plate. Phage particles from each plate were eluted into 8 ml of SM buffer (0.1 M NaCl, 10 mM MgSO4, 50 mM Tris-HCl, pH 7.5, and 0.01% w/v gelatin), and 1 μl of each separate phage suspension was used in the PCR reaction. Ten PCR reactions were performed (one for each of the ten plates); nine generated the 142-bp product. Phage suspension from one of the nine positive plates was chosen randomly and diluted for secondary screening; the phage were plated onto twenty 100-mm NZY plates at a density of 2,500 PFU per plate. PCR reactions were performed for each of the 20 plates; four yielded the 142-bp product. Phage suspension from one of these four positive plates was again chosen randomly; further dilutions were made and PCR screenings were conducted sequentially as above. After four rounds of dilution and PCR screening, a single positive clone (designated clone 1) was isolated (Fig. 1). The 993-bp insert of clone 1 was labeled with digoxigenin-11-dUTP (DIG) using a Genius DNA random labeling kit (Boehringer Mannheim) and used to screen the library again.
Hybridization was carried out at 42°C in 50% formamide, 2% blocking reagent (Boehringer Mannheim), 5× SSC, 0.1% sarkosyl, and 0.02% SDS. Filters were washed in 0.5× SSC and 0.1% SDS at 65°C for 1 h. Probe binding was detected using an anti-DIG Fab conjugated to alkaline phosphatase, and colorimetric detection was done with nitro blue tetrazolium and 5-bromo-4-chloro-3-indolyl phosphate according to the protocol recommended by Boehringer Mannheim.
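The plate counts reported for the PCR screening rounds allow a rough estimate of how often a PCR-positive phage occurred in each pool. The sketch below is an illustration only and was not part of the original analysis: it assumes that positive phage are distributed among plates according to a Poisson distribution, so that the fraction of negative plates equals exp(-m), where m is the mean number of positives per plate. The function name is hypothetical.

```python
import math


def positives_per_pfu(positive_plates, total_plates, pfu_per_plate):
    """Estimate the frequency of positive phage from the fraction of negative plates.

    Assumes a Poisson distribution of positives among plates:
    P(plate is negative) = exp(-m), with m = mean positives per plate.
    """
    negative_fraction = 1 - positive_plates / total_plates
    if negative_fraction <= 0:
        raise ValueError("all plates positive: the frequency cannot be estimated")
    mean_per_plate = -math.log(negative_fraction)
    return mean_per_plate / pfu_per_plate


# Primary screen: 9 of 10 plates positive at 50,000 PFU per plate.
primary = positives_per_pfu(9, 10, 50_000)
# Secondary screen: 4 of 20 plates positive at 2,500 PFU per plate.
secondary = positives_per_pfu(4, 20, 2_500)

print(f"primary:   {primary:.2e} positives per PFU")
print(f"secondary: {secondary:.2e} positives per PFU")
```

With only ten or twenty plates per round these figures carry wide uncertainty; the calculation mainly shows why further rounds of dilution were needed before a single positive clone could be isolated.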
Isolation of 5′-End of cDNA by PCR-Reverse transcription was carried out as described above using gene-specific primer G2. The sequence of G2 matches that of the antisense strand of nt 274-290 of the longest clone (clone 5) isolated from the library (Fig. 1). A second gene-specific primer TCGAACACACCCAGTGGTTG (G3), corresponding to the antisense sequence of nt 268-287 of clone 5, was used in the PCR reaction. The conditions for tailing and PCR reaction were as described previously (42).
DNA Sequencing and Sequence Analyses-Inserts of the clones were excised in vivo into the pBK-CMV phagemid vector. PCR products were subcloned into the pBluescript vector. They were sequenced by the dideoxynucleotide chain termination method (43) using Sequenase Version 2.0 (U. S. Biochemical Corp.). Oligonucleotide primers were synthesized (Operon Technologies, Inc.) to permit sequencing of both strands of the entire clone. Compressions resulting from secondary structures in the DNA were eliminated by replacing dGTP with dITP. Sequence data were analyzed with the Wisconsin Sequence Analysis Package by Genetics Computer Group, Inc.

RESULTS
Isolation of Lustrin A cDNA Clones-The first Lustrin A cDNA clone was isolated using the RT-PCR strategy. Two degenerate primers, D1 and D2, were designed corresponding to the N terminus and an internal peptide sequence of Lustrin A (see "Materials and Methods"). A prominent 242-bp product was generated from mantle mRNA using these primers (Fig. 1). Nucleotide sequences at the two ends of this fragment matched those of D1 and D2. The deduced amino acid sequences at the N and C termini of the amplified fragment were (EPGLNV)NCTT and PPA(KPGVCC), respectively. EPGLNV and KPGVCC were used to design D1 and D2 and therefore were present as expected. NCTT immediately following EPGLNV and PPA immediately preceding KPGVCC were perfect matches to the N-terminal and the internal peptide sequences of Lustrin A (Table I) (31), thus confirming the correspondence of the amplified 242-bp fragment with part of the sequence encoding Lustrin A. Two gene-specific primers, G1 and G2, were designed based on the sequence obtained from this fragment (see "Materials and Methods").
To obtain a full-length cDNA clone of Lustrin A, a cDNA library was constructed starting from mRNA extracted from the mantle pallial cells that secrete the shell precursors. This library was screened by PCR using the gene-specific primers G1 and G2 as described under "Materials and Methods." One positive clone (clone 1) was isolated; its insert of 993 bp included the original 242-bp RT-PCR fragment, and the deduced polypeptide sequence contained a perfect match for a second internal peptide sequence of Lustrin A (Fig. 1). Although this result verified that clone 1 encodes Lustrin A, it apparently does not contain the full-length cDNA as it lacks both the start and stop codons.
Rescreening the mantle cDNA library with a DIG-labeled insert from clone 1 resulted in the identification of 15 positive clones from approximately one million clones. DNA was extracted from all 15 clones, and restriction mapping analyses were performed. Based on the resulting restriction patterns, the 15 clones can be classified into two groups. One group contained four positive clones for Lustrin A. The longest clone (designated clone 5) had an insert of 4,407 base pairs (Fig. 1). The other group contained 11 clones that exhibit restriction patterns different from that of Lustrin A. The partial nucleotide sequence of clone 7 in the second group was found to encode a polypeptide almost identical to Lustrin A with the exception of a few single amino acid substitutions (Fig. 2) and deletions (data not shown) in the region sequenced. 5′-Rapid amplification of cDNA ends was used to obtain the start codon and its flanking sequence (see "Materials and Methods"). This yielded a 319-bp fragment that provided the first 32 nt of the Lustrin A cDNA (Fig. 1).
Northern Blot Analysis-Northern analysis revealed that Lustrin A mRNA is expressed specifically by cells of the mantle epithelium (Fig. 3). Three probes, representing the 5′-, middle, and 3′-regions of the full-length Lustrin A cDNA, were used: nt 33-1699, nt 3333-3824, and nt 3824-4439. The first of these probes hybridized to two RNA species, 4.7 kilobases (kb) and 5.5 kb in size, respectively, in the RNA isolated from the mantle pallial tissue but not in the RNA from muscle, gill, or stomach tissues. Identical results were obtained when the other two probes were used (data not shown).
Nucleotide Sequence and Deduced Amino Acid Sequence-Sequence analysis of the 4,439-bp cDNA of Lustrin A revealed an open reading frame encoding 1,428 amino acids with the translation initiation codon ATG at nucleotide position 26 (Fig. 1B). At position −3 from this initiation codon there exists an adenine nucleotide, and at position +4 there is a guanine nucleotide, representative of a Kozak initiation sequence (44). The first 19 amino acids apparently comprise a signal peptide. It has a hydrophobic core of 11 residues flanked by relatively hydrophilic residues, including a basic arginine residue near the N terminus, similar to known signal peptides found in proteins that are processed in the endoplasmic reticulum and subsequently secreted from the cell. A glycine residue found before the expected cleavage site is consistent with the rule of von Heijne (45). An in-frame stop codon is located at nucleotide position 4309, with a putative polyadenylation signal AATAAT located 13 nucleotides downstream from the stop codon and 10 nucleotides upstream from the poly(A) tail. It thus is very likely that this cDNA represents the full-length copy of the Lustrin A mRNA coding region.
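The two positions examined around the initiation codon can be read off programmatically. The sketch below is illustrative only: it uses the usual numbering in which the A of ATG is +1 and the base immediately upstream is −1, and it is demonstrated on a short made-up sequence, not on the Lustrin A cDNA. The function name and the example sequence are assumptions.

```python
def kozak_context(cdna, atg_position):
    """Report the bases at -3 and +4 relative to an initiation codon.

    atg_position is the 1-based position of the A of ATG (A = +1, so the
    base just upstream is -1 and the base just after ATG is +4).
    """
    i = atg_position - 1  # convert to a 0-based index
    if cdna[i:i + 3] != "ATG":
        raise ValueError("no ATG at the given position")
    if i < 3 or i + 3 >= len(cdna):
        raise ValueError("sequence too short to read positions -3 and +4")
    minus3, plus4 = cdna[i - 3], cdna[i + 3]
    return {
        "minus3": minus3,
        "plus4": plus4,
        "purine_at_minus3": minus3 in "AG",
        "g_at_plus4": plus4 == "G",
    }


# Hypothetical example sequence (not taken from the Lustrin A cDNA).
example = "GCCACCATGGCT"
print(kozak_context(example, atg_position=7))
```

For the Lustrin A cDNA the same call would use `atg_position=26`; the text reports an adenine at −3 and a guanine at +4.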
Modular Structure of Lustrin A-Based on the predicted amino acid sequence, Lustrin A is rich in Ser (16%), Pro (14%), Gly (13%), and Cys (9%) residues (Table II) and contains 15 potential N-glycosylation sites. The calculated mass for Lustrin A before any post-translational modifications is 116 kDa.
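A composition table and a calculated mass of this kind are obtained by summing residue masses over the deduced sequence. The following is a minimal sketch under stated assumptions: it uses average residue masses from standard tables, adds one water for the free termini, and ignores the signal peptide and all modifications. The function names are hypothetical, and the demonstration peptide is the short N-terminal fragment quoted in the text rather than the full sequence.

```python
from collections import Counter

# Average residue masses in daltons (standard tabulated values).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153  # added once for the free N and C termini


def composition_percent(seq):
    """Percentage of each residue type in a one-letter amino acid sequence."""
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in counts.items()}


def average_mass(seq):
    """Average mass (Da) of an unmodified polypeptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


fragment = "EPGLNVNCTT"  # N-terminal peptide fragment quoted in the text
print(composition_percent(fragment))
print(round(average_mass(fragment), 1))

# Mean residue mass implied by the reported values (116 kDa, 1,428 residues).
print(round(116_000 / 1428, 1))
```

The implied mean residue mass of about 81 Da is low for a protein, which is consistent with the high proportion of the small residues glycine and serine.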
The deduced amino acid sequence of Lustrin A reveals that it has a modular structure (Fig. 4). Ten cysteine-rich repeats (C1-C10) were identified in Lustrin A. Each repeat consists of 75 to 88 amino acid residues, of which 12 are cysteine. The last two cysteine residues in each repeat are arranged as a Cys-Cys segment, while other cysteine residues are spaced by stretches of 4 to 15 residues. The 10 cysteine-rich repeats share a high degree of sequence identity (45-90%) with each other (Fig. 5).
Cys-X-Cys and Cys-X-X-Cys segments, which are frequently found in heavy metal-binding proteins, are not present in these repeats. These repeats also show no sequence similarity to any of the known calcium-binding motifs in the PROSITE data bank. Near the N terminus of each repeat there is a conserved NCT sequence containing a potential site for N-glycosylation. The regularity and high conservation of its position in each of the cysteine-rich domains suggest a possible functional role. A second potential N-glycosylation site also is present near the center of C1, C2, C4, and C6 (Fig. 5).
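Searches of this kind amount to simple pattern scans over the deduced sequence. The sketch below is an illustration, not the procedure used in the study: it assumes the conventional N-glycosylation sequon Asn-X-Ser/Thr with X not proline (a simplification of the PROSITE pattern), and it uses lookahead so that overlapping matches are counted. The names are hypothetical; the test peptides are the short fragments quoted earlier in the text.

```python
import re

# Lookahead patterns so that overlapping occurrences are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")  # Asn-X-Ser/Thr, X is not Pro (assumed rule)
CYS_X_CYS = re.compile(r"(?=C.C)")     # Cys-X-Cys
CYS_XX_CYS = re.compile(r"(?=C..C)")   # Cys-X-X-Cys


def motif_positions(pattern, seq):
    """1-based start positions of every (possibly overlapping) match."""
    return [m.start() + 1 for m in pattern.finditer(seq)]


n_terminal = "EPGLNVNCTT"  # fragment containing the conserved NCT sequence
internal = "PPAKPGVCC"     # fragment ending in the Cys-Cys segment

print(motif_positions(SEQUON, n_terminal))
print(motif_positions(CYS_X_CYS, internal), motif_positions(CYS_XX_CYS, internal))
```

In the N-terminal fragment only the asparagine of NCT satisfies the sequon; the preceding asparagine (in NVN) does not, because the third residue is neither serine nor threonine.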
The first nine cysteine-rich repeats are connected by eight proline-rich domains (P1-P8). The amino acid sequences of these proline-rich domains are shown in Fig. 6A. These domains are 17 to 30 amino acids in length and contain a high proportion of proline residues, most frequently arranged in Pro-Pro, Pro-X-Pro, and Pro-X-X-Pro segments. No sequence homology has been observed among the proline-rich domains. However, the sequence of P1 shows some similarity to that of collagen I from various organisms including mammals and insects (46,47), particularly in the PPGPP that is repeated 3 times in the P1 domain (Fig. 6B). Amino acid analysis of Lustrin A did not detect any hydroxyproline, although it was specifically assayed (31); similarly, the absence of regular third-position glycines in P1 further shows that this is not a triple-helical collagen domain.
The two cysteine-rich domains nearest to the C terminus, C9 and C10, are connected by a large glycine- and serine-rich domain. Of the 272 residues in this domain, 250 are either glycine or serine. Large domains rich in glycine and serine residues have been found in human and mouse cornified envelope protein loricrin (48,49) and the extracellular matrix protein keratin (50). The GS domain in Lustrin A shares 48% identity with human loricrin, 46% with mouse loricrin, and 47% with a corresponding domain (163 amino acids in size) of human type I cytokeratin 9. Lower correspondence with the silk proteins, sericin (44%) (51) and fibroin heavy chain (40% in 175-amino acid overlap) (52), also was observed.
The cysteine-rich domain C10 is followed by a stretch of 30 amino acids rich in basic residues. Six arginine and four lysine residues are located in this domain. The side chains of these residues are very likely to be positively charged at the pH (~7.4) of the extracellular (extrapallial) space in which matrix assembly and shell mineralization occur (53).
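The two criteria used here, the spacing of prolines and the presence or absence of a regular Gly-X-Y repeat, can be expressed as small sequence checks. The sketch below is illustrative only. The function names are hypothetical, and the demonstration strings are made-up sequences chosen to contain PPGPP three times; they are not the P1 sequence, which is given in Fig. 6 and is not reproduced in the text.

```python
import re


def count_overlapping(pattern, seq):
    """Count matches of a regex pattern, including overlapping ones."""
    return len(re.findall(f"(?={pattern})", seq))


def proline_motifs(seq):
    """Counts of Pro-Pro, Pro-X-Pro, and Pro-X-X-Pro segments (X may be any residue)."""
    return {
        "PP": count_overlapping("PP", seq),
        "PxP": count_overlapping("P.P", seq),
        "PxxP": count_overlapping("P..P", seq),
    }


def has_gly_x_y_repeat(seq, min_triplets=3):
    """True if glycine occupies every third position over min_triplets consecutive triplets."""
    return re.search(r"(?:G..){%d,}" % min_triplets, seq) is not None


# Hypothetical sequences for demonstration (not the P1 domain).
irregular = "PPGPPSAPPGPPTPPGPP"  # PPGPP three times, glycines not in register
collagen_like = "GPPGPPGPP"       # regular Gly-X-Y repeat

print(proline_motifs("PPGPP"))
print(has_gly_x_y_repeat(irregular), has_gly_x_y_repeat(collagen_like))
```

A sequence can therefore contain several PPGPP segments and still fail the third-position glycine test, which is the distinction drawn for P1.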
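The expectation that these side chains are charged at pH ~7.4 follows from the Henderson-Hasselbalch relation. The sketch below is a worked estimate, not a calculation from the paper: the side-chain pKa values (10.5 for lysine, 12.5 for arginine) are typical textbook values for free residues and are assumptions, since actual values depend on the local environment in the folded protein. Acidic residues and tyrosine are ignored.

```python
# Assumed side-chain pKa values for free amino acids (textbook approximations).
ASSUMED_PKA = {"K": 10.5, "R": 12.5}


def protonated_fraction(pka, ph):
    """Fraction of a basic group carrying its proton (Henderson-Hasselbalch)."""
    return 1.0 / (1.0 + 10 ** (ph - pka))


def basic_side_chain_charge(n_lys, n_arg, ph):
    """Expected positive charge contributed by lysine and arginine side chains."""
    return (n_lys * protonated_fraction(ASSUMED_PKA["K"], ph)
            + n_arg * protonated_fraction(ASSUMED_PKA["R"], ph))


# Basic domain of Lustrin A: four lysine and six arginine residues, pH ~7.4.
charge = basic_side_chain_charge(n_lys=4, n_arg=6, ph=7.4)
print(f"{charge:.3f}")
```

Under these assumptions essentially all ten basic side chains are protonated, giving the 30-residue stretch a contribution of close to +10 from these residues.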
The C-terminal domain of Lustrin A is 45 residues long and exhibits strong sequence similarity to a protein family that includes several protease inhibitors: red sea turtle basic protease inhibitor (also known as chelonianin) (54), human and pig antileukoproteinase (55,56), the human elastase-specific inhibitor elafin (57), and several other extracellular proteins: rat, mouse, and camel whey acidic protein (58-60), rat WDNM1 protein (61), and human and chicken Kallmann syndrome protein (62-64). Proteins in this family are distinguished by an array of eight cysteine residues in their sequence known as the "four-disulfide core" motif (65). The partial alignment of some of these proteins with the C-terminal domain of Lustrin A is shown in Fig. 7. Although the function of the whey acidic protein is not known, it has been hypothesized that the WDNM1 protein and the Kallmann syndrome protein have protease inhibiting activity (61-64).

TABLE I (legend fragment): [...] 31. Standard amino acid abbreviations; X = unidentified residues. Position numbers refer to positions in the translation product, which includes the signal peptide (Fig. 1B). Sequences in parentheses were used to design degenerate oligonucleotide primers D1 and D2 for RT-PCR.

DISCUSSION
We have isolated and characterized the cDNA coding for Lustrin A, a nacre matrix protein with a unique modular structure. Northern blot analyses verify that Lustrin A mRNA is synthesized specifically in the mantle pallial cells that are responsible for secreting shell proteins (Fig. 3). The Lustrin A cDNA clone we isolated is 4,439 bp long, only ~300 bp shorter than the 4.7-kb transcript observed in the Northern hybridization. This difference in size is readily attributable to the absence of some of the 5′-untranslated region and/or the poly(A) tail from the cDNA clone.
The detection of two mRNA species with Lustrin A-specific probes indicates either that another gene closely related to Lustrin A also is expressed in the mantle or that the Lustrin A gene can be alternatively spliced to yield two transcripts. In support of this suggestion, we found that clone 7 encodes a polypeptide almost identical to Lustrin A (Fig. 2); this may be the product of another closely related gene or a splicing or allelic variant of Lustrin A.

FIG. 2. Comparison of partial amino acid sequences deduced from nucleotide sequences of clones 5 and 7. Both clones were identified by library screening using DIG-labeled clone 1 insert. Clone 5 codes for Lustrin A, and clone 7 codes for a polypeptide closely related to Lustrin A (see "Results"). Dots in the clone 7 sequence represent amino acid residues identical to those in clone 5. Numbers on top of the clone 5 sequence indicate positions of the amino acid residues in the Lustrin A sequence.

FIG. 3. Tissue-specific expression of Lustrin A. Total RNA (25 μg) isolated from abalone mantle, muscle, gill, and stomach was subjected to Northern blot analysis (see "Materials and Methods"). Nt 33-1699 of Lustrin A cDNA was used as a probe. Two transcripts (4.7 kb and 5.5 kb in size) were detected in RNA from the mantle tissue. No signal was detected in RNA from other tissues. Identical results were obtained when two other probes, nt 3333-3824 and nt 3824-4439, were used (data not shown). RNA gel was stained with ethidium bromide before blotting to verify that equal amounts of RNA from different tissues were loaded and no significant degradation of RNA had occurred.

The calculated mass for Lustrin A before any post-translational modifications is 116 kDa. However, the polypeptide chain extracted from the nacre matrix and used for amino acid sequence analysis is only 65 kDa (31). This discrepancy in size could result from one or more of the following possibilities: (i) Lustrin A may have been fragmented during extraction from its covalently cross-linked network in the matrix (31), (ii) posttranscriptional and/or post-translational modification may reduce the size of the protein, and (iii) Lustrin A and the 65-kDa polypeptide may be encoded by two different but closely related genes. We are producing antibodies against Lustrin A (the protein defined by the sequence of the full-length cDNA cloned and characterized in the work presented here) to investigate these possibilities.
The characteristic feature of Lustrin A is its modular structure, consisting of: (i) ten highly conserved cysteine-rich domains, (ii) eight proline-rich domains, (iii) a glycine- and serine-rich domain, (iv) a basic domain, and (v) a C-terminal domain with marked sequence similarity to known proteinase inhibitors.
There are 10 cysteine-rich repeats in Lustrin A. The high degree of sequence identity among them suggests that they are likely to undergo similar folding. The high frequency and complete positional conservation of the cysteine residues within these repeats suggest that these domains may have globular structures stabilized by intradomain disulfide bonds, although the possibility of some interchain disulfide bonds cannot be excluded. Proline residues also are abundant in the cysteine-rich repeats; they are scattered throughout the entire repeat, and their positions are often conserved. Because of the known α-helix- and β-sheet-disrupting effect of proline residues (66), it is not surprising that Chou-Fasman calculations (67) predict that the cysteine-rich repeats contain no extended α-helices or β-sheets. A search for heavy metal- or calcium-binding motifs in these repeats was unfruitful. We hypothesize that the main function of the cysteine-rich repeats is in protein-protein interactions governing the complex self-assembly of the multicomponent matrix that in turn helps control the mineralization of nacre. Our finding that the yield of Lustrins is increased by the inclusion of thiol reagents in the extraction medium² supports our hypothesis that Lustrins may be covalently linked via disulfide bonds with one another or with other proteins of the nacre matrix.
The proline-rich domains, although sharing no sequence homology, are similar in that they all have high proline content. Prolines in these domains generally are located in tandem or at every other one or two positions (Fig. 6A). This density of proline residues is predicted to force these domains to adopt extended structures, most likely in the form of "polyproline II helixes" (extended structures of three residues per turn) (66). Such proline-rich regions often are found to be involved in binding, serving as "sticking arms" (66) by virtue of the exposure of the side chains of other residues for interaction with other molecules, but the lack of homology among the nonproline portions of the eight different domains may make this role in Lustrin A less likely. Because the proline-rich domains all are located between the cysteine-rich domains and will have extended, rod-like structures, their primary function may be to serve as spacers separating the cysteine-rich domains so these can fold independently. One of the proline-rich domains, P1, shows some similarity to collagen I (Fig. 6B). Collagen is the major matrix protein in several other biomineralized extracellular materials such as dentin, bone (68,69), and avian eggshell (70,71). The data shown here demonstrate for the first time the existence of a sequence with some collagen-like features in the abalone shell nacre matrix, but the absence of hydroxyproline and of the regular third-position glycines characteristic of collagen limits this similarity.
The cysteine-rich module and the proline-rich module are arranged in tandem and repeated nine and eight times, respectively, in the N-terminal two-thirds of Lustrin A (Fig. 4). A similar arrangement of cysteine-rich modules and proline-rich modules (with very different sequences) has been observed in frustulins (72,73), a family of glycoproteins intimately associated with diatom cell walls. Interestingly, this arrangement is the only similarity shared by these two families of proteins, which differ in almost all other aspects of their structures. The sizes and numbers of their cysteine-rich repeats, the number of cysteine residues in each repeat, the spacing between the cysteine residues, and the hydroxylation of prolines in the proline-rich domains all are dissimilar, as are their sequences. However, it is tempting to speculate that the similar modular arrangement of the biomineralization matrix-like proteins from the calcium carbonate shell of an animal and the silica "shell" of a unicellular microalga may reflect the convergent evolution of diblock copolymer-like protein domains that might serve related functions.
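The spacer interpretation can be made concrete by estimating how long a fully extended proline-rich domain would be. The sketch below is a rough estimate only: it assumes an ideal polyproline II helix with an axial rise of about 3.1 Å per residue, a commonly quoted textbook value that does not come from this paper, and it treats the whole 17- to 30-residue domain as adopting that conformation. The names are hypothetical.

```python
# Assumed geometry of an ideal polyproline II helix (textbook approximation).
PPII_RISE_ANGSTROM = 3.1  # axial rise per residue
PPII_RESIDUES_PER_TURN = 3


def ppii_length(n_residues):
    """End-to-end length (Å) of n residues in an ideal polyproline II helix."""
    return n_residues * PPII_RISE_ANGSTROM


def ppii_turns(n_residues):
    """Number of helical turns spanned by n residues."""
    return n_residues / PPII_RESIDUES_PER_TURN


# Proline-rich domains P1-P8 are reported to be 17 to 30 residues long.
for n in (17, 30):
    print(f"{n} residues: {ppii_length(n):.1f} Å, {ppii_turns(n):.1f} turns")
```

On these assumptions the spacers would span roughly 5 to 9 nm, which gives a sense of the separation they could impose between adjacent cysteine-rich domains.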
The glycine- and serine-rich region is the largest discrete structural domain in Lustrin A. It contains 85 glycine and 165 serine residues (for a total of 250 residues out of 272), most in (GS)x and (GSSS)y repeats. Analyses of secondary structure using the Chou and Fasman method (67) indicate that this domain has no α-helix or β-sheet content but is rich in turns. Flexibility indices indicate a high degree of flexibility attributable to the large number of glycine residues. Six aromatic residues, including four tryptophans and two phenylalanines, are located in this domain, with the phenylalanine in tandem with the tryptophan. It is well known that the aromatic side chains have a tendency to stack through alignment of their phenyl rings at a preferred distance of 4.5-7 Å parallel (74) or perpendicular (75) to one another. This association of the tryptophan and phenylalanine side chains thus will force the (GS)x and (GSSS)y repeats to form "glycine loops" (76), although in Lustrin A serine rather than glycine will be the predominant residue in the loops. Because the interactions between the phenyl rings are weak, the glycine loops can be reversibly opened when a stretching force is applied. Such rubber-like glycine loops pre- [...] enabling it to function as an extensor molecule.

[Figure legend fragment] Boxed sequence (P1) shows a limited sequence similarity to collagen I (see Fig. 6B and discussion of limitations in text). The glycine- and serine-rich domain (GS domain) also is boxed. Aromatic residues in the GS domain are shaded. Basic residues in the basic domain are in bold. Lysine, asparagine, and tyrosine residues in the basic domain are underlined. Highlighted C-terminal domain shares homology with proteins in the four-disulfide core family (see Fig. 7). B, schematic representation of the modular structure of Lustrin A.

The strikingly arginine- and lysine-rich basic domain in Lustrin A suggests that this region would have the capacity to interact with the anionic molecules during formation of the shell. Three classes of anions are involved (or potentially involved) in this mineralization: (i) the CO3^2− that participates directly in formation of the CaCO3 crystal lattices, (ii) the polyanionic aragonitic proteins shown to be responsible for cooperative control of crystal polymorph selection and atomic lattice orientation (22,26,27), and (iii) the anionic groups of sugars in the matrix polysaccharides and glycoproteins, potentially including carboxylates, uronates, sulfates, etc. Additionally, of course, this domain also has the capacity to participate in protein-protein cross-linking by virtue of its high lysine, tyrosine, and glutamine content (77,78). There are four lysine, three tyrosine, and two glutamine residues in this domain. Lysine and glutamine residues can form Nε-(γ-glutamyl)lysine isopeptide bonds, tyrosine residues can form diphenyl linkages, and the lysine side chains can be modified to participate in intermolecular cross-linking via Schiff's base conjugates. While only speculative at this point, it is interesting to find potential anchoring sites for the polyanionic aragonite-determining proteins in this domain.
The presence of a C-terminal protease inhibitor-like domain in Lustrin A suggests that if this domain is functional the protein also may have the capacity to bind proteases or other proteins with protease-like structures. Covalent linkage of the N terminus of this domain to the polypeptide chain preceding it is unlikely to affect this binding capacity, based on the following observations. First, proteins of the four-disulfide core family analyzed thus far all have similar structures due to their four intramolecular disulfide bonds. Crystal structures of the elastase-specific inhibitor elafin (79) and human seminal plasma inhibitor (also called human antileukoproteinase) (80), and the solution structure of Na+,K+-ATPase inhibitor SPAI-1 revealed by NMR (81) all exhibit a stretched planar spiral shape. Because the C-terminal domain of Lustrin A contains the eight cysteine residues at the canonical positions, it may exhibit a similar structure. Second, the N-terminal region of elafin is flexible, located on the opposite side of the active site loop and not essential for elastase inhibitory activity (79). Third, two protease inhibitor domains are present in human antileukoproteinase. The N-terminal domain binds trypsin, and the C-terminal domain binds chymotrypsin. The covalent linking of the second domain to the first does not affect its structure or function (80). It is possible that the C-terminal protease inhibitor-like domain in Lustrin A functions as an active protease inhibitor, protecting the secreted proteins of the shell-forming matrix and machinery from degradation. Incorporation of protease inhibitors into an outer protective layer seems to be a common scheme used by a variety of organisms. Thus, for example, elafin (a serine protease inhibitor) and cystatin α (a cysteine protease inhibitor) are found cross-linked to loricrin and other cornified cell envelope proteins in the keratinocytes during terminal differentiation in mammalian epidermis (82,83).
It is not surprising to find that Lustrin A shows partial homology to several extracellular matrix proteins, as the molluscan shell is a specialized extracellular material produced from epithelial secretions. It is interesting to note, however, that Lustrin A combines several structural motifs into one molecule: the alternating cysteine-rich module and proline-rich module design seen also (although to a lesser extent) in frustulins, the glycine loop motif of loricrins and keratins, the four-disulfide core motif of several protease inhibitors, and a basic domain. Such a combination is unique and suggests that Lustrin A is a multifunctional protein. In addition to its structural role in the insoluble nacre matrix framework, Lustrin A also may play important roles interacting with the polyanionic aragonite-determining proteins, protecting the protein components of the matrix from degradation, and conferring elastic resiliency to the high performance microlaminate composite of the molluscan shell.