Cloning and Characterization of a Novel Mouse Siglec, mSiglec-F: DIFFERENTIAL EVOLUTION OF THE MOUSE AND HUMAN (CD33) Siglec-3-RELATED GENE CLUSTERS

A novel mouse Siglec (mSiglec-F) belonging to the subfamily of Siglec-3-related Siglecs has been cloned and characterized. Unlike most human Siglec-3 (hSiglec-3)-related Siglecs with promiscuous linkage specificity, mSiglec-F shows a strong preference for α2–3-linked sialic acids. It is predominantly expressed in immature cells of the myelomonocytic lineage and in a subset of CD11b (Mac-1)-positive cells in some tissues. As with previously cloned Siglec-3-related mSiglecs, the lack of strong sequence similarity to a singular hSiglec made identification of the human ortholog difficult. We therefore conducted a comprehensive comparison of Siglecs between the human and mouse genomes. The mouse genome contains eight Siglec genes, whereas the human genome contains 11 Siglec genes and a Siglec-like gene. Although a one-to-one orthologous correspondence between human and mouse Siglecs 1, 2, and 4 is confirmed, the Siglec-3-related Siglecs showed marked differences between human and mouse. We found only four Siglec genes and two pseudogenes in the mouse chromosome 7 region syntenic to the Siglec-3-related

Siglecs (sialic acid-binding immunoglobulin superfamily lectins) are a family of cell surface proteins defined by certain shared structural motifs in the first two Ig-like domains and by their ability to recognize sialic acids via the first Ig V-set domain (1-4). Each Siglec has a distinct expression pattern (3,4), suggesting that these molecules play unique roles in the cells expressing them. In humans, ten canonical Siglecs (3-5) and one Siglec-like molecule (6,7) have been reported so far, and our analysis of human genome sequences suggests that there is only one additional Siglec. In view of the usefulness of mice for functional gene analysis, especially by gene disruption, it is highly desirable to identify and analyze mouse orthologs of human Siglecs (hSiglecs). Mouse Siglecs (mSiglecs) orthologous to hSiglec-1/Sialoadhesin, hSiglec-2/CD22, and hSiglec-4/myelin-associated glycoprotein (MAG) have been well characterized, and phenotypes of gene disruption have been described for the latter two (8-12). Siglec-3-related Siglecs, which include all other members not listed above, are characterized by their sequence similarity in the first two Ig-like domains and by the presence of characteristic tyrosine-based signaling motifs in their cytosolic domains: an immunoreceptor tyrosine-based inhibitory motif (ITIM) and another putative signaling motif situated nearby (3,4). Most of these Siglec-3-related human genes are clustered together in a region of ~500 kilobases (kb) on chromosome 19q (cytological band 19q13.3-13.4), suggesting their origin through repeated gene duplications (3,4). The Siglec-3-related gene cluster in the human genome contains seven functional Siglec genes (Siglecs 3 and 5 through 10), one Siglec-like gene (Siglec-L1/S2V), and at least 13 Siglec-like pseudogenes (6). On the other hand, only two Siglec-3-related Siglecs have been reported so far in the mouse, i.e. mSiglec-"3"/CD33 (13) and mSiglec-E/MIS (14,15).
However, neither of these shows a clear one-to-one orthologous correspondence to hSiglecs (14,16). Although it is likely that these mSiglec-3-related Siglecs, as well as ones yet to be reported, are present in a similar gene cluster in mouse chromosome 7 (17), there has so far been no report of a comprehensive
The on-line version of this article (available at www.jbc.org) contains Tables I and II.
The nucleotide sequence(s) reported in this paper has been submitted to the GenBank/EBI Data Bank with accession number(s) AF293371.
Here we report the molecular cloning and characterization of a novel mSiglec, as well as a genomic overview of Siglec-3-related gene clusters on mouse chromosome 7 and human chromosome 19. Through multiple approaches, the human and mouse Siglec orthology issue is addressed. Our analyses provide evidence of complex and differential evolutionary paths that CD33-related Siglecs have taken in the rodent and primate lineages.

EXPERIMENTAL PROCEDURES
Materials-Reagents were purchased from Fisher or Sigma, unless otherwise stated. Access rights to the assembled mouse genomic DNA sequence were purchased from Celera Genomics (Rockville, MD), under an agreement with the University of California System. Corresponding genomic DNA sequences deposited in GenBank by publicly funded sequencing projects were also utilized.
Cloning of a Novel mSiglec (mSiglec-F) cDNA-An EST clone (GenBank accession number AA672534), which shows sequence similarity to hSiglec-5, was identified and kindly provided by Dr. Paul Crocker (Dundee University, UK). The clone contained the 3′-end of the cDNA (~1.6 kb), including the coding sequence for the third (partial) and fourth Ig-like domains, a transmembrane domain and cytosolic tail, and a 3′-untranslated region (UTR). This putative mSiglec was tentatively given the letter designation mSiglec-F, because it is the sixth mSiglec reported so far and because of the difficulty in unequivocally identifying the human ortholog of this molecule. In an independent attempt to clone novel mSiglecs in our laboratory, a few cDNA fragments corresponding to the first Ig-like domain of mSiglecs were cloned, one of which was later revealed to be identical to mSiglec-F. Mouse bone marrow total RNA (C57Bl/6 strain) was prepared with an RNeasy Midi kit (Qiagen) and used for the cDNA cloning. The 5′-end cDNA sequence of mSiglec-F was elucidated by 5′-rapid amplification of cDNA ends (RACE), as follows: the first-strand cDNA was synthesized with Superscript II (Invitrogen) using mSL5-SP1 (5′-CTGAACTCTATTGTCACTCC-3′) as primer, and oligo(dA) was added to the 3′-end of the first-strand cDNA using terminal deoxynucleotidyl transferase (Invitrogen) and dATP. Gene-specific cDNA fragments were amplified using the Expand Long Template PCR system (Roche Molecular Biochemicals), with first-round PCR using dT-NXB (5′-GCTAGCTCGAGGATCC(T)18-3′) and mSL5-SP2 (5′-TGTCACTCCAAGCTTCAC-3′), followed by nested PCR using NXB (5′-TTTGCTAGCTCGAGGATCCTTTT-3′) and mSL5-SP3 (5′-GACTTAACCGTGAAGGAGCC-3′) as primers. DNA fragments obtained were cloned into pCRII-TOPO TA cloning vector (Invitrogen), and sequences were analyzed.
Primers were designed based on the revealed sequence, and a full-length cDNA was cloned by reverse transcription-PCR, as follows: random-primed first-strand cDNA was synthesized using SuperScript II with random DNA hexamer as primer, and the cDNA was amplified by PCR with primers mSL5 5′-UTR (5′-TAGAGCCTTTCACTGTGTGATGCGG-3′) and mSL5 3′-UTR1 (5′-CTTCATTCCTTCCAAAACTCCCTCC-3′), followed by a nested PCR with mSL5 5′-Expr (5′-CCCTCTAGAGCCACCATGCGGTGGGCATGGCTGCTGCCCC-3′; XbaI site is underlined) and mSL5 3′-UTR2 (5′-CGCGGATCCGGACTCCAGCCAGCACTGGG-3′; BamHI site is underlined). The resulting DNA fragments were digested with XbaI and BamHI, cloned into XbaI-BamHI sites of pBluescript II (Stratagene), and sequences were analyzed.
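The restriction sites carried by the two nested-PCR primers can be located with a few lines of code. The following is a minimal sketch, not part of the original protocol: the helper name `find_site` is hypothetical, the recognition sequences (XbaI, TCTAGA; BamHI, GGATCC) are the standard ones for these enzymes, and the primer strings assume that the hyphens printed inside the sequences are line-break artifacts.

```python
# Standard recognition sequences for the two enzymes used in the cloning step.
SITES = {"XbaI": "TCTAGA", "BamHI": "GGATCC"}

# Primer sequences as given in the text; hyphens inside the printed
# sequences are assumed to be line-break artifacts and are removed here.
MSL5_5_EXPR = "CCCTCTAGAGCCACCATGCGGTGGGCATGGCTGCTGCCCC"
MSL5_3_UTR2 = "CGCGGATCCGGACTCCAGCCAGCACTGGG"


def find_site(primer: str, enzyme: str) -> int:
    """Return the 0-based index of the first recognition site, or -1 if absent.

    Hypothetical helper for illustration only.
    """
    return primer.upper().find(SITES[enzyme])


xba_pos = find_site(MSL5_5_EXPR, "XbaI")   # site follows a 3-nt 5' clamp
bam_pos = find_site(MSL5_3_UTR2, "BamHI")  # site follows a 3-nt 5' clamp
start_codon = MSL5_5_EXPR[15:18]           # first codon after the site and spacer
```

In both primers the site sits three nucleotides from the 5′-end, and in the forward primer the ATG start codon follows six nucleotides after the XbaI site.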
Preparation of a Full-length mSiglec-F Expression Construct and Introduction of an R114A Point Mutation-The cDNA fragment of mSiglec-F (XbaI-BamHI) was subcloned into XbaI-BamHI sites of pcDNA3.1(−) for expression of the full-length protein in mammalian cells. A point mutation (R114A) was introduced into mSiglec-F/pBluescript II using a QuikChange site-directed mutagenesis kit (Stratagene), using primers mSL5 R114A Fwd (5′-GGGACGTACTTCTTCGCCTTGGATGGTTCTGTG-3′; the codon for A114 is underlined) and mSL5 R114A Rev (complementary to mSL5 R114A Fwd). The DNA sequence was analyzed, and an appropriate fragment (XbaI-BamHI) was subcloned into XbaI-BamHI sites of pcDNA3.1(−).
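Because the reverse mutagenic primer is described only as complementary to the forward primer, its sequence can be derived as a reverse complement. The sketch below is illustrative: `reverse_complement` is a hypothetical helper, the forward primer assumes that the hyphen in the printed sequence is a line-break artifact, and the position of the alanine codon assumes that the reading frame begins at the first base of the primer.

```python
# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


# Forward primer from the text, hyphen removed (assumed line-break artifact).
R114A_FWD = "GGGACGTACTTCTTCGCCTTGGATGGTTCTGTG"

# Reverse primer, written 5' to 3'.
R114A_REV = reverse_complement(R114A_FWD)

# Assumption: the frame starts at base 1, so codon 6 (bases 16-18) is the
# alanine codon that replaces the arginine codon.
mutant_codon = R114A_FWD[15:18]
```

Under these assumptions the reverse primer reads 5′-CACAGAACCATCCAAGGCGAAGAAGTACGTCCC-3′, and the sixth codon of the forward primer is GCC (alanine).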
mSiglec-F-Fc Protein Preparation-The mSiglec-F cDNA fragment encoding the first two Ig-like domains was amplified by PCR using Pwo polymerase (Roche Molecular Biochemicals) with primers mSL5 5′-Expr and mSL5 3′-Chi2D (5′-ATCTCCCCAGGACACCCTGATGGTC-3′), digested with XbaI, and cloned into XbaI-EcoRV sites of EK-Fc/pEDdC vector (16,18,19). Upon mammalian cell transfection, the resulting construct (mSiglec-F-EK-Fc/pEDdC) expressed a recombinant soluble mSiglec-F-Fc protein (a fusion protein of the first two Ig-like domains of mSiglec-F and human IgG Fc fragment, with an enterokinase cleavage site/FLAG tag (DYKDDDDK) between them). The construct also contains mouse dihydrofolate reductase (DHFR) cDNA as a selection marker. A CHO cell line stably expressing mSiglec-F-Fc protein was established by transfection of CHO-DHFR− cells with mSiglec-F-EK-Fc/pEDdC, followed by selection in minimal essential medium-α supplemented with 10% dialyzed fetal calf serum and 20-200 nM methotrexate. Clones expressing the recombinant protein were established by manual colony isolation, and the best clone was expanded. Recombinant fusion protein secreted into culture medium was purified by adsorption to Protein A-Sepharose (Amersham Pharmacia Biotech).
Sialic Acid Binding Specificity of mSiglec-F-Sialic acid linkage specificity was analyzed by a standard ELISA as previously described (16,18,19). Briefly, protein A (5 μg/ml in 50 mM sodium bicarbonate buffer, pH 9.5) was immobilized to the wells of 96-well plates (Nunc, 269620) overnight at 4°C. The wells were washed three times with ELISA buffer (20 mM HEPES, 125 mM NaCl, 1 mM EDTA, 1% bovine serum albumin, 0.01% NaN3, pH 7.45), blocked with the same buffer at room temperature for 1 h, and then incubated successively with the following (each followed by three washes with the ELISA buffer): mSiglec-F-Fc (5 μg/ml), 2 h; biotinylated polyacrylamide probes (PAA-Bio) multiply substituted with sialylated oligosaccharides (Glycotech, Rockville, MD; 10 μg/ml), 2 h; alkaline phosphatase-conjugated streptavidin (Invitrogen, 1:1000 dilution), 1 h. After the final wash, phosphatase substrate (10 mM p-nitrophenyl phosphate in 100 mM sodium carbonate and 1 mM MgCl2) was added to each well, and absorbance at 405 nm was measured using a microplate reader after incubation at room temperature for up to 3 h.
Preparation of Rat Monoclonal Antibodies against mSiglec-F-Lou rats (Harlan, Indianapolis, IN) were immunized intraperitoneally with mSiglec-F-Fc. Briefly, rats were immunized three times with 50 μg of immunogen in Titermax Gold (Cytrx, Norcross, GA) followed by a final boost with 50 μg of immunogen in phosphate-buffered saline. After fusion of the spleen cells with the Yb/20 myeloma cells, the hybridomas were screened by ELISA on antigen-coated plates. The initial screen on mSiglec-F-coated plates was followed by a second screen on human IgG-coated plates. Clones positive only for mSiglec-F were tested by flow cytometry on primary cells. Positive clones were subcloned twice and isotyped using the monoclonal antibody-based rat Ig Isotyping kit (BD Biosciences Pharmingen, San Diego, CA). All clones gave a similar staining pattern on primary cells. Clone E50-2440 (rIgG2a) was selected on the basis of its staining intensity on primary cells.
on FACSCalibur (BD Biosciences, San Jose, CA), and the profiles were analyzed using the software CellQuest (BD Biosciences).
Immunohistochemical Analysis Using the mSiglec-F-specific Antibodies-Immunohistology was performed on frozen sections of a number of mouse tissues using purified rat antibody E50-2440 to mSiglec-F. Following a 10-min fixation in acetone, the sections were washed in Tris-buffered saline (TBS), and nonspecific binding sites were blocked with 1% bovine serum albumin in TBS. Endogenous biotin in tissues was blocked by sequential incubations in 0.1% avidin, followed by washes, and then 0.01% biotin, followed by washes. Anti-mSiglec-F antibody, at 1 μg/ml, was incubated on the tissue sections for at least 30 min in a humidified chamber. Irrelevant primary antibodies were used as controls. Unbound antibody was washed off, and the sections were sequentially incubated with biotinylated anti-rat antibody (BD Biosciences Pharmingen, San Diego, CA) for 30 min, followed by washes, and then with alkaline phosphatase-conjugated streptavidin (Jackson ImmunoResearch), followed by washes. Bound antibody was then detected using the Vector Blue substrate, nuclei were counter-stained using nuclear fast red (Vector Laboratories, Burlingame, CA), and the sections were then mounted for viewing by bright-field microscopy. Digitized images were captured using a Sony DKC 5000 charge-coupled device camera, IMAGE (National Institutes of Health), and Photoshop (Adobe).
For double immunostains, fluorescent labels were used with sections of mouse thymus and spleen. Serial sections were first singly immunostained using either anti-Siglec-F or biotinylated anti-CD11b antibodies, and separate sections were used for the double labeling. Specific binding was detected using CY3-conjugated donkey anti-rat or CY2-labeled streptavidin (Jackson ImmunoResearch). Digitized photographs were taken using epifluorescence with a Zeiss microscope and red or green barrier filters. Overlays were performed digitally using Photoshop.
Genomic Sequence Analysis-We searched for Siglec genes and pseudogenes in mouse genomic DNA sequence in the Celera mouse genome database using the TBLASTN program (20) with known human and mouse Siglec sequences (the N-terminal 150 amino acids, approximately covering the first Ig-like domain) as templates. Identification of hSiglec pseudogenes was previously reported (6). Sequence matches with expectation (E) value < 1 were considered significant. Only DNA segments encoding a remnant (fossil) of a full V-set domain exon were considered as pseudogenes. Thus, a very short (~100 nucleotides) DNA segment immediately upstream (~0.5 kb) of the mSiglec-3/CD33 gene showing significant similarity to mSiglec-3 was not counted as a pseudogene. A DNA segment showing very high (>95%) sequence identity to mSiglec-G but lacking the V-set domain was also excluded.
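The two numerical criteria in this search, a 150-residue N-terminal query and an E-value cutoff of 1, can be expressed as a short filter. This is a sketch under stated assumptions rather than the pipeline actually used: it assumes hits are available in the 12-column tab-separated layout of current BLAST+ (`-outfmt 6`, with the E-value in column 11), which may differ from the output of the Celera interface, and the names `Hit`, `n_terminal_query`, and `significant_hits` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Hit:
    query: str
    subject: str
    evalue: float


def n_terminal_query(protein: str, length: int = 150) -> str:
    """Return the N-terminal segment used as the search query."""
    return protein[:length]


def significant_hits(lines: Iterable[str], max_evalue: float = 1.0) -> List[Hit]:
    """Keep hits whose E-value is below the cutoff.

    Assumes BLAST+ tabular output (outfmt 6): tab-separated fields with
    query ID, subject ID, and E-value in columns 1, 2, and 11.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        hit = Hit(fields[0], fields[1], float(fields[10]))
        if hit.evalue < max_evalue:
            hits.append(hit)
    return hits
```

The further pseudogene criteria described above (a remnant of a full V-set domain exon) require inspection of the matched segment and are not captured by this filter.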
Genomic structures of human and mouse Siglec genes were deduced from sequence alignment of cDNA with genomic DNA, with adjustments to conform to standard sequence context of splicing sites.
Phylogenetic Analysis-Nucleotide sequences for most of the Siglec cDNAs from mouse (mSiglecs 1 through 4 and E) and human (hSiglecs 1 through 11 and L1) were obtained from GenBank. The sequence of mSiglec-H was determined by sequencing an EST clone (IMAGE clone number 1430130) containing a full-length cDNA of this gene (footnote 5). The cDNA sequence of mSiglec-G was deduced from a corresponding genomic DNA region based on sequence similarity to hSiglec-10. Nucleotide sequences encoding the signal peptide and the first and second Ig-like domains of Siglecs were translated and aligned using ClustalW (available at www2.ebi.ac.uk/clustalw) and used for phylogenetic analysis using PAUP 4.0 (Sinauer Associates). A phylogenetic tree was reconstructed using the neighbor-joining method (21) from a distance matrix based on mean amino acid difference. Phylogenetic trees of essentially identical topology were obtained by the maximum parsimony method (22,23) as well as from a distance matrix based on nucleotide sequences (data not shown). The sequence of hSiglec-L1 was omitted from the analysis because of its unusual structure (two V-set Ig-like domains) (6,7). A phylogenetic tree of only the Siglec-3-related Siglecs and Siglec-4, based on the amino acid sequences of transmembrane + cytosolic domains, was also reconstructed by the same method. The sequence of mSiglec-H was omitted from this analysis due to its very short cytosolic domain, which would greatly restrict the amount of information usable for phylogenetic analysis if included.
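A distance matrix based on mean amino acid difference can be illustrated with a short calculation of the proportion of differing residues between aligned sequences. The sketch below is not the PAUP implementation: the function names are hypothetical, and skipping any column in which either sequence has a gap (pairwise deletion) is an assumption, since the gap treatment is not stated in the text. Tree building by neighbor joining is not shown.

```python
from typing import Dict, Tuple


def p_distance(a: str, b: str, gap: str = "-") -> float:
    """Proportion of differing residues between two aligned sequences.

    Columns where either sequence has a gap are skipped (pairwise
    deletion, an assumption made for this sketch).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differing = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return differing / compared


def distance_matrix(aligned: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """Pairwise p-distances for every ordered pair of named sequences."""
    names = list(aligned)
    return {(m, n): p_distance(aligned[m], aligned[n]) for m in names for n in names}
```

For example, the toy aligned pair `MKVL-A` and `MRVLGA` (not real Siglec sequences) has five comparable positions and one difference, giving a distance of 0.2.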

RESULTS
Cloning of a Novel mSiglec cDNA-A partial mouse EST clone showing high sequence similarity to hSiglec-5 was identified, and full-length cDNA of this molecule was cloned through 5′-RACE and reverse transcription-PCR from mouse bone marrow. The cDNA encodes a Siglec-like molecule (569 amino acids) with four Ig-like domains, a single-pass transmembrane domain, and a cytosolic tail (Fig. 1A). The amino acid sequence of the translated product contains the defining structural features of Siglecs, i.e. three amino acids whose side chains are known to contact sialic acid in Siglec-1/Sialoadhesin (a conserved arginine residue and two aromatic amino acids, one near the N terminus and the other near the conserved arginine residue) (24), as well as three conserved cysteine residues in each of the first and second Ig-like domains (Fig. 1A). Judging from the sequence similarity of the first two Ig-like domains and cytosolic tail, this new molecule belongs to a group of Siglecs related to Siglec-3, which typically contains two putative tyrosine-based signaling motifs in both human and mouse (3,4). However, despite high sequence similarity to hSiglec-5 in the fourth Ig-like domain and shared Ig-like domain numbers, the similarity is not exclusive to hSiglec-5 in the other regions of the molecule (Fig. 1B). Because this new mouse molecule is the sixth mSiglec to be reported and its human ortholog could not be unequivocally established at this point, it was tentatively given the letter designation mSiglec-F. The previously reported mSiglec-"3"/CD33 (13) and mSiglec-E/MIS (14,15) also lack clear orthologous correspondence to hSiglecs. As discussed below, this issue is addressed here by multiple approaches, including phylogenetic analysis and whole-genome survey of human and mouse Siglecs, as well as analyses of gene structure and chromosomal localization.
Sialic Acid Binding Specificity of the Novel mSiglec-COS-7 cells transiently transfected with mSiglec-F full-length expression construct showed clear sialic acid-dependent binding of human erythrocytes (Fig. 2), justifying the denomination of this molecule as a Siglec (1). A point mutation at the conserved arginine residue (R114A) abolished this binding, proving that this arginine residue is essential in sialic acid recognition by this molecule (Fig. 2), as has been shown in all other Siglecs studied to date (6,16,18,25-28). Unlike some previously reported Siglecs, pretreatment of the COS cells was not essential to liberate the cell surface Siglec from masking by cis-ligands (Fig. 2). To determine a more detailed ligand binding specificity, a recombinant fusion protein mSiglec-F-Fc was prepared and used in an ELISA, with biotinylated polyacrylamide arrays carrying multiple copies of sialylated oligosaccharides as ligands (16,18,19). As shown in Fig. 3, mSiglec-F showed a strong preference toward α2-3-linked sialic acids. This is in contrast to many Siglec-3-related hSiglecs, which show rather promiscuous binding to both α2-3- and α2-6-linked sialic acids (3,29). Substitution at the proximal GlcNAc with a fucose residue (i.e. sialyl Lewis x) strongly interfered with the interaction. The latter finding is similar to that for hSiglecs recognizing α2-3-linked sialic acids (29), with the known exception of hSiglec-9 (16).
Expression Pattern of the Novel mSiglec-Rat monoclonal antibodies against mSiglec-F were prepared as described under "Experimental Procedures" and used for flow-cytometry and immunohistochemical analyses. Four clones with identical staining patterns were obtained, and the clone (E50-2440) with the highest staining intensity on primary cells was selected for the phenotyping studies. The staining profile in bone marrow shows that mSiglec-F is present on CD11b^lo as well as on CD11b^hi cell populations (Fig. 4A). However, the expression of mSiglec-F is considerably higher on the CD11b^lo population. A similar staining profile for mSiglec-F was seen in cells isolated from mouse blood and spleen, although the actual percentage of CD11b-positive cells was lower in these compartments (data not shown). Detailed analysis with appropriate markers (see "Experimental Procedures") indicated that mSiglec-F is not expressed on T cells, B cells, or NK cells, thus restricting its expression to cells of the myeloid and monocytic lineages (CD11b+ or Ly6G+, Fig. 4A and data not shown). Cells expressing mSiglec-F are also positive for CD54 and CD95 but negative for CD14, CD11c, CD80, CD117, Ly6A/E, Ly6C, and Mac-3. Identical staining profiles for all these tissues were seen in C3H, Balb/c, and C57BL/6 strains of mice, indicating that expression of mSiglec-F is not restricted by strain differences (data not shown).
Footnote 5: T. Angata and A. Varki, unpublished.
To further characterize the cell populations expressing mSiglec-F, bone marrow cells that had been sorted by fluorescence-assisted cell sorting were cytospun onto slides, air-dried, stained with Wright-Giemsa stain, and examined under oil-immersion microscopy. Cells that were CD11b^lo/mSiglec-F^hi were identified as being predominantly immature cells of myelomonocytic lineage (Fig. 4B). Cells that were CD11b^lo-hi/mSiglec-F^lo were mostly mature or immature neutrophils and monocytes (data not shown), including leukocytes with ring-shaped nuclei that are now thought to include murine granulocytes, monocytes, and their immediate precursors (30). In contrast, the CD11b^neg/mSiglec-F^neg population consisted mainly of lymphocytes and erythrocyte precursors (data not shown). The cells showing high Ly6G (Gr-1) expression (presumably the most mature neutrophils) were not strongly mSiglec-F-positive (data not shown), suggesting that the level of this Siglec falls as the cells of the myelomonocytic lineage reach full maturity. However, this conclusion remains uncertain, because of the current controversy over the precise morphological basis for differentiating between mouse neutrophils and monocytes and their immediate precursors (30).
Immunohistochemical analysis of tissue sections also supports the conclusion that mSiglec-F expression is restricted to the cells of myelomonocytic lineage (Fig. 5). In the thymic medulla (Fig. 5, A-C), most of the cells that expressed mSiglec-F also expressed CD11b (Mac-1); however, rare cells positive for mSiglec-F alone were also seen. In the spleen, the expression of mSiglec-F did not completely overlap with that of CD11b (Fig. 5, D-F), whereas it showed better correlation with that of Ly6G (Gr-1) (data not shown). Scattered mSiglec-F-positive cells were also noted in the lung parenchyma. No staining was seen in the heart, liver, kidney, pancreas, lymph nodes, brain, salivary gland, or gastrointestinal tract. Taken together, the data indicate that there is a prominent expression of mSiglec-F on immature bone marrow cells of the myelomonocytic lineage, and a decreased expression in mature neutrophils and monocytes. The nature of the mSiglec-F-positive cells found in thymus, spleen, and lung requires further investigation. Of particular note, the high level of mSiglec-F expression on immature bone marrow cells of myelomonocytic lineage (see Fig. 4) provides the first known marker that is selective for these cell populations in the mouse.
FIG. 1. Primary sequence of mSiglec-F. A, cDNA and amino acid sequences of mSiglec-F. Annotations are as follows: "essential" arginine in typical Siglec V-set domains, double circle; aromatic amino acid residues typical of Siglec V-set domains, circle; signal peptide and transmembrane domain, underlined with hatched and double lines, respectively; exon junctions, arrowheads; potential N-glycosylation sites, underlined; putative ITIM motif and another tyrosine-based motif conserved among Siglecs, boxes with solid and hatched lines, respectively. B, sequence alignment of mSiglec-F with hSiglecs 5 and 6. Domain borders (exon junctions) are shown with arrowheads.
Comparison of Mouse and Human Siglec-3-related Gene Clusters-To gain an overview of all mSiglecs and to compare the structure of mouse and human Siglec-3-related gene clusters, we analyzed mouse genomic DNA sequences recently assembled to near completion by Celera Genomics. The previously known mSiglecs with clear human orthologs (Siglec-1/Sialoadhesin, Siglec-2/CD22, and Siglec-4/myelin-associated glycoprotein) were confirmed to be single-copy genes. We also found a mouse genomic region (chromosome 7, cytological band 7B2) syntenic to the hSiglec-3-like gene cluster (chromosome 19, cytological band 19q13.3-13.4) in which a number of mSiglec genes and pseudogenes are clustered (Fig. 6). Although the same genomic region is also being sequenced by the publicly funded mouse genome sequencing project, the relevant scaffolds are not assembled into a contiguous entity as of the time of this writing. In the human genome, the Siglec-3-related gene cluster is flanked on the centromeric side by the kallikrein gene cluster, ~40 kb away (the closest gene is KLK-L6/KLK14), and on the telomeric side by a hyaluronic acid synthetase gene (hyaluronan synthase 1, HAS1), ~90 kb away. The mSiglec-3-related gene cluster is also flanked by the kallikrein gene cluster on one side. On the other hand, the HAS1 ortholog was not found near the cluster. According to a recent study (31), there is a discontinuity of mouse-human synteny between ETFB (electron transfer flavoprotein β subunit) and HAS1. In fact, a region of ~3 Mb containing the mouse Siglec-3-related gene cluster appears to have undergone intrachromosomal translocation, whereas the genomic segment containing mouse HAS1 is found on mouse chromosome 17 (31). Nevertheless, all but one canonical Siglec-3-like sequence (mSiglec-H) are found in the above cluster in the mouse genome. Thus, we consider the mouse Siglec-3-related gene cluster on chromosome 7 to be unique and the only cluster equivalent to that in the human genome.
Besides mSiglec-F, there is only one other Siglec-like putative gene (mSiglec-G) in the cluster yet to be formally reported. It appears to be an ortholog of hSiglec-10, judging from its high sequence identity to hSiglec-10 and similar chromosomal localization in relation to the proximal genes (ETFB, NKG7, and LIM2). Including this putative gene (which is a functional gene according to Dr. P. Crocker; footnote 6), there are four Siglec genes and two pseudogenes in this region. This is in striking contrast to the syntenic region of the human genome, which contains seven Siglec genes, one Siglec-like gene (Siglec-L1), and thirteen pseudogenes (exon fossils of Siglec-like Ig V-set domains). As mentioned earlier, the Siglec-3-related mouse genes do not show strong one-to-one correspondence to hSiglecs, with the only exception of mSiglec-G, the putative hSiglec-10 ortholog. However, comparison of the gene maps of Siglec-3-related human and mouse gene clusters, as well as the gene structures of individual Siglecs (Fig. 7), provides clues to address the orthology between mouse and human Siglecs. First, the previously reported mSiglec-"3"/CD33 (13) has a shared domain structure with hSiglec-3 (Fig. 7) and a similar chromosomal localization in relation to other Siglec genes (Fig. 6). Thus, it is the best candidate for an ortholog of hSiglec-3/CD33. By the same criteria, mSiglec-E/MIS (14,15) is a probable ortholog of hSiglec-9. Although hSiglecs 7 and 9 are localized close together in the human gene cluster and show a high degree of similarity at the mRNA and polypeptide levels, their genomic structures are different: the hSiglec-7 gene contains a remnant of a V-set domain exon (Siglec-P2) between the exons encoding the first and second Ig-like domains of the protein product, whereas hSiglec-9 and mSiglec-E lack this signature motif (Fig. 7).
Identification of the mSiglec-F ortholog in humans is not as straightforward, due to the lack of useful genomic landmarks and the presence of two candidate genes, namely hSiglecs 5 and 6; the former has exons for four Ig-like domains, whereas the latter has exons for three Ig-like domains and one exon fossil similar to the fourth Ig-like domain exon in mSiglec-F and hSiglec-5 (Fig. 7). However, in view of the fact that, unlike hSiglec-6, mSiglec-F is not expressed on B lymphoid cells but restricted to cells of myeloid lineage, we consider that hSiglec-5 is more likely to be the ortholog of mSiglec-F. Further support comes from the finding that mSiglec-F is not expressed in the mouse placenta (data not shown), whereas hSiglec-6 is expressed in this fetal organ (19).
Phylogenetic Analysis of Human and Mouse Siglecs-To further understand the phylogenetic relationships of the mSiglecs and hSiglecs, we separately compared the N-terminal (first two Ig-like domains) and C-terminal (the transmembrane domain plus the cytosolic tail) amino acid sequences of the Siglecs. As shown in Fig. 8A, Siglecs 1, 2, and 4 of mouse and human show a clear one-to-one correspondence in the phylogenetic tree based on the amino acid sequences of the first two Ig-like domains. On the other hand, the Siglec-3-related Siglecs do not show such a clear relationship. However, mSiglec-G appears to form a monophyletic clade with hSiglecs 10 and 11, and mSiglec-E with hSiglecs 7, 8, and 9 (Fig. 8A). In these cases, the lack of one-to-one orthologous correspondence can be attributed to specific gene duplication(s) that took place in the lineage leading to humans after the split of the primate and rodent lineages (estimated to be 80-100 million years ago) (32). This interpretation is also supported by the phylogenetic tree reconstructed from the amino acid sequences of transmembrane plus cytosolic domains of Siglecs (Fig. 8B). Here again mSiglec-E forms a monophyletic clade with hSiglecs 7, 8, and 9 (and L1), and mSiglec-G groups with hSiglec-10. The fact that hSiglec-11 does not belong to the "Siglec-10 clade" in this tree suggests that this gene is a product of a gene duplication and recombination. On the other hand, the evolutionary relationships of other Siglec-3-related genes in this analysis are not quite consistent with our interpretation of orthology. In the tree based on N-terminal amino acid sequences (Fig. 8A), mSiglec-3 and hSiglec-3 are paraphyletic rather than monophyletic, and the same is true of mSiglec-F and hSiglec-5. Also, this clade contains Siglecs with 2 and 4 Ig-like domains (hSiglec-6 has an exon fossil corresponding to the fourth Ig-like domain, as discussed above).
Footnote 6: P. Crocker, personal communication.
These anomalies may be due to unequal genetic recombination(s), including gene conversion of some Siglec genes, which consist mostly of modular exons encoding one Ig-like domain each, with the potential for "exon shuffling." The phylogenetic tree of C-terminal amino acid sequences (Fig. 8B) does not provide significant clues to resolve this problem, either. A more precise understanding of the evolutionary paths taken by this gene clade will require more data on related genomic sequences from other mammalian species.
Another clear implication of this phylogenetic analysis is that the ancestral Siglec-3-related genes (proto-Siglecs 3, 9, 10, and possibly 5) were already present in the common ancestor of primates and rodents. In addition, at least one of these members of distinct clades seems to have survived in each lineage through repeated gene duplications and gene losses. This finding suggests that, despite common basic properties (sialic acid recognition and cellular signaling), these molecules are not redundant but rather are playing distinct roles, as has been proposed previously, based on their distinct expression patterns in different cell types (3).
The mouse genomic sequences used for these analyses were largely from the Celera Genomics database, with some input from the publicly funded mouse genome project.
FIG. 8. Phylogenetic trees of mouse and human Siglecs. A, phylogenetic tree derived from N-terminal amino acids (signal peptide + first and second Ig-like domains) of Siglecs. Human Siglec-L1 was excluded from the analysis due to its unusual structure (two tandem V-set domains). B, phylogenetic tree derived from C-terminal amino acids (transmembrane + cytosolic domains) of Siglec-3-related Siglecs and Siglec-4. Mouse Siglec-H was excluded from the analysis due to its very short cytosolic tail. Siglecs 1 and 2 were also excluded, due to the lack of C-terminal sequence similarity to the Siglecs being analyzed. An isoform of mSiglec-4 lacking exon 12 (42) was used for sequence alignment. Major monophyletic branches containing Siglec-3-related mouse and human Siglecs are indicated with quadrants. Internodes supported by high bootstrap values (for 1000 bootstrap replications) are indicated with closed (>90%) and open (70-90%) circles, respectively. The sequences used for these analyses are presented in the supplemental material.

DISCUSSION
When the Siglecs were first recognized as a distinct family of sialic acid binding lectins, there were only four members (1). Of these, Siglec-1/Sialoadhesin, Siglec-2/CD22, and Siglec-4/MAG have clear orthologs in humans and mice, each with a very similar domain structure, cell-type distribution, and sialic acid binding specificity. An apparent murine ortholog of human CD33 (subsequently called Siglec-3 in humans) had already been reported (13). However, the subsequent discovery of an entire subfamily of human Siglec-3-related Siglecs complicated this interpretation (16). The sequence of another Siglec-3-related mSiglec called MIS (15) or mSiglec-E (14) did not serve to resolve the situation. Here we report the molecular cloning and characterization of another novel mSiglec, as well as a comprehensive genomic overview of mouse and human Siglecs, with a particular focus on comparing the mouse and human Siglec-3-related gene clusters. The overall conclusion is that humans have many more Siglec-3-related genes and pseudogenes than mice do. This analysis has also revealed differential and complex evolutionary paths, which seem to have been more extensively exploited in the primate lineage, as well as the ancient origin of this multigene cluster, predating the split of the primate and rodent lineages (estimated to be 80-100 million years ago).
Based on expression analysis with monoclonal antibodies, phylogenetic analysis, gene structure, and gene maps, we propose that the new mouse Siglec, mSiglec-F, is likely to be an hSiglec-5 ortholog. The general expression pattern of mSiglec-F on hematopoietic cells (on the myelomonocytic cell lineage) is indeed similar to that of hSiglec-5 (33) and dissimilar to that of the other human homolog to mSiglec-F, hSiglec-6, which is found on B cells and placental trophoblasts (19). However, there is no report on the analysis of hSiglec-5 expression in tissues other than peripheral leukocytes. Also, it remains possible that a major chromosomal rearrangement in the past might have eliminated the true ortholog of hSiglec-5 in the lineage leading to mouse, and that another Siglec ancestor might have been recruited into this expression pattern. In this respect, it is of note that recent analysis of mouse chromosome 7 indicates a chromosomal breakage between ETFB and HAS1 in the lineage leading to the mouse, immediately adjacent to the Siglec-3-related gene cluster (31). A related report has suggested that this and some other breakages in chromosome 7 may be unique to the mouse (or perhaps to rodents), because the linkage maps of chromosome regions syntenic to the human chromosome 19 q-arm in primates, dog, cat, and cattle are more or less similar to that of human (17). To resolve this and other Siglec orthology issues with better precision, we need more information on genomic DNA sequences of Siglec-3-related gene clusters of other mammalian species, including other rodents (e.g. rat) and primates (e.g. baboon and chimpanzee), as well as other mammals such as bovids and carnivores. Many of these taxa are currently being actively pursued or considered as candidates for genome sequencing and/or major EST projects.
Human chromosome 19 is unusual in terms of its higher gene density compared with other chromosomes (34). Hence the syntenic chromosomal regions in other mammalian organisms may well become good candidates for pilot study by BAC-based genomic sequencing. To understand the evolution of Siglecs on a deeper time scale, we would have to evaluate other vertebrate branches as well. In this regard, we have noted that some EST clones from African clawed frog Xenopus laevis (GenBank™ accession numbers BG017920 and BG731140) and some genomic DNA fragments of puffer fish (available at www.jgi.doe.gov/programs/fugu.htm) show distinct similarities to mammalian Siglecs, including some of the highly conserved amino acid residues. 7 This strongly suggests that the Siglec family of molecules is present in all vertebrates. On the other hand, we failed to find Siglec orthologs in the Drosophila and Caenorhabditis elegans genomes (16). Thus, true Siglecs may have emerged only in the deuterostome lineage of animals, where sialic acids are prominently expressed.
The presence of putative intracellular signaling motifs in the cytosolic tail of mSiglec-F suggests that it is involved in intracellular signaling, as has been shown in some other Siglec-3-related Siglecs (7,14,15,28,35,36). Although it is beyond the scope of this report, the identification of proteins interacting with the cytosolic tail of mSiglec-F is of interest, especially those that interact with the distal motif, which does not conform to the canonical ITIM motif but is highly conserved among all Siglec-3-related Siglecs. We are also pursuing natural ligand(s) for the extracellular domain of mSiglec-F, which may regulate intracellular signaling. Although α2-3-linked sialic acids are abundantly expressed in mouse tissues, mSiglec-F may prefer a limited number of specific ligands expressing this structure in a preferred manner, or in a particular polypeptide structural context, as in the case of the specific interaction of P-selectin with P-selectin glycoprotein ligand-1 (37). Alternatively, α2-3-linked sialic acid recognition by mSiglec-F, regardless of other structural contexts, may constitutively send suppressive signals via its cytosolic tail. As in the interaction of killer cell Ig-like receptors and major histocompatibility complex molecules (38), such a suppressive signal may cease in the absence of ligand, e.g. upon contact with abnormally under-sialylated cells (e.g. virus-infected cells), which may in turn lead to elevation of the cellular activity (16). The apparent absence of mSiglec-F masking by cis-ligands conforms to this hypothesis. We are currently pursuing the functional aspects of mSiglec-F, especially the importance of sialic acid recognition by this molecule, via a targeted exon-replacement approach in mice.
Although we consider our genomic sequence-based analysis of human and mouse Siglecs "comprehensive," there are several caveats. First, we are assuming the human and mouse genomic data bases we have probed are complete. Second, we have noted a genomic DNA segment (~1.2 kb) that shows >99% sequence identity to a 5′-region of the hSiglec-5 gene, containing coding sequences for the signal peptide and Ig-like domains 1 and 2 (indicated as a triangle labeled "5*" in Fig. 6). This segment is upstream of the true hSiglec-5 gene, is found in both the public and Celera human genome data bases, and could be confirmed by PCR analysis. 8 Given the extremely high sequence identity to the Siglec-5 genomic region, this DNA segment is likely to be a relic of a very recent partial gene duplication. However, due to the lack of coding sequences for the downstream components (Ig-like domains, transmembrane domain, or cytosolic tail), we cannot be certain if this genomic DNA segment yields a functional mRNA or translated polypeptide. On the other hand, we cannot confidently assign it as a pseudogene. Third, we have noted a GenBank™ entry by the Human Genome Project derived from an unfinished human BAC clone sequence (accession number AC026638, clone RP11-423F16) that contains several putative Siglec genes and pseudogenes. Although it is assigned to chromosome 8, we have found that many fragments (contigs) of this clone show extremely high sequence identity (>99.5%) to another unfinished BAC clone CTD-3187F8 (GenBank™ accession number AC063977), which encompasses a part of the Siglec-3-related gene cluster on chromosome 19. Given the existence of so-called "segmental duplication" (34), or "paralogous genome evolution" (39), in which long genomic DNA segments (up to 200 kb) duplicate by an as yet unknown mechanism, it is possible that the chromosome 8 region covered by the clone RP11-423F16 actually represents another Siglec-3-related gene cluster, which was duplicated very recently. However, evidence against this possibility is provided by fluorescence in situ hybridization studies of Siglec-3/CD33 (40) and Siglec-7 (41). In both cases hybridization signals were localized exclusively on chromosome 19 (cytological band 19q13.3-13.4). If clone RP11-423F16 was indeed derived from chromosome 8, the DNA sequences nearly identical (>99%) to these genes in the chromosome 8 region represented by this clone should have shown an additional signal in the fluorescence in situ hybridization analysis. Also, there is only one unique Siglec-3-related gene cluster in the Celera Genomics human genome data base. Overall, it seems more likely that this is a technical artifact, such as misidentification of a clone or cross-contamination by an adjacent BAC clone, among other possibilities. Finally, there is one more unique Siglec-like putative gene in the human and mouse genomes that has not been studied. This hypothetical molecule was identified by a homology search of genomic and EST data bases, sequencing of a mouse EST clone, and by mutual subtraction of mouse and human genomic DNA sequences. 9 However, the number and positions of cysteine residues in the first Ig-like domain of this molecule are different from those of typical Siglecs, and one of the cysteines is situated immediately adjacent to the "conserved arginine" residue involved in sialic acid recognition by known Siglecs. Thus, it is questionable if this hypothetical protein will recognize sialic acids. Nevertheless, this gene, which is highly conserved between human and mouse (~85% nucleotide identity), is likely to be of interest in deciphering the relationship of the Siglec family to other Ig-like molecules and perhaps the origin of the Siglec gene family.
7 T. Angata and A. Varki, unpublished.
8 T. Angata and A. Varki, unpublished.
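Several of the caveats above rest on percent sequence identity between genomic segments. The text does not state how these values were computed, so the following is only a minimal sketch of one common convention: identity is counted over alignment columns in which neither sequence has a gap. The function name `percent_identity` and the short sequences are hypothetical and serve for illustration only; they are not Siglec sequences.

```python
# Minimal sketch: percent identity between two pre-aligned sequences.
# ASSUMPTION: gap-containing columns are excluded from the denominator;
# other conventions (e.g. dividing by alignment length) give different values.

def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of ungapped alignment columns with identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x.upper() == y.upper():
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Hypothetical toy sequences (not Siglec data): one mismatch in ten columns.
print(percent_identity("ACGTACGTAC", "ACGTTCGTAC"))  # 90.0
# A gapped column is ignored, leaving five identical columns.
print(percent_identity("ACG-TA", "ACGTTA"))  # 100.0
```

Because the choice of denominator affects the result, identity values obtained with different conventions are not directly comparable.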