The insulin-like growth factor 2 gene and locus in nonmammalian vertebrates: Organizational simplicity with duplication but limited divergence in fish

The small, secreted peptide, insulin-like growth factor 2 (IGF2), is essential for fetal and prenatal growth in humans and other mammals. Human IGF2 and mouse Igf2 genes are located within a conserved linkage group and are regulated by parental imprinting, with IGF2/Igf2 being expressed from the paternally derived chromosome, and H19 from the maternal chromosome. Here, data retrieved from genomic and gene expression repositories were used to examine the Igf2 gene and locus in 8 terrestrial vertebrates, 11 ray-finned fish, and 1 lobe-finned fish representing >500 million years of evolutionary diversification. The analysis revealed that vertebrate Igf2 genes are simpler than their mammalian counterparts, having fewer exons and lacking multiple gene promoters. Igf2 genes are conserved among these species, especially in protein-coding regions, and IGF2 proteins also are conserved, although less so in fish than in terrestrial vertebrates. The Igf2 locus in terrestrial vertebrates shares additional genes with its mammalian counterparts, including tyrosine hydroxylase (Th), insulin (Ins), mitochondrial ribosomal protein L23 (Mrpl23), and troponin T3, fast skeletal type (Tnnt3), and both Th and Mrpl23 are present in the Igf2 locus in fish. Taken together, these observations support the idea that a recognizable Igf2 was present in the earliest vertebrate ancestors, but that other features developed and diversified in the gene and locus with speciation, especially in mammals. This study also highlights the need for correcting inaccuracies in genome databases to maximize our ability to accurately assess contributions of individual genes and multigene families toward evolution, physiology, and disease.

Recent advances in genomics and genetics in multiple species now provide unprecedented opportunities for gaining novel insights into comparative physiology, evolution, and even disease predisposition (28-30) through evaluation of information found in public genomic and gene expression databases (31). As an example, examination of the IGF2/Igf2-H19 locus in different mammals has revealed extensive complexity yet remarkable similarity in individual gene structures, in locus organization, and in gene regulation patterns. Human IGF2 consists of 10 exons and 5 promoters, as do several other primate IGF2 genes (3,18,21,22,32), whereas in the mouse the Igf2 gene encodes 8 exons and 4 promoters (33-35). H19 also varies among mammals. Human H19 has 6 exons and 2 promoters and uses alternative transcription start sites, exon skipping, and differential RNA splicing within exons to generate multiple transcripts (18). Several other primates also have similar regulatory mechanisms for H19 (18), but these same processes are not present in other mammalian species, in which just a single promoter has been identified (20,36). Collectively, these results demonstrate that some common components responsible for controlling IGF2/Igf2 gene expression and IGF2 function appear to have been present in the earliest ancestors of extant mammals, but because there is significant variability in H19 gene structure, in the ICR, and in transcriptional enhancers, other regulatory elements appear to have developed during species diversification.
The focus of this study is on the Igf2 gene and locus in nonmammalian vertebrates, where available information is far less extensive than in mammals (37-40). Results based on the combinatorial analysis of public genomic and gene expression databases reveal remarkable conservation of overall locus organization and similarity of Igf2 exons and IGF2 proteins in >500 Myr of evolutionary diversification, and they support the idea that the Igf2 gene and its locus are phylogenetically ancient in vertebrates.

Characterizing the chicken Igf2 gene
For this analysis, chicken Igf2 was selected as the reference gene for terrestrial vertebrates, primarily because it has been studied more extensively than the Igf2 genes of the other species found in the Ensembl and UCSC Genome browsers. According to a single peer-reviewed publication (37) and information found in both genome databases as of August 2018, the single-copy chicken Igf2 gene consists of 3 exons and spans 8343 bp of genomic DNA on chromosome 5 (Fig. 1A). The 5′ and 3′ ends of the gene were not defined in any of these resources, and no promoter was characterized. In fact, in contrast to normal eukaryotic gene structure (41, 42), presumptive exon 1 (exon 1′ in Fig. 1A) began with the ATG codon of the IGF2 protein precursor and thus lacked an identified 5′ UTR (Fig. 1A). Based on these results, it was clear that the annotation of the chicken Igf2 gene was incomplete.

Figure 1 (legend, beginning not recovered). ... '2', and '3', and introns and flanking DNA as horizontal lines. A scale bar is indicated. B, diagram of newly characterized chicken Igf2 exon 1 and gene expression data from hepatic RNA-Seq libraries from the SRA NCBI, using 60-bp genomic segments a-i as probes. The DNA sequence below the graph depicts the putative 5′ end of exon 1, with locations of the 5′ ends of the longest RNA-Seq clones indicated by arrows (the size is proportional to the number of clones identified). A possible TATA box is underlined. C, DNA sequence of the putative 3′ ends of Igf2 exon 4. Potential polyadenylation signals are underlined and vertical arrows denote possible 3′ ends of Igf2 transcripts. D, structure of the chicken Igf2 gene, after mapping with cDNA XM_015286525 from the NCBI nucleotide data resource and after the other analyses presented in B and C. Labeling is as in A. E, diagram of chicken IGF2 protein precursors, illustrating the derivation of each segment from different Igf2 exons. Mature 68-amino acid (AA) IGF2 is in blue; parts of the signal peptide are in black, and the E peptide is in red.

Vertebrate Igf2 gene organization and expression
The NCBI nucleotide database contains two different experimentally determined chicken Igf2 cDNAs. The longer one, XM_015286525, consists of 3369 nucleotides and includes both 5′ and 3′ UTRs, and the shorter one contains primarily coding information. By using the longer cDNA sequence to search the chicken genome, a potential new exon was found within the Igf2 locus 5′ to the presumptive exon 1 noted above. This new exon contained both coding and noncoding DNA. Mapping with the chicken Igf2 cDNA also added 9 bp to the 5′ end of Ensembl-defined exon 1 and identified a potential splice donor site adjacent to these extra nucleotides (Fig. S1). Subsequent analyses of Igf2 gene expression, in which adjacent 60-bp segments from the new 5′ exon were used to query chicken liver RNA-Seq libraries in the SRA NCBI database, identified several potential additional features of the chicken Igf2 gene. Based on these new results, presumptive exon 1 appeared to be ~568 bp in length and consisted of 108 bp of coding DNA and ~460 bp of 5′ UTR (Fig. 1B). Moreover, a potential TATA box, which helps position RNA polymerase II at the start of transcription (43,44), was identified 29-31 bp 5′ to the ends of the longest Igf2 transcripts mapped with RNA-Seq libraries (Fig. 1B), suggesting that the start of chicken Igf2 gene transcription may be at this location. However, examination of the region further 5′ using the Promoter 2.0, CNN Promoter, and UC Berkeley Neural Network Promoter prediction software did not identify many typical components of vertebrate proximal promoters. Thus, although these data extend general understanding of the architecture of the chicken Igf2 gene, neither the beginning of the gene nor its promoter has been fully established yet.
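The probe-tiling step described above can be illustrated with a minimal sketch. It only shows how a genomic region might be cut into adjacent 60-bp segments labeled a, b, c, and so on, as in Fig. 1B; the function name `tile_probes` and the decision to drop a short trailing piece are assumptions for illustration, and the actual queries against SRA libraries are not reproduced here.

```python
def tile_probes(region: str, size: int = 60) -> list[tuple[str, str]]:
    """Split a genomic region into adjacent, non-overlapping probes.

    Returns (label, sequence) pairs labeled a, b, c, ... in 5'-to-3' order.
    A trailing piece shorter than `size` is dropped (an assumption; the
    source does not say how partial segments were handled).
    """
    region = region.upper()
    n_full = len(region) // size
    if n_full > 26:
        raise ValueError("single-letter labels cover at most 26 probes")
    probes = []
    for i in range(n_full):
        label = chr(ord("a") + i)  # a, b, c, ...
        probes.append((label, region[i * size:(i + 1) * size]))
    return probes


if __name__ == "__main__":
    # Hypothetical 160-bp region, not real chicken sequence.
    demo_region = "ACGT" * 40
    for label, seq in tile_probes(demo_region):
        print(label, len(seq), seq[:10] + "...")
```

Each probe would then be used as a separate query, so that read counts per segment can be compared along the exon.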
Genome mapping with the Igf2 cDNA also extended the 3′ end of the last chicken Igf2 exon and identified two potential 3′ ends (Fig. 1, C and D). These two regions, which are separated by ~535 bp, each have typical characteristics of polyadenylation sites, including an "AATAAA" poly(A) recognition sequence and a putative poly(A) addition site 17 or 21 bp 3′ to this element (Fig. 1C) (45,46). Taken together, the results described above indicate that the chicken Igf2 gene spans at least 16,499 bp on chromosome 5, contains at least 4 exons and 3 introns, and potentially encodes two protein precursors, but a single mature IGF2 (Fig. 1, D and E, and Table 1).
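As a rough illustration of the kind of scan involved, the sketch below locates "AATAAA" hexamers in a sequence and reports candidate poly(A) addition sites 17 or 21 bases downstream. The function name and the convention of measuring the offset from the last base of the hexamer are assumptions; the source does not state how the distance was counted.

```python
import re


def find_polya_signals(seq: str, offsets: tuple[int, ...] = (17, 21)) -> list[dict]:
    """Locate AATAAA hexamers and candidate poly(A) addition sites.

    Positions are 0-based. A candidate site is `offset` bases 3' to the
    last base of the hexamer (assumed convention) and is reported only
    if it falls inside the sequence.
    """
    seq = seq.upper()
    hits = []
    # The lookahead lets overlapping hexamers be reported as separate hits.
    for match in re.finditer(r"(?=AATAAA)", seq):
        start = match.start()
        last_base = start + 5
        sites = [last_base + off for off in offsets if last_base + off < len(seq)]
        hits.append({"signal_start": start, "candidate_sites": sites})
    return hits


if __name__ == "__main__":
    # Hypothetical sequence, not taken from the chicken genome.
    demo = "GG" + "AATAAA" + "C" * 25
    print(find_polya_signals(demo))
```

A scan like this only flags candidates; in the study, the 3′ ends were supported by cDNA and RNA-Seq mapping.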

Characterizing Igf2 genes in terrestrial vertebrates
By using as genomic database queries the four chicken Igf2 exons and species-homologous cDNAs found in the NCBI nucleotide database, Igf2 also appeared to be a 4-exon gene in duck, zebra finch, flycatcher, and Chinese softshell turtle, a 3-exon gene in Anole lizard, and a 5-exon gene in frog. It also probably is a 4-exon gene in turkey, although the first exon, which is over 99% identical to the chicken homologue, could not be mapped to the turkey genome, most likely because of poor DNA sequence quality (Fig. 2, Tables 1 and 2). Moreover, in several of the species examined, the annotated genomic data were as incomplete as those for chicken Igf2 (e.g. three exons and no 5′ or 3′ UTRs in turkey and Anole lizard and three exons in flycatcher) or appeared to be unlikely (five proposed exons in duck, including two of 95 and 31 bp separated by a 34-bp intron). When all of the newly identified and mapped information was evaluated, these vertebrate Igf2 genes appeared to be far simpler than their mammalian homologues, which have up to 10 exons, including several noncoding exons, and up to 5 promoters (human and other primate IGF2 genes) (18). A possible exception to this lesser complexity is the frog, Xenopus tropicalis, which appeared to have two 5′ Igf2 exons: one, termed exon 1, was located more than 27 kb 5′ to exon 2 and lacks coding potential; the other, termed exon 1a, was ~15 kb from exon 2 and contains an ORF (see Fig. 2, Table 1, and below).
DNA sequence identity with chicken Igf2 was similar for all four exons in all species examined (88-95% for exon 2, 86-98% for exon 3, 84-95% for exon 4, and 89-100% for exon 1, although in the latter case, there was no match in three species; Table 2). Nucleotide similarity among these vertebrates was more variable in the region located 5′ to chicken Igf2 exon 1, and it ranged from 87% in turkey to 97% in duck, was minimal in zebra finch and flycatcher, and was not evident in turtle, lizard, or frog. Collectively, these latter observations support other evidence noted above that this DNA segment may not represent the proximal Igf2 gene promoter in any of these species.
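Percent identity values such as those above depend on how the alignment is scored. The sketch below shows one common convention, applied to two sequences that have already been aligned: columns where both sequences have a gap are ignored, and a gap opposite a base counts as a mismatch. This convention and the function name are assumptions; the source does not specify how identity was calculated.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    '-' marks a gap. Columns that are gaps in both sequences are skipped;
    a gap opposite a residue is counted as a mismatch (assumed convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns in alignment")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Hypothetical 10-column alignment with one mismatch.
    print(percent_identity("ACGTACGTAC", "ACGTACGTAA"))
```

The same function works for aligned protein sequences, since it compares characters column by column.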

Igf2 gene expression in terrestrial vertebrates
Analysis of information in the SRA NCBI database demonstrated that Igf2 mRNA was expressed in all species in which data were available (Fig. 3). Results from evaluating RNA-Seq experiments for Igf2 transcripts containing exon 2 are depicted for six species from liver, kidney, spleen, and skeletal muscle (Fig. 3A). Further examination of Igf2 gene expression information for frog revealed marked variability in apparent exon usage. Transcripts containing exon 1a predominated when RNA-Seq libraries from either male or female liver were interrogated, as there were 50-100 times more reads than for exon 1 (Fig. 3B). Moreover, no transcripts were identified that contained parts of both exons; by contrast, mRNAs containing exons 1 and 2 or exons 1a and 2 were both found, although the latter predominated (Fig. 3C). Mapping studies using the same RNA-Seq libraries from liver also demonstrated that frog exon 1 extended at least 26 nucleotides further in the 5′ direction, and that exon 1a was at least 93 bp longer than recorded in the Ensembl genome browser. Collectively, these results suggest that alternative RNA splicing leads to several classes of frog Igf2 mRNAs with distinct 5′ ends.
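The exon-junction probes used for the frog analysis (the 3′ 30 bp of one exon joined to the 5′ 30 bp of the next, per the Fig. 3 legend) can be sketched as follows. The read-counting helper is a deliberately simplified stand-in that uses exact, forward-strand substring matching on an in-memory list of reads; it is an assumption for illustration and not the search method applied to the SRA libraries.

```python
def junction_probe(upstream_exon: str, downstream_exon: str, flank: int = 30) -> str:
    """Join the last `flank` bases of one exon to the first `flank` of the next."""
    if len(upstream_exon) < flank or len(downstream_exon) < flank:
        raise ValueError("each exon must be at least `flank` bases long")
    return (upstream_exon[-flank:] + downstream_exon[:flank]).upper()


def count_reads_with_probe(reads: list[str], probe: str) -> int:
    """Count reads containing the probe exactly (toy model, forward strand only)."""
    probe = probe.upper()
    return sum(1 for read in reads if probe in read.upper())


if __name__ == "__main__":
    # Hypothetical exon sequences, chosen only to make the junction visible.
    exon_1 = "A" * 40 + "C" * 30
    exon_2 = "G" * 30 + "T" * 40
    probe = junction_probe(exon_1, exon_2)
    reads = ["TT" + probe + "AA", "C" * 60, probe.lower()]
    print(len(probe), count_reads_with_probe(reads, probe))
```

Because a junction probe spans the splice site, a matching read supports a transcript in which those two exons are joined directly.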

Characterizing Igf2 genes in fish
Zebrafish was selected initially as the index species for studying Igf2 genes in fish, as it has been more extensively examined than other fish species found in the Ensembl or UCSC Genome Browsers. Based on the information in these databases as of August 2018, the zebrafish genome has two Igf2 genes, Igf2a on chromosome 7, containing four exons within 5984 bp of genomic DNA (Fig. 4A and Table 3), and Igf2b on chromosome 25, also consisting of four exons and spanning 7506 bp (Fig. 4B and Table 3). Examination of these genes and their corresponding cDNAs obtained from the NCBI nucleotide database showed that the 5′ end of Igf2b exon 1 matched the 5′ end of the longest cDNA (AF250289), and the 3′ end of exon 4 nearly matched its 3′ end (the last 3 bp are TCA in the gene and GCA in the cDNA). A similarly high degree of DNA sequence identity was found for Igf2a and cDNA NM_131433, which differed only within the first four nucleotides at the 5′ end of exon 1. Thus, the annotation of both genes appears to be accurate, unlike the situation with chicken Igf2 (Fig. 1). Moreover, both zebrafish Igf2a and Igf2b are structurally similar to chicken Igf2 (compare Figs. 4, A and B, with 1D). However, even though expression of both genes has been demonstrated during different zebrafish developmental stages, and in adult tissues (47,48), no gene promoters have been characterized to date, and no studies have been reported on transcriptional control for either Igf2a or Igf2b. Searches using Igf2a exons as queries revealed just short segments of similarity in only seven other fish genomes. Slightly longer matches were noted with Igf2b, although these were found primarily within noncoding portions of exon 4 in 10 species. Searches using chicken Igf2 exons were similarly uninformative. Of note, a fairly low level of DNA sequence identity with other fish had been observed for the zebrafish Igf1 gene and prompted using tetraodon Igf1 exons for mapping this gene in other species (49). The same strategy was subsequently employed here (see below).
In contrast to the two zebrafish Igf2s, which are well annotated, the single tetraodon Igf2 gene has been poorly characterized in Ensembl and in the UCSC Genome Browser. Like zebrafish Igf2a and Igf2b, tetraodon Igf2 was reported to consist of four exons, but unlike the former, it appeared to lack identifiable 5′ or 3′ UTRs (Fig. 5A). As there were no tetraodon Igf2 cDNAs in the NCBI nucleotide database, the alternative approach used for chicken Igf2 was employed to map the beginning and end of the gene. Adjacent 60-bp DNA segments found within and 5′ to presumptive tetraodon exon 1 were used to query the RNA-Seq library, ERX1054374, which was derived from embryo transcripts at 24 h post-fertilization. Results showed that this exon extended for approximately an additional 126 bp in the 5′ direction (Fig. 5B). Moreover, as seen for chicken Igf2 (Fig. 1B), a potential TATA box, which helps position RNA polymerase II at the start of transcription (43,44), was identified 26 nucleotides 5′ to the end of the longest Igf2 transcript found in this RNA-Seq library (Fig. 5B).

Figure 3 (legend, beginning not recovered). A, levels of Igf2 transcripts were determined in liver, kidney, spleen, and skeletal muscle by querying RNA-Seq libraries using 60-bp genomic DNA segments from a region equivalent in each species to the same part of chicken Igf2 exon 2. Results are plotted as the number of sequence reads per species (range = 0-100). Data were obtained from chicken, turkey, duck, zebra finch, Anole lizard, and frog but were not available for flycatcher or Chinese softshell turtle. B, comparison of Igf2 gene expression in hepatic RNA-Seq libraries from adult male and female frogs, using as probes 60-bp fragments of exon 1, exon 1a, and exon 2. C, comparison of Igf2 gene expression in hepatic and kidney RNA-Seq libraries from adult male frog, using as probes 60-bp fragments derived from exons 1 and 2 (3′ 30 bp from exon 1 plus 5′ 30 bp from exon 2) or exons 1a and 2 (3′ 30 bp from exon 1a plus 5′ 30 bp from exon 2). A-C, the libraries are listed under "Experimental procedures."
An analogous strategy was used to map the 3′ end of presumptive exon 4, and this led to identification of a 3′ UTR of ~3317 bp, and a total exon length of 3557 bp, which included near its 3′ end an "AATAAA" presumptive poly(A) recognition sequence and a putative poly(A) addition site (Fig. 5C) (45,46). Taken together, the results described above, defining both 5′ and 3′ ends of the tetraodon Igf2 gene, indicate that it spans 7920 bp on chromosome 13 and that it encodes a single protein (Fig. 5, D and E, and Table 3).
Based on the success of these mapping experiments with chicken and tetraodon Igf2, a similar approach was used in seven other fish in which gene annotation was poor and in which RNA-Seq libraries were available: fugu, stickleback, cod, tilapia, platyfish, spotted gar, and medaka. In six of these fish, the genomic data were nearly as incomplete as those for tetraodon Igf2 (no 5′ or 3′ UTRs in cod, fugu, and stickleback, and no 5′ UTR in spotted gar, tilapia, and platyfish). Screening of RNA-Seq libraries led to the identification of presumptive beginnings and ends for many of these genes (Figs. 6 and 7). For medaka, in which only two Igf2 exons had been identified in the genome, most likely because of poor DNA sequence quality, a combination of genomic searches with a medaka Igf2 cDNA and mapping experiments using a liver-derived RNA-Seq library identified a presumptive exon 1 (but not an exon 2) and extended exons 1 and 4 to their presumptive 5′ and 3′ ends, respectively (Figs. 6 and 7).

Additional genomic database searches with tetraodon Igf2 exons, coupled with information from Ensembl and UCSC Genome browsers, and the mapping data illustrated in Figs. 6 and 7, led to the conclusion that Igf2 was a 4-exon gene in fugu, cave fish (both Igf2a and Igf2b), tilapia, Amazon molly, spotted gar, and coelacanth, as well as in zebrafish (Igf2a and Igf2b), and a 5-exon gene in platyfish (Fig. 8). In stickleback, five Igf2 exons were predicted in Ensembl, with a 4-nucleotide intron separating the last two exons. As an intron this small is not feasible (50,51), Ensembl's Igf2 exons 4 and 5 were combined here into a single Igf2 exon 4 ( Fig. 8 and Table 4). There also was no identifiable Igf2 gene in lamprey, even though an IGF2 protein has been characterized in this species (see below). When all of this newly characterized information was evaluated, fish Igf2 genes, like those of terrestrial vertebrates, appeared to be organizationally simpler than their mammalian homologues (Fig. 8) (18). In fact, except for platyfish and frog, with 5 exons, and Anole lizard and possibly medaka, with 3 exons, all the other vertebrate Igf2 genes in the Ensembl or UCSC Genome Browsers appear to be composed of 4 exons and 3 introns (Figs. 2 and 8).
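The stickleback correction, in which two annotated exons separated by a 4-nucleotide intron were combined into one, can be expressed as a simple rule over exon coordinates. The sketch below is illustrative only: the function name, the coordinates, and the `min_intron` threshold of 20 nt are hypothetical, since the source states only that a 4-nucleotide intron is not feasible and does not define a cutoff.

```python
def merge_small_introns(
    exons: list[tuple[int, int]], min_intron: int = 20
) -> list[tuple[int, int]]:
    """Merge neighboring exons separated by an implausibly short intron.

    `exons` holds (start, end) pairs in 1-based inclusive coordinates on
    one strand. `min_intron` is a hypothetical threshold, not a value
    taken from the source.
    """
    merged: list[tuple[int, int]] = []
    for start, end in sorted(exons):
        if merged:
            prev_start, prev_end = merged[-1]
            intron_len = start - prev_end - 1
            if intron_len < min_intron:
                # Treat the short gap as part of a single continuous exon.
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged


if __name__ == "__main__":
    # Hypothetical coordinates with a 4-nt gap between the last two exons.
    annotated = [(100, 200), (1000, 1100), (1105, 1500)]
    print(merge_small_introns(annotated))
```

A rule like this is best used to flag annotations for manual review rather than to rewrite them automatically.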
In addition to structural similarity, DNA sequence identity with tetraodon Igf2 was relatively high in all fish species examined. This ranged from 84 to 92% for exon 2, 83 to 96% for exon 3, 87 to 94% for exon 4, and 86 to 94% for exon 1, although in three species there was no match for the latter exon (Table 4).
In terrestrial vertebrate Igf2 genes, an intron separates the equivalents of chicken exons 2 and 3 after the first nucleotide of codon 29 of mature IGF2. This is the same codon and codon position, and the identical encoded amino acid (serine), found for the intron separating homologous human exons 8 and 9 and mouse exons 6 and 7 (3,4). The identical exon-intron-exon junctions were observed in all terrestrial vertebrates and in all fish Igf2 genes, except for medaka, in which no exon 2 could be identified, and platyfish, in which an intron interrupted codon 1 of mature IGF2 (threonine) after the first nucleotide.
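The intron positions described here follow from simple arithmetic on the number of coding nucleotides that precede the intron. The sketch below, with hypothetical function names, converts between that count and a (codon number, phase) pair, where phase 1 means the intron falls after the first nucleotide of the codon. Counting is relative to the first codon of mature IGF2.

```python
def intron_position(coding_nt_before_intron: int) -> tuple[int, int]:
    """Return (codon_number, phase) for an intron after n coding nucleotides.

    codon_number is 1-based. phase 0 = between codons (the number given is
    the codon that follows), 1 = after the first nucleotide of the codon,
    2 = after the second nucleotide.
    """
    if coding_nt_before_intron < 0:
        raise ValueError("nucleotide count must be non-negative")
    codon_number = coding_nt_before_intron // 3 + 1
    phase = coding_nt_before_intron % 3
    return codon_number, phase


def coding_nt_before(codon_number: int, phase: int) -> int:
    """Inverse of intron_position: coding nucleotides upstream of the intron."""
    if codon_number < 1 or phase not in (0, 1, 2):
        raise ValueError("codon_number must be >= 1 and phase must be 0, 1, or 2")
    return (codon_number - 1) * 3 + phase


if __name__ == "__main__":
    # Intron after the first nucleotide of codon 29 of mature IGF2.
    n = coding_nt_before(29, 1)
    print(n, intron_position(n))
```

Under this convention, the conserved junction corresponds to 85 coding nucleotides of mature IGF2 upstream of the intron, and the platyfish junction to a single nucleotide.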

Igf2 gene expression in fish
Further analysis of RNA-Seq libraries in the SRA NCBI database demonstrated that Igf2 mRNA accumulated in a variety of different organs, tissues, and developmental stages in different fish ( Fig. 9 and data not shown). Results for Igf2 transcripts containing exon 2 (exon 3 in medaka) are pictured for seven species from liver and for nine from skeletal muscle (Fig. 9). Of note, both Igf2a and Igf2b genes are expressed in liver and in muscle in cave fish and in zebrafish, and fugu Igf2 mRNA is detected at different levels in slow versus fast twitch skeletal muscle (Fig. 9B).

IGF2 protein sequences in nonmammalian vertebrates
The 68-amino acid chicken IGF2 protein resembles the 67-residue human IGF2, as it consists of four domains, termed B, C, A, and D (Fig. 10A) (7). This protein appears to be found within two types of precursors with different N-terminal signal peptides, depending on whether mRNA translation begins at the first or second AUG codon (Figs. 1E and  10A). Among the other species studied here, mature IGF2 was identical to the chicken protein in turkey, duck, and flycatcher; a single amino acid substitution was seen in zebra finch (Ile 39 to Phe), four changes were found in turtle (Arg 30 to Ser, Ile 39 to Phe, Lys 63 to Arg, and Ser 64 to Thr), and multiple differences were detected in the other species, including human ( Fig. 10A and Table 5).
The 70-residue tetraodon IGF2 protein also consists of B, C, A, and D domains (Fig. 10B). Among the other fish studied here, only fugu Igf2 encoded a mature IGF2 identical to the tetraodon protein. In contrast, multiple amino acid substitutions, codon insertions, and/or deletions were found in the other species (range of identity: 58 -91%, Fig. 10B and Table  6). A phylogenetic comparison demonstrated a greater similarity of mature IGF2 among terrestrial vertebrates and coelacanth than among fish, and it also showed clustering of protein sequences among different groups of fish (e.g. cod, stickleback, Amazon molly, tetraodon, fugu, cave fish IGF2b, and zebrafish IGF2b; Fig. 10C).
The two potential chicken IGF2 signal peptides have either 23 or 62 residues. The shorter segment starts with the first methionine codon in exon 2. In contrast, the longer signal sequence is encoded by presumptive exons 1 and 2 (36 and 26 codons, respectively; Figs. 1E and 11, A and B, and Table 5). A shorter signal peptide was found in each nonmammalian terrestrial vertebrate analyzed here. It is 23 amino acids in length and differs in all species from the chicken IGF2 signal peptide, with differences ranging from a single amino acid substitution (turkey) to multiple alterations (Fig. 11A and Table 5). A longer signal sequence also could be detected in duck, where it is incomplete, and in turtle and frog, but not in other birds (Fig. 11B and Table 5). An even longer signal sequence of 80 amino acids is predicted for human IGF2 (17), but its similarity with the chicken signal peptide is negligible (Table 5).
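The relationship between alternative start codons and signal peptide length can be illustrated with a short sketch. It lists in-frame ATG codons in a coding sequence and, given the codon at which the mature peptide begins, reports the signal peptide length implied by each start. The function names are hypothetical and the demonstration sequence is synthetic, built only to reproduce the 62- and 23-residue lengths described for chicken.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def in_frame_starts(cds: str) -> list[int]:
    """0-based codon indices of in-frame ATG codons, up to the first stop codon."""
    cds = cds.upper().replace("U", "T")
    starts = []
    for i in range(0, len(cds) - 2, 3):
        codon = cds[i:i + 3]
        if codon in STOP_CODONS:
            break
        if codon == "ATG":
            starts.append(i // 3)
    return starts


def signal_peptide_lengths(cds: str, mature_start_codon: int) -> list[int]:
    """Signal peptide length (in residues) for each ATG upstream of the mature peptide."""
    return [
        mature_start_codon - start
        for start in in_frame_starts(cds)
        if start < mature_start_codon
    ]


if __name__ == "__main__":
    # Synthetic CDS: ATG, 38 filler codons, a second ATG, 22 filler codons,
    # then 3 codons standing in for the start of the mature peptide.
    cds = "ATG" + "GCT" * 38 + "ATG" + "GCT" * 22 + "GCT" * 3
    print(signal_peptide_lengths(cds, mature_start_codon=62))
```

Such a listing shows only which initiation sites are possible; which methionine is actually used would need to be determined experimentally.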
The IGF2 signal peptide in fish is of an intermediate length between short and long chicken signal sequences, ranging from 36 to 53 residues in different species, with amino acid similarity being substantially lower than observed for mature IGF2 (Fig. 11C and Table 6). Of note, nearly all of these signal sequences are predicted to have internal in-frame methionine residues (Fig. 11C). Because there are no data on the biosynthesis of IGF2 precursors in any nonmammalian vertebrate species, it is not known how effectively mature IGF2 could be generated from a protein precursor with either short or long signal peptides nor which methionine is the initiating residue for protein translation (52, 53).

Figure 5 (legend, beginning not recovered). ... as probes. The DNA sequence below the bar graph illustrates the putative 5′ end of exon 1, with locations of the 5′ ends of the longest RNA-Seq clones indicated by arrows (arrow size is proportional to the number of clones identified). A possible TATA box is underlined. C, characterizing the putative 3′ end of tetraodon Igf2 exon 4 using data from RNA-Seq library, ERX1054374, and 60-bp genomic segments a-g as probes. A possible polyadenylation signal is underlined in the DNA sequence below the graph, and the vertical arrow denotes a potential 3′ end of Igf2 mRNAs with its chromosomal coordinate. D, structure of the tetraodon Igf2 gene based on the analyses shown in B and C. Labeling is as in A. E, diagram of the tetraodon IGF2 protein precursor, with the derivation of each segment from different Igf2 exons indicated. Mature 70-amino acid IGF2 is in blue; the signal peptide is in black; and the E peptide is in red.

The E peptide at the C-terminal end of the IGF2 protein progenitor consists of 89 amino acids in human and mouse (4, 17, 32, 54) but is 96 residues in chicken (Fig. 12A and Table 5). Except for turkey, in which the IGF2 E region was identical to the chicken segment, it varied in other terrestrial vertebrates in both amino acid sequence and length (e.g. flycatcher, 90 amino acids, 85% identity with chicken; Anole lizard, 95 residues, 67% identity; and frog, 94 amino acids, 43% identity; Fig. 12A and Table 5). The E peptide comprises 98 residues in tetraodon (Fig. 12B and Table 6), and except for fugu, in which there were only 4 amino acid substitutions versus tetraodon (96% identity), it was variable in other fish species in both length and amino acid sequence similarity (e.g. stickleback, 97 amino acids, 90% identity with tetraodon; platyfish, 103 residues, 59% identity; and spotted gar, 97 residues, 73% identity; Fig. 12B and Table 6). The human E peptide shares little similarity with E domains of either terrestrial vertebrates or fish (34% identity with chicken (Table 5) and 24% with tetraodon (Table 6)). The precise functions of this segment of IGF2 have not been established in any species, although it is present in all mammalian and nonmammalian vertebrates that synthesize IGF2 (4, 32, 54).
Fig. 13 depicts maps of the Igf2 locus for the terrestrial vertebrates analyzed here. The locus exhibits several similarities in all of these species in the overall topology of the five genes that are present, Th, Ins, Igf2, Mrpl23, and Tnnt3. In birds, the organization of these genes is congruent, with Th, Ins, and Igf2 defining a cluster of three genes in the same transcriptional direction, separated by 213-236 kb from the other two genes, Mrpl23 and Tnnt3, which are in the opposite transcriptional orientation (Fig. 13).
In other terrestrial vertebrates, the relative transcriptional orientation is identical to that in birds, but the distances between individual genes and the two gene clusters are substantially larger, although in frog only 113 kb separates Igf2 from Mrpl23. Of note, the same five genes are found in the same relative orientation within the human IGF2 locus, although intergenic distances are far shorter than in birds, reptiles, or amphibians (Fig. 13). Also, H19, which expresses a long noncoding RNA (20, 55) and is not found here in nonmammalian vertebrates, is present in humans and in other mammals, along with an imprinting control region (ICR), which regulates reciprocal parental chromosome-of-origin-specific expression of IGF2 and H19 in humans and in other mammalian species (23-26, 36).
The Igf2 locus in fish appears to be simpler than in terrestrial vertebrates, as only two other genes, Th and Mrpl23, are present (Fig. 14). The location and orientation of these genes in the locus are highly similar among the 10 teleost fish studied here but are less so in the nonteleost, spotted gar, in which Th is found 5′ to Mrpl23 (Fig. 14), indicating that an apparent chromosomal rearrangement had occurred after the evolutionary separation of teleosts and nonteleosts. These results also support the idea that Igf2b is likely to be the ancestral Igf2 gene, based on it being embedded in a locus with shared features that also are found in birds and mammals (Figs. 13 and 14), and on the fact that cave fish and zebrafish Igf2a loci lack these other genes (data not shown). Moreover, although in medaka and cod Mrpl23 is apparently not found within this locus, this could reflect the more incomplete quality of their genome assemblies, because it is present in both species (data not shown). A similar quality control problem may apply to the coelacanth genome, in which Mrpl23 maps near Tnnt3 (data not shown), as is observed in both terrestrial vertebrates (Fig. 13) and mammals (18), but could not be localized near Igf2 (Fig. 14). Taken together, these observations demonstrate that several features of the Igf2 locus have been retained during more than 500 Myr of vertebrate and mammalian speciation, and thus they argue that the Igf2 gene and locus are phylogenetically ancient.

Igf2 genes in vertebrates
The goals of the studies presented here were to understand the organization and patterns of expression of Igf2 genes in nonmammalian vertebrates by mining the resources of public databases and to place these findings in an evolutionary context with mammalian IGF2/Igf2 homologues and the mammalian IGF2/Igf2-H19 locus. In mammals, IGF2 is involved principally in mediating prenatal growth (8), but it also functions in other aspects of physiology and pathophysiology throughout life (9 -16). Mammalian IGF2/Igf2 genes are complicated and reside within a complex multigene locus (17,18,36,56). In humans and in mice, multiple gene promoters (5 for human and 4 for mouse) control production of many different types of IGF2/Igf2 mRNAs that are translated and processed into a single mature 67-amino acid IGF2 (4,32,54). In both species, IGF2/Igf2 gene promoter activity is regulated by developmental and tissue-specific mechanisms that in turn are controlled by paternal chromosome-of-origin parental imprinting that is reciprocal to the expression of H19 (25,26,57,58). Similar processes are presumably operative in other mammalian species, although they have not been characterized as fully as in mice and humans (36,56).
The genomic and gene expression data identified and analyzed here show that Igf2 genes are far simpler in nonmammalian vertebrates than in mammals (Figs. 2 and 8) and that the locus also is simpler (Figs. 13 and 14). In most of the species described in this paper, the Igf2 gene is composed of 4 exons and 3 introns and likely has a single gene promoter, although this has not yet been established experimentally. Exceptions include frog, in which there are 5 exons and evidence for alternative RNA splicing (Figs. 2 and 3 and Table 1), platyfish, which also has 5 exons (Fig. 8 and Table 3), and possibly Anole lizard and medaka, in which a homologue of exon 1 or exon 2, respectively, could not be identified (Figs. 2 and 8 and Tables 1 and 3). Moreover, in terrestrial vertebrates and in mammals, and in most of the fish species studied here, the Igf2/IGF2 gene is present in a single copy in the genome (22). The exceptions are zebrafish and cave fish, in which there are paralogous Igf2 genes termed Igf2a and Igf2b (Fig. 8 and Table 3). This latter finding reflects the fact that in a common ancestor of extant ray-finned fish, the entire genome was duplicated ~320-350 Myr ago (59) and that this duplication was followed by rediploidization in progenitors of many modern teleost lineages (59). However, in some species, such as zebrafish, a substantial fraction of duplicated genes has been retained (60). In both zebrafish and cave fish, the paralogous genes have diverged from one another, as amino acid identities between mature IGF2a and IGF2b are 76 and 84%, respectively, in the two species, and thus are less similar to each other than the corresponding IGF2b proteins are to tetraodon IGF2 (90 and 86%, Table 6). By these criteria, and by other similarities within the locus, it is clear that in both zebrafish and cave fish Igf2b represents the descendant of the original Igf2 locus.
Figure 9. Igf2 gene expression in fish. Igf2 transcript levels were identified in liver (A) and in skeletal muscle (B) by querying RNA-Seq libraries using 60-bp genomic DNA segments from a region equivalent in each species to the same part of tetraodon Igf2 exon 2 (or * exon 3 in medaka). Results are plotted as the number of sequence reads per species (range = 0-100). Information was not available from either liver or muscle for Amazon molly, platyfish, or tetraodon or for liver for coelacanth. A and B, libraries searched are listed under "Experimental procedures."
Despite less complexity than in mammals, Igf2 genes in nonmammalian vertebrates share some common features with mammalian IGF2/Igf2. In all species studied here, except for platyfish (and medaka, which because of poor genomic sequence quality could not be evaluated), an intron splits the exons that encode the mature IGF2 protein at the identical location (these are the equivalents of exons 2 and 3 in nonmammalian vertebrates, exons 8 and 9 in human, and exons 6 and 7 in mice), interrupting these exons between the first and second nucleotides of serine codon 29 of mature IGF2 (3, 4). Conserved intron positioning also is found in Igf1 genes from mammals and nonmammalian vertebrates, as in all of these species a large intron interrupts exons encoding the mature IGF1 protein between the first and second nucleotides of codon 26 (49).
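
The shared intron position described above can also be expressed as an intron phase. The following is a minimal Python sketch and was not part of the original analysis; the function name `intron_phase` and the use of mature-peptide codon numbering are illustrative assumptions. An intron that falls between the first and second nucleotides of a codon leaves one nucleotide of that codon upstream, which corresponds to phase 1.

```python
def intron_phase(codon_number: int, nucleotides_before_intron: int) -> int:
    """Return the phase (0, 1, or 2) of an intron relative to the reading frame.

    codon_number: 1-based index of the interrupted codon (here counted from
        the first codon of the mature peptide; an illustrative assumption).
    nucleotides_before_intron: number of nucleotides of that codon (0, 1, or 2)
        that lie upstream of the intron.
    """
    if codon_number < 1:
        raise ValueError("codon_number must be >= 1")
    if nucleotides_before_intron not in (0, 1, 2):
        raise ValueError("nucleotides_before_intron must be 0, 1, or 2")
    # Coding nucleotides upstream of the intron: every complete codon before
    # this one, plus the part of this codon that precedes the intron.
    upstream_nt = 3 * (codon_number - 1) + nucleotides_before_intron
    return upstream_nt % 3


# Igf2: intron between nucleotides 1 and 2 of serine codon 29 of mature IGF2.
igf2_phase = intron_phase(29, 1)
# Igf1: intron between nucleotides 1 and 2 of codon 26 of mature IGF1.
igf1_phase = intron_phase(26, 1)
print(igf2_phase, igf1_phase)
```

Because the signal peptide is encoded by whole codons upstream of the mature peptide, numbering from the mature peptide or from the initiator methionine gives the same phase.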
The Igf2 locus in nonmammalian vertebrates also is simpler than in mammals. There is no equivalent of the H19 gene, and no apparent ICR or distal enhancers, as mapped in mammals (57), although in the absence of a functionally identified promoter, enhancers would be difficult to identify experimentally. However, several of the genes mapped to the locus in mammals also are found in nonmammalian vertebrates in the same order and transcriptional orientation, including Th, Ins, Mrpl23, and Tnnt3 in terrestrial species (Fig. 13), and Th and Mrpl23 in most fish (Fig. 14). Collectively, these results suggest that the Igf2 locus is phylogenetically old, as it is found in vertebrates separated by over 500 Myr of evolutionary diversification.

Igf2 gene regulation in vertebrates
There is minimal published information on Igf2 gene expression in nonmammalian vertebrates, with studies being limited to a few analyses of chick embryos (61-64), turkeys (40), ducks (39), zebra finch (38), some observations in zebrafish and medaka embryos (47,65), and measurements of transcripts in different organs, tissues, and cell types from zebrafish, tilapia (48,66,67), and a few other fish species (68,69). The data presented here using queries of RNA-Seq libraries from the SRA NCBI repository extend previous analyses and show that Igf2 transcripts are produced in different adult tissues in a number of nonmammalian vertebrates (Figs. 3 and 9). However, mechanisms of gene regulation are unknown, and no Igf2 gene promoter has been functionally identified to date in any of these species. The situation is potentially different for Igf1, in which conserved putative transcription factor-binding sites have been mapped to positions analogous to those characterized experimentally in mammals (49).
In mammals, genetic, epigenetic, and environmental factors all contribute to somatic growth (70,71), and also influence IGF2/Igf2 gene expression and protein production (9-12). For example, in humans, alterations in levels of IGF2 are associated with genetically determined overgrowth and undergrowth disorders, termed Beckwith-Wiedemann and Silver-Russell syndromes, respectively (11,12). It is not known whether similar growth disorders connected with IGF2 occur in nonmammalian vertebrates. However, DNA polymorphisms have been identified in chickens within the Igf2 locus that sort with somatic growth and carcass weight (72, 73), and experimental selection for body size in zebrafish has been found to be associated with changes in expression of components of the insulin-like growth factor system, including alterations in levels of Igf2 transcripts (74).

IGF2 proteins in vertebrates
Mature IGF2 in most mammals is a 67-amino acid single-chain protein consisting of domains termed B, C, A, and D that are related to the analogous parts of IGF1 (4) and also resemble the B and A chains of mature insulin and the C chain of proinsulin (7). In all terrestrial vertebrates examined here, mature IGF2 is 68 residues in length (Table 5), and the proteins are more similar to each other than are IGF2 proteins in fish, where IGF2 ranges from 66 residues (lamprey) to 71 residues (platyfish and Amazon molly), although in the majority of species it is 70 amino acids (Fig. 10C, Table 6). In terrestrial vertebrates, the A domains are nearly identical, with only a single amino acid alteration being found in lizard and frog, and B domains also are highly similar (just two differences in lizard and frog), whereas the C region is more divergent in reptiles and amphibians (Fig. 10A). In contrast, in fish, there are several amino acid changes in both A and B domains and more in the C region (Fig. 10B; exceptions are tetraodon and fugu IGF2, which are identical).
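
Pairwise comparisons of this kind reduce to counting identical residues across an alignment. The following is a minimal Python sketch under stated assumptions, not the procedure used to generate Tables 5 and 6: the function name `percent_identity` is hypothetical, the two sequences are assumed to be pre-aligned with `-` as the gap character, and identity is expressed over the residues of the query. The example sequences are made up and are not IGF2 sequences.

```python
def percent_identity(query: str, subject: str, gap: str = "-") -> float:
    """Percent identity of two pre-aligned sequences, over the query residues.

    Columns in which the query has a gap are skipped; a subject gap opposite
    a query residue counts as a mismatch (an assumption of this sketch).
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    query_positions = 0
    matches = 0
    for q, s in zip(query.upper(), subject.upper()):
        if q == gap:
            continue  # not a query residue, so not part of the denominator
        query_positions += 1
        if q == s:
            matches += 1
    if query_positions == 0:
        raise ValueError("query contains no residues")
    return 100.0 * matches / query_positions


# Made-up 10-residue sequences used only to exercise the function;
# they are NOT real IGF2 sequences.
toy_query = "ACDEFGHIKL"
toy_subject = "ACDQFGH-KL"
print(round(percent_identity(toy_query, toy_subject), 1))
```

Other denominators (alignment length, or the shorter sequence) are also in common use and give different values for gapped alignments, so the convention should be stated alongside any reported percentage.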
Some mammals, including humans, also express a 70-amino acid form of IGF2 that results from use of an alternative splice acceptor site that adds four codons instead of one to the equivalent of the 5′ end of human IGF2 exon 9, leading to an IGF2 with a longer C domain (15 residues instead of 12 (17)). This longer protein binds to the IGF1 receptor with lower affinity than the 67-residue human IGF2 (76). Although there does not appear to be alternative RNA processing in fish Igf2 genes, the IGF2 protein in these species also has a 15-residue C domain (Fig. 10B). It would be of interest to learn whether a longer C region lowers the affinity of IGF2 for the IGF1 receptor in different fish species and to determine whether a shorter C domain generally enhances this ligand-receptor interaction in mammals.
Figure 11. Alignments of vertebrate IGF2 signal sequences. A, amino acid sequences of IGF2 signal peptides from eight terrestrial vertebrates in single-letter code. Differences are shown, and identities are indicated by dots. B, amino acid sequences of longer IGF2 signal peptides. Differences are indicated, and identities are signified by dots. A dash indicates no residue. Bold red text is identical to the short signal sequence shown in A. No longer signal peptides could be detected for turkey, zebra finch, flycatcher, or Anole lizard (= none). C, amino acid sequences of IGF2 signal peptides from 12 fish species and lamprey in single-letter code. Differences are shown, and identities are indicated by dots, and a dash signifies no residue. B and C, dashes have been placed to maximize alignments.
Other features of IGF2 protein precursors are similar between nonmammalian vertebrates and mammals. In all species studied, the IGF2 progenitor contains a C-terminal extension or E peptide (Fig. 12) that is cleaved by a posttranslational proteolytic processing step (4,54). IGF1 precursors in both mammals and nonmammalian vertebrates also contain E peptides that are more divergent than other parts of the protein (49, 75), as is seen here with IGF2 (Fig. 12 and Tables 5 and 6). In many mammals (54) and in four of the terrestrial vertebrates analyzed here, Igf2 mRNAs encode two alternative signal peptides (Fig. 11C). One is of a typical length for secreted proteins, 23 amino acids in terrestrial vertebrates and 24 residues in many mammals (52-54), whereas the other is substantially longer, 53-62 amino acids in these four vertebrates (Fig. 11B and Table 5), and 80 residues in humans (18). In fish, the lone IGF2 signal sequence is of intermediate length, 36-53 amino acids, and in nearly all species it contains an internal in-frame methionine residue (Fig. 11). As there is no experimental evidence in any nonmammalian vertebrate addressing IGF2 biosynthesis, the methionine or signal peptide responsible for initiating protein translation is unknown.

Improving gene quality in genome databases
Publicly available genomic repositories contain extensive data on different genes from many animal species, yet as shown here, much of the information for Igf2 in vertebrates is incompletely or incorrectly annotated. This problem does not appear to be uncommon, as similar deficiencies have been shown for Igf1 in both mammals and nonmammalian vertebrates (49,75). It is likely that other genes in these databases also are not described accurately, which suggests that a concerted effort is needed to improve these data for the general benefit of the scientific community and to accelerate future discoveries. One way to accomplish this goal would be to query different RNA-Seq libraries for transcripts that map to portions of specific genes, as illustrated here for the potential 5′ and 3′ ends of Igf2 genes from a number of species, and for exon 1 and exon 1a in frog (Figs. 1, 3, and 5-7). A similar strategy also could be used to identify intron-exon and exon-intron junctions and to determine the relative prevalence of alternative RNA splicing (Fig. 3C).
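
The idea of querying RNA-Seq reads with a short genomic segment can be illustrated with a small, self-contained Python sketch. This is a deliberately simplified stand-in and not the method used in this study, which relied on searches of SRA NCBI libraries: here a read is counted if it shares at least one exact k-mer with the probe on either strand. The function names, the choice of k = 25, and the repeated 12-nucleotide toy probe are all assumptions made for illustration.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence; unknown bases become N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def probe_kmers(probe: str, k: int) -> set:
    """All k-mers of the probe and of its reverse complement."""
    probe = probe.upper()
    if k < 1 or k > len(probe):
        raise ValueError("k must be between 1 and the probe length")
    rc = reverse_complement(probe)
    forward = {probe[i:i + k] for i in range(len(probe) - k + 1)}
    reverse = {rc[i:i + k] for i in range(len(rc) - k + 1)}
    return forward | reverse


def count_matching_reads(reads, probe: str, k: int = 25) -> int:
    """Count reads sharing at least one exact k-mer with the probe.

    Exact k-mer sharing is a crude substitute for an alignment search such
    as BlastN; reads shorter than k can never match.
    """
    kmers = probe_kmers(probe, k)
    count = 0
    for read in reads:
        read = read.upper()
        if any(read[i:i + k] in kmers for i in range(len(read) - k + 1)):
            count += 1
    return count


# Toy 60-bp probe built from a made-up repeat; NOT a real Igf2 sequence.
probe = "ACGGTCATTGCA" * 5
reads = [
    probe[10:50],                     # fragment from the probe strand
    reverse_complement(probe)[5:45],  # fragment from the opposite strand
    "A" * 40,                         # unrelated read
    "ACGT",                           # too short to contain a 25-mer
]
print(count_matching_reads(reads, probe))
```

In practice, mismatches, sequencing errors, and reads spanning exon-exon junctions mean that an alignment-based search is preferable; the sketch only conveys how read counts per probe can be tallied.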

Final comments
Conservation of components of the Igf2 gene and locus and the similarity of IGF2 among mammals and nonmammalian vertebrates suggest that an IGF2 was present in a common vertebrate ancestor (77,78). Aspects of the biology of IGF2 have been maintained for ~500 Myr of speciation, an idea that with further investigation may lead to new insights into the comparative biology of IGF2 regulation and actions.

Database searches and analyses
ures are presented as percent identity over entire query regions, unless otherwise specified.

Experimental strategy
Naming conventions adopted here include the abbreviation "Igf2" for all genes and mRNAs except for human, for which "IGF2" is used, and "IGF2" for all proteins. An initial assessment of nonmammalian vertebrate Igf2 loci, genes, and potential transcripts within Ensembl and UCSC genome browsers revealed that most genes were simpler than human IGF2 or mouse Igf2. However, very few gene assignments appeared to have taken into account available published experimental data. For example, in chicken Igf2, the two genomic databases showed three exons, but a comparison with an Igf2 cDNA from the NCBI nucleotide data resource (XM_015286525) suggested that an additional exon existed. Also, genome databases showed that tetraodon Igf2 consisted of 4 exons, with the first exon beginning with the ATG codon for the IGF2 precursor and the last exon ending with the TGA translational stop codon, clearly demonstrating that the gene had been incompletely defined, as the annotation included no 5′ or 3′ untranslated regions. Thus, primary goals were to characterize all genes as completely as possible and then to interpret these more extensive datasets. An iterative process was developed that began with the exon assignments for all vertebrate Igf2 genes as defined in Ensembl and UCSC browsers. Depending on the species, these assignments were based on the different analytical approaches that had been used to characterize each specific genome (see Table S1). Next, the chicken Igf2 gene was characterized by a combination of steps that included mapping the gene with its cDNAs and assessing 5′ and 3′ ends by querying RNA-Seq libraries with presumptive exon fragments (Fig. 1). BlastN searches then were conducted against all other terrestrial vertebrate genome assemblies for homologous genomic regions using the chicken Igf2 gene fragments as queries. These latter results were mapped to each vertebrate Igf2 locus and were followed by secondary searches relying on cDNAs, gene components from other species, and RNA-Seq libraries.
An analogous approach was used for fish Igf2 genes. BlastN searches were conducted against all 13 genome assemblies for homologous genomic regions using segments of chicken Igf2 and zebrafish Igf2a and Igf2b genes as queries. Because limited information was obtained, subsequent BlastN searches were performed using tetraodon Igf2 exons, after the 5′ and 3′ ends of the gene had been mapped using RNA-Seq files from SRA NCBI. Results of each series of genome searches then were mapped to each fish Igf2 locus and were followed by secondary searches relying on cDNAs or gene fragments from other fish species and tertiary mapping of potential 5′ and 3′ ends by screening RNA-Seq files from SRA NCBI. Through these steps, Igf2 genes were defined more completely in most species than had been annotated in either Ensembl or UCSC genome browsers.
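
The tetraodon example described in this section, in which the annotated first exon began at the ATG codon and the last exon ended at the translational stop codon, suggests a simple screen for gene models that lack untranslated regions. The following Python sketch is illustrative only and was not part of the original workflow; the function name `flag_missing_utrs` and the toy exon sequences are hypothetical, and the exons are assumed to be supplied 5′ to 3′ on the coding strand.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def flag_missing_utrs(exons):
    """Return (missing_5_utr, missing_3_utr) for an annotated gene model.

    exons: list of exon sequences in transcript order, coding strand.
    A model is flagged when its first exon begins with ATG (no 5' UTR
    annotated) or its last exon ends with a stop codon (no 3' UTR annotated).
    This is a heuristic: a 5' UTR that happens to begin with ATG would be
    flagged incorrectly.
    """
    if not exons:
        raise ValueError("at least one exon is required")
    first = exons[0].upper()
    last = exons[-1].upper()
    missing_5_utr = first.startswith("ATG")
    missing_3_utr = last[-3:] in STOP_CODONS
    return missing_5_utr, missing_3_utr


# Made-up exon sequences for illustration; NOT real Igf2 exons.
coding_only_model = ["ATGGGAATC", "CCTGTG", "GAGTGA"]
model_with_utrs = ["GGCACTATGGGAATC", "CCTGTG", "GAGTGAAATAAA"]

print(flag_missing_utrs(coding_only_model))
print(flag_missing_utrs(model_with_utrs))
```

Gene models flagged in this way would be candidates for the kind of 5′ and 3′ end mapping with RNA-Seq data described above.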