The complex genetics of human insulin-like growth factor 2 are not reflected in public databases

Recent advances in genetics present unique opportunities for enhancing knowledge about human physiology and disease susceptibility. Understanding this information at the individual gene level is challenging and requires extracting, collating, and interpreting data from a variety of public gene repositories. Here, I illustrate this challenge by analyzing the gene for human insulin-like growth factor 2 (IGF2) through the lens of several databases. IGF2, a 67-amino acid secreted peptide, is essential for normal prenatal growth and is involved in other physiological and pathophysiological processes in humans. Surprisingly, none of the genetic databases accurately described or completely delineated human IGF2 gene structure or transcript expression, even though all relevant information could be found in the published literature. Although IGF2 shares multiple features with the mouse Igf2 gene, it has several unique properties, including transcription from five promoters. Both genes undergo parental imprinting, with IGF2/Igf2 being expressed primarily from the paternal chromosome and the adjacent H19 gene from the maternal chromosome. Unlike mouse Igf2, whose expression declines after birth, human IGF2 remains active throughout life. This characteristic has been attributed to a unique human gene promoter that escapes imprinting, but as shown here, it involves several different promoters with distinct tissue-specific expression patterns. Because new testable hypotheses could lead to critical insights into IGF2 actions in human physiology and disease, it is incumbent that our fundamental understanding is accurate. Similar challenges affecting knowledge of other human genes should promote attempts to critically evaluate, interpret, and correct human genetic data in publicly available databases.

Insulin-like growth factor 2 (IGF2), a conserved 67-amino acid, single-chain secreted protein, is intimately involved in several critical physiological processes in humans and other mammals (1)(2)(3)(4)(5)(6). IGF2, along with IGF1 and insulin, also comprises a gene family that is found in most mammalian species (7)(8)(9). In humans, IGF2 plays a central role in normal pre-natal growth, and changes in its expression have been implicated in a variety of developmental growth syndromes (10,11). For example, individuals with Silver-Russell syndrome produce diminished amounts of IGF2, have reduced fetal and post-natal somatic growth, and exhibit a variety of dysmorphic features and bodily asymmetry (10,11). In contrast, in Beckwith-Wiedemann syndrome, which is characterized by asymmetric overgrowth (10,11), overexpression of IGF2 is seen.
IGF2 also plays an important role in growth biology in other mammals. Targeted Igf2 gene knockout in mice caused impaired fetal but normal post-natal growth (12). IGF2 also was identified as a key quantitative trait locus for muscle mass in pigs, with more heavily muscled animals bearing a single nucleotide polymorphism in an IGF2 gene promoter (13), that was responsible for enhancing gene activity by interfering with binding of a transcriptional repressor (14,15).
Igf2/IGF2 gene regulation is complicated. In mice, humans, and presumably in other mammalian species, Igf2/IGF2 is part of a conserved autosomal linkage group that undergoes parental imprinting (16). Igf2/IGF2 is expressed from the paternally derived chromosome, with the adjacent H19 gene being activated on the maternal chromosome via the epigenetic actions of an imprinting control region (ICR) found in intergenic chromatin (Fig. 1A) (16). DNA within the ICR contains sites for the nuclear protein, CTCF (16-18). CTCF binds to the ICR in maternally-derived chromatin in mice, and it facilitates access of distal enhancers to the H19 promoter while preventing interactions with Igf2 promoters (17,18). In paternal chromatin, DNA in the ICR is methylated, and CTCF cannot bind, thus allowing the same enhancers to associate with Igf2 (17,18). It is likely that similar mechanisms control both IGF2 and H19 expression in humans, although experimental evidence is more limited (19).
In mice, the Igf2 gene is composed of eight exons, and gene expression is regulated by three adjacent promoters, termed P1-P3, and an upstream promoter, P0, each with its own unique non-coding 5′ leader exon, whereas exons 6-8 encode the IGF2 precursor protein (Fig. 1A) (20-22). The human IGF2 gene is more complicated, and based on the published literature, it contains an additional upstream promoter with its associated non-coding exon and another coding exon (Fig. 1) (23)(24)(25)(26). Because Igf2 expression declines precipitously during the early post-natal period in rodents (27)(28)(29)(30), but is maintained through adult life in humans (24,31), it has been assumed that the extra IGF2 gene promoter in humans is responsible for these species-specific differences in gene regulation (32). Moreover, as it has been shown that this distinctive human promoter is expressed in both alleles in some human tissues and thus in certain circumstances does not undergo parental imprinting (33), it has been hypothesized that lack of imprinting control also may be a key feature of maintenance of IGF2 production in adult humans (33). In both cases, the experimental evidence supporting these contentions is limited.
Major advances in genetics and genomics in recent years now present unique opportunities for enhancing our understanding of human physiology, and a wealth of information is now available in publicly accessible databases (34) with the potential for contributing new insights into human variation, disease predisposition, and human origins and evolution (35)(36)(37). To date, understanding these data at the individual gene level has been challenging, in part because both the extraction of the information and the evaluation of its quality can be difficult. Here, I have used publicly available genomic and gene expression databases to examine the human IGF2 gene. This study was prompted in part by recent publications potentially linking IGF2 DNA polymorphisms with a variety of traits, including differential susceptibility to type 2 diabetes (38) and association with athletic performance (39).

Topography of the human IGF2-H19 locus
The human IGF2-H19 locus on chromosome 11p15.5 shares many features with the corresponding mouse locus on chromosome 7 (26), including gene order, and the presence of an imprinting control region (ICR), located just 5′ to H19 and containing binding sites for the boundary transcription factor, CTCF (40,41), although in the human genome the segment differs from the mouse by consisting of two repeat units rather than a single element (Fig. 1A) (42). Also, as revealed here by homology searches, in the human IGF2-H19 locus there are DNA sequences similar to distal enhancers that are responsible in the mouse for both developmental and cell-lineage-specific regulation of Igf2 and H19 (Fig. 1A and Table 1) (43). However, in the human genome only 9 of the 10 elements defined in the mouse could be detected, with conserved segment 3 being absent (Table 1). In fact, with the exception of human conserved segment 8, in which DNA sequence similarity extended throughout the 277-bp mouse segment, DNA homology was limited to 50% or less of the corresponding mouse element (Table 1). In addition, although in the mouse genome these enhancers are located in intergenic DNA, on the corresponding human chromosomal region, 7/9 elements are found within the MRPL23 gene (Fig. 1A). Moreover, no experiments have yet shown that any of these putative enhancers are functional in human tissues, in contrast to studies using transgenic mice and other experimental models that have defined several chromosomal segments that could direct gene activity to endodermal- or mesenchymal-derived tissues (43)(44)(45)(46)(47)(48)(49).
Other differences between the mouse Igf2-H19 and human IGF2-H19 loci include the separation of Th in the former species to 226 kb from Ins2 versus 2.6 kb from INS in humans (Ins2 in mice is the homologue of human INS), a shorter intergenic distance between Igf2 and H19 (72 kb in mice versus 128 kb in humans) and a longer distance between Ins2 and Igf2 (12 kb in mice versus 1.4 kb in humans) (Fig. 1A). The anatomical relationship in the human genome among TH, INS, and IGF2 is depicted in Fig. 1B.

Human IGF2 gene
Based on compilation of primary data (23-26, 50-53), the human IGF2 gene spans ~29.3 kb of chromosomal DNA and contains five promoters and 10 exons (Fig. 2A). This information is not fully delineated in Ensembl, UCSC, or Gencode Genome Browsers or in other databases. In fact, the annotation of human genome assembly GRCh38/hg38 in each browser only lists three of five human IGF2 promoters (P2-4) and seven
The five human IGF2 gene promoters described in the published literature and partially identified in genome browsers govern the production of six different major classes of IGF2 transcripts with five distinct 5′ UTRs and two types of coding segments (Fig. 2, A and B) (25, 26, 50-53). P2 is responsible for a pair of IGF2 mRNAs that differ by the presence or absence of alternatively spliced exon 5. Transcripts directed by P1 include three untranslated exons, exons 1-3 (Fig. 2B). Exon 2 in this context is truncated at its 5′ end compared with when it is the sole 5′-untranslated exon in mRNAs regulated by P0 (Fig. 2B). P3 and P4 each direct expression of IGF2 mRNAs with single 5′-untranslated exons (Fig. 2B). There also is variation in the lengths of 3′ UTRs affecting most IGF2 mRNAs, secondary to alternative polyadenylation, as noted above (see Fig. 2B). The four transcripts controlled by P1, P0, P3, and P4 encode the same 180-amino acid IGF2 precursor protein (Fig. 2, B and C). This precursor is composed of a 24-amino acid signal peptide, a 67-residue mature IGF2 molecule, and an 89-amino acid COOH-terminal extension peptide, or E-domain (Fig. 2C). The signal peptide is encoded by exon 8, mature IGF2 by exons 8 and 9, and the E-region by exons 9 and 10 (Fig. 2C). The presence of exon 5 in some mRNAs directed by P2 leads to a second predicted IGF2 protein precursor of 236 residues.
This molecule is composed of the same 67-residue IGF2 and 89-amino acid E-peptide, but it contains an 80-amino acid presumptive signal peptide (54 codons from exon 5 and 26 from exon 8, with the additional two 5′ codons in exon 8 resulting from maintenance of the open reading frame after splicing of exons 5 and 8) (Fig. 2C).
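The segment lengths quoted above can be checked with simple arithmetic. The following is a minimal sketch that only adds up the numbers stated in the text; the constant and function names are illustrative and do not come from any database or library.

```python
# Segment lengths in amino acids, as stated in the text.
SIGNAL_STANDARD = 24            # signal peptide of the 180-residue precursor
SIGNAL_LONG = 54 + 26           # 54 codons from exon 5 plus 26 from exon 8
MATURE_IGF2 = 67                # mature IGF2 peptide
E_PEPTIDE = 89                  # COOH-terminal E-domain


def precursor_length(signal_peptide: int) -> int:
    """Total precursor length: signal peptide + mature IGF2 + E-peptide."""
    return signal_peptide + MATURE_IGF2 + E_PEPTIDE


standard_precursor = precursor_length(SIGNAL_STANDARD)
long_precursor = precursor_length(SIGNAL_LONG)
# Extra exon 8 codons used by the long form relative to the standard signal peptide.
extra_exon8_codons = 26 - SIGNAL_STANDARD

print(standard_precursor, long_precursor, extra_exon8_codons)  # 180 236 2
```

The two totals match the 180- and 236-residue precursors described in the text, and the difference of two codons in exon 8 matches the stated extension of the open reading frame.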
Three non-coding genes are located within the boundaries of the human IGF2 gene. All are annotated in Ensembl, UCSC, and Gencode Genome Browsers. Human IGF2 antisense (IGF2AS) is a 3-exon gene that is transcribed in the opposite direction from IGF2. One of its exons overlaps IGF2 exon 3 (Fig. 2A). MiR483 is composed of a single exon and is found between IGF2 exons 8 and 9 (Fig. 2A). It has been shown to be up-regulated in human Wilms tumors (54) and has been found in experimental studies to increase transcription from IGF2 P2, P3, and P4, and IGF2AS, although by unknown mechanisms (54). Both of these genes are located in corresponding positions within the mouse Igf2 locus. AC132217.1 is a predicted noncoding gene, with limited functional annotation, and does not appear to be present in the mouse genome.
Few studies have been performed examining the molecular mechanisms controlling the activity of each of the five human IGF2 promoters. Limited cell-based experiments have shown that human promoters P1, P3, P4, and P0 can function in transfected cells, whereas P2 cannot (25,53,58,59). Transcription factors C/EBPα, C/EBPβ, and SP1 have been found to activate P1 through potential proximal promoter control elements (32,60), whereas AP2 was shown to stimulate P3 (61), and the Wilms tumor repressor, WT1, was found to inhibit P3, potentially through many binding sites in proximal promoter DNA (62). In vivo, P0 has been shown to be active in several human fetal tissues, but only in adult skeletal muscle (25), whereas other promoters are expressed in a variety of different organs and tissues (23-26, 50-53).
There have been several analyses of parental imprinting of IGF2 in humans, which potentially controls access of distal transcriptional regulatory elements and their associated transcription factors to all IGF2 gene promoters (19, 33). Investigations have identified individuals with deletion mutations in the human ICR, which leads to bi-allelic expression of IGF2 and silencing of H19 (42), and also have found bi-allelic expression of IGF2 in several human cancers in which excess production of IGF2 may play a pathogenic role (55)(56)(57).

New insights regarding human IGF2 gene expression
GTEx has collected gene expression data by RNA-sequencing from many non-diseased human tissues (63-65), with 80-388 donors per tissue (GTEx release 7). Presented in Fig. 3 are the compiled data for IGF2. These results are based on information about isoform-specific transcripts in GTEx and have been quantified here according to the IGF2 promoter being used, which was determined by the presence of a distinct promoter-specific 5′-untranslated exon or exons. In these different organs and tissues, the overall abundance of IGF2 transcripts varied over a 34-fold range, with the highest levels of mRNAs seen in visceral adipose tissue, endo-cervix, and fallopian tube (103, 93, and 92 transcripts per kilobase million reads (TPM), respectively, Fig. 3A), and the lowest values in brain regions, pancreas, and stomach (2.9, 3.2, and 3.8 TPM, respectively, Fig. 3B). The vast majority of IGF2 mRNAs in this collection of human tissues and organs were regulated by P3 and P4, which together comprised 76 to >99% of total transcripts (Fig. 3). IGF2 mRNAs controlled by P1 were measurable only in liver (~15% of total), and the contributions of P0 and P2 were negligible (1% or less; some of the IGF2 mRNAs in the GTEx dataset did not contain a promoter-specific exon and may have arisen from endonucleolytic processing of larger IGF2 transcripts (66, 67)). Taken together, the GTEx results on IGF2 expression in human liver as presented here clearly undercut prior hypotheses about P1 being the major driver of IGF2 production in that organ (32,33), because it appears to be a minor species in this dataset. In fact, because P3 and P4 are the predominant active human promoters in adult liver and in other organs and tissues (Fig. 3), the results also argue against the idea that P1, which is not found in the mouse genome, is responsible for maintaining IGF2 gene transcription in human adults (33). Thus, the GTEx data collectively indicate that the cause of pervasive IGF2 mRNA biosynthesis and protein production throughout life in humans remains unknown and that the reasons for differences with mice, in which Igf2 expression is attenuated during the early post-natal period, also are undefined. These results therefore present a new opportunity to elucidate the molecular and biochemical mechanisms responsible for temporal control of IGF2/Igf2 gene activity in different mammalian species. Of note, analysis of mouse Igf2 promoter usage via the FANTOM5 promoter expression atlas (68, 69) demonstrates that P2 (homologue of human P3) and P3 (homologue of human P4) predominate in fetal and neonatal organs and tissues, although mouse P1 (homologue of human P2) is more active than P2 in neonatal murine liver (68, 69). Thus, some patterns of tissue- and organ-specific Igf2/IGF2 promoter activity appear to be similar between mice and humans.
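The promoter-level quantification described in this section amounts to expressing the TPM assigned to each promoter-specific 5′-untranslated exon as a share of the total. The following is a minimal sketch of that calculation, not the procedure actually used for Fig. 3. The function name `promoter_shares` is hypothetical, and the input values are invented placeholders rather than GTEx measurements.

```python
from typing import Dict


def promoter_shares(tpm_by_promoter: Dict[str, float]) -> Dict[str, float]:
    """Convert per-promoter TPM values into percentages of the summed TPM."""
    total = sum(tpm_by_promoter.values())
    if total <= 0:
        raise ValueError("total TPM must be positive")
    return {name: 100.0 * tpm / total for name, tpm in tpm_by_promoter.items()}


# Placeholder values for illustration only; these are NOT taken from GTEx.
example_tissue = {"P0": 0.5, "P1": 7.5, "P2": 0.5, "P3": 20.0, "P4": 21.5}

shares = promoter_shares(example_tissue)
p3_p4_share = shares["P3"] + shares["P4"]

for name, pct in shares.items():
    print(f"{name}: {pct:.1f}% of IGF2 transcripts")
print(f"P3 + P4 combined: {p3_p4_share:.1f}%")
```

Transcripts lacking a promoter-specific exon, which the text notes are present in the dataset, would need to be handled separately; this sketch simply assumes that every counted transcript has been assigned to one promoter.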

Predicted variation in IGF2 in human populations
The Exome Aggregation Consortium (ExAC) contains DNA sequence information from the exons of 60,706 people representing different population groups from around the world (70). The data have revealed substantial variation within the coding regions of genes in this large cohort, but the data have also shown that most alterations were uncommon, as more than half were detected in a single allele and over 99% were seen in less than 1% of the study group (70). In addition, the vast majority of this variation has consisted of synonymous nucleotide changes and amino acid substitutions (70). Alterations in the IGF2 gene were seen in 2.5% of the 121,412 alleles, with 80% of this being reflected in a single nucleotide change mapping to the splice acceptor site at the intron 4 to exon 5 junction (71) (found in ~2% of the total population; also see single nucleotide polymorphism rs149483638). Nearly all of the remaining ~0.5% consisted of 78 different predicted amino acid substitutions (71). However, the single change of arginine 157 to histidine in the E-peptide region predominated and accounted for ~75% of all such substitutions (71). The impact of these variants on IGF2 biology is unknown, and they thus collectively present new research opportunities for obtaining insights into aspects of human physiology (see below).
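The allele counts and percentages in this paragraph follow from one another. The sketch below reproduces that arithmetic using only the figures quoted in the text; the variable names are illustrative, and the rounded percentages are treated as exact inputs, which is an assumption.

```python
# Figures quoted in the text (rounded percentages treated as exact inputs).
individuals = 60_706
altered_fraction = 0.025        # alleles carrying any IGF2 alteration
splice_site_share = 0.80        # share of altered alleles with the splice acceptor change

# Each person contributes two alleles at an autosomal locus.
total_alleles = 2 * individuals

# Fraction of all alleles carrying the intron 4 to exon 5 splice acceptor change.
splice_site_fraction = altered_fraction * splice_site_share

# Fraction of all alleles carrying any other IGF2 alteration.
other_fraction = altered_fraction - splice_site_fraction

print(total_alleles)                      # 121412
print(f"{splice_site_fraction:.1%}")      # 2.0%
print(f"{other_fraction:.1%}")            # 0.5%
```

These values agree with the 121,412 alleles, the ~2% splice acceptor variant, and the remaining ~0.5% reported above.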

Potential alterations in IGF2 in cancer
The Cancer Genome Atlas represents a compendium of DNA sequencing results from genomic DNA derived from a variety of human neoplasms (see https://portal.gdc.cancer.gov/). Potential mutations at 149 different locations in the IGF2 gene have been found in 24 different cancers. The prevalence of these changes has ranged from over 7% of uterine endometrial carcinomas to ≤0.5% of invasive breast cancers, adenocarcinomas of the prostate, renal cancers, gliomas, or glioblastomas, among others (see https://portal.gdc.cancer.gov/genes/ENSG00000167244). Of the 149 different alterations identified in the IGF2 gene in these human cancers, 87 were found in IGF2 exons and 62 in introns (Tables 2 and 3). Considering just changes within exons that mapped to mature IGF2 transcripts, 36 predicted missense mutations, frameshifts, or the addition of a stop codon to IGF2 (Table 3). Twelve of these locations were identified as human population variants in ExAC, although their prevalence was extremely low and ranged from 1 to 11 alleles (<0.01%, Table 3). At present, the roles of these modified transcripts and IGF2 proteins in cancer biology are undefined, and the potential is thus high for gaining insights into aspects of carcinogenesis or metastasis by studying these mutations.

Challenges inherent in genome-based predictions about putative human IGF2 proteins
IGF2 mRNAs derived from P1, P0, P3, and P4 and the smaller transcript originating from P2 all encode a predicted secreted IGF2 protein precursor of 180 residues (Fig. 2C), which closely resembles IGF2 proteins found in other mammalian species (7,8). Once these mRNAs are translated, the 24-amino acid NH2-terminal signal peptide is co-translationally cleaved from the rest of the protein progenitor, and the 89-residue COOH-terminal E-peptide is post-translationally processed to release the fully bioactive 67-residue IGF2 (6,31). The fate or function of the putative 236-amino acid IGF2 precursor molecule originating from the larger transcript governed by P2 is unknown (Fig. 2C). Moreover, the potential 80-amino acid signal peptide in this predicted protein is far longer than other described mammalian signal sequences (72,73), and in the absence of any experimental evidence, it is not known whether an authentic IGF2 pro-protein or 67-amino acid mature IGF2 can be generated from this molecule or even whether the mRNA is stable in human cells.
Figure 3. Expression of IGF2 mRNAs in different human tissues. Data on organ- and tissue-specific human IGF2 gene expression was obtained from the GTEx portal and graphed as TPM. The information shown has been derived from the exon expression module of GTEx, and the displayed results have been quantified by the abundance of promoter-specific 5′-untranslated exons. A, 18 organs and tissues with the highest levels of IGF2 mRNAs are depicted. B, 20 organs and tissues with the lowest IGF2 mRNAs are shown. The data from human brain consist of the average levels of IGF2 mRNAs found in 12 different sub-regions (cortex, frontal cortex, anterior cingulate cortex, amygdala, hypothalamus, hippocampus, substantia nigra, putamen, caudate, nucleus accumbens, cerebellum, and cerebellar hemisphere).
The listing of the larger transcript regulated by IGF2 P2, and the encoded putative 236-amino acid IGF2 precursor protein as a major product of the human IGF2 gene in genome databases such as ExAC and others, has the potential to cause confusion for users searching for actionable information about IGF2. Moreover, this inference is wrong, based on GTEx gene expression data, which, as noted above, indicates that this is at best a very minor IGF2 mRNA species in multiple human tissues. One example of the problems created by this type of incorrect information may be found in the recently published analysis of a family with apparent Silver-Russell syndrome, including both pre- and post-natal growth deficiencies (74). DNA sequencing revealed that the genomes of all affected members harbored a heterozygous nonsense mutation in IGF2 that prevented production of the IGF2 protein, and indeed, accumulation of IGF2 in the blood of the subjects was substantially reduced from normal concentrations (74).
The mutation, which consisted of a single C→A change within IGF2 exon 8 in all subjects, was described as altering amino acid serine 64 to a termination codon (TCG>TAG) in the 80-amino acid presumptive signal peptide (thus named p.Ser64Ter). More accurately, this modification should be represented as p.Ser8Ter, reflecting its location in the signal sequence encoded by major IGF2 transcripts.
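The renumbering from p.Ser64Ter to p.Ser8Ter follows from the difference in length between the two predicted signal peptides (80 versus 24 residues). The sketch below illustrates that conversion under the assumption that the two precursors are identical from the first residue of the 24-amino acid signal peptide onward, as the text describes; the function name `to_major_transcript_numbering` is hypothetical.

```python
LONG_SIGNAL = 80        # presumptive signal peptide of the 236-residue precursor
STANDARD_SIGNAL = 24    # signal peptide of the 180-residue precursor
LONG_PRECURSOR = 236

# Residues present only at the NH2 terminus of the 236-residue precursor.
OFFSET = LONG_SIGNAL - STANDARD_SIGNAL  # 56


def to_major_transcript_numbering(position_in_long: int) -> int:
    """Map a residue number in the 236-residue precursor to the 180-residue one."""
    if not 1 <= position_in_long <= LONG_PRECURSOR:
        raise ValueError("position lies outside the 236-residue precursor")
    if position_in_long <= OFFSET:
        raise ValueError("residue has no counterpart in the 180-residue precursor")
    return position_in_long - OFFSET


print(to_major_transcript_numbering(64))   # 8, i.e. p.Ser64Ter corresponds to p.Ser8Ter
```

The same offset maps the final residue, 236, onto residue 180, consistent with both precursors sharing the mature IGF2 and E-peptide segments.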

How can we use genomic databases to accurately understand IGF2 actions in humans?
A recent publication identified the single nucleotide change mapping to and disrupting the IGF2 intron 4 to exon 5 splice acceptor site as being prevalent in a cohort of 8668 individuals of Mexican origin (average minor allele frequency (MAF) of 0.17 (38)). As noted above, this identical alteration was found with a MAF of 0.02 in the ExAC study group, which consisted of different populations, with 60% of individuals with European origins, ~8% with Hispanic origins, and ~8% from Africa (70). This new analysis also showed that the allele that perturbs the splice acceptor site was rare in other groups (MAF of 0.001 in Africans and 0.0002 in Europeans), and thus that the contribution of the Hispanic cohorts was likely primarily responsible for the overall observed ExAC population prevalence of ~2% (71). By studying individuals with and without type 2 diabetes, the authors additionally concluded that the minor allele was protective against this disease, with heterozygous individuals having a 22% and homozygotes a 40% diminished risk (38). Although intriguing, it remains unclear how modifications in this genomic location could alter disease risk, because transcripts directed by IGF2 P2 are normally minimally expressed in human tissues (Fig. 3). One possibility is that the minor allele can alter levels of INS-IGF2 read-through transcripts (25), although the abundance of these latter mRNAs and their functions in different human tissues has not been determined. Moreover, they are not found in GTEx except in liver, where expression levels are very low (≤1 TPM for each of 4 mRNA species). In addition, because the genetics of Mexico are very complicated, with region-specific sub-groups being identified (75), it is likely that the MAF of 0.17 identified in the study cohort (38) will vary significantly among different populations of Mexican origin once a larger and more thorough genetic analysis is completed.
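The statement that Hispanic cohorts probably account for most of the overall ExAC frequency can be illustrated with a weighted average of group-specific MAFs. The sketch below is a rough check under two loudly stated assumptions: the MAF of 0.17 from the Mexican-origin cohort (38) is used as a stand-in for the ExAC Hispanic group, and the roughly 24% of ExAC individuals whose ancestry is not itemized in the text are left out entirely. It is therefore a partial estimate, not a reanalysis of ExAC.

```python
# (share of ExAC individuals, minor allele frequency) for each group named in the text.
# ASSUMPTION: 0.17 comes from the Mexican-origin cohort (38), not from ExAC itself.
groups = {
    "European": (0.60, 0.0002),
    "Hispanic": (0.08, 0.17),
    "African": (0.08, 0.001),
}

# Contribution of each group to the pooled MAF = population share x group MAF.
contributions = {name: share * maf for name, (share, maf) in groups.items()}

# Pooled MAF from the three itemized groups only (other groups are not included).
pooled_maf_itemized = sum(contributions.values())

# Fraction of that pooled value attributable to the Hispanic group.
hispanic_share = contributions["Hispanic"] / pooled_maf_itemized

for name, value in contributions.items():
    print(f"{name}: {value:.5f}")
print(f"pooled MAF (itemized groups): {pooled_maf_itemized:.4f}")
print(f"Hispanic share of pooled MAF: {hispanic_share:.1%}")
```

Under these assumptions the three itemized groups give a pooled MAF of roughly 0.014, nearly all of it from the Hispanic term, which is in line with the interpretation in the text that this group drives the observed ExAC prevalence of ~2%.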
The statistical interpretation of the population genetics data and the associated claim of a new disease risk allele (38) noted above illustrate one of the major quandaries and challenges of modern genetics: how to connect clinical and basic research to accurately and mechanistically understand disease causation (34). As analyzed here for the human IGF2 gene, the difficulties in attaining this goal are possibly even more complicated. The complexity of the gene, with its multiple promoters, alternative splicing, overarching regulation by parental imprinting, and the possibility of two distinct classes of protein precursors, yet a single 67-amino acid biologically active IGF2 peptide, presents its own challenges, and these are amplified by the paucity of complete or correct information about IGF2 gene structure and expression currently available in gene repositories. Because new testable hypotheses derived from analysis of human IGF2 in data from genomic and gene expression sequencing projects could lead to novel insights about the biology of IGF2 and its actions in both human physiology and disease, it is incumbent that these genetic databases thoroughly describe the fundamentals of IGF2 gene structure and gene activity. As similar opportunities for translation of genetic information to medical practice will exist for other human genes and gene families, there is a need for ongoing critical review and correction of data in public gene repositories to ensure future users that the primary information is both accurate and complete.
Author contributions-P. R. conceptualization; data curation; formal analysis; funding acquisition; investigation; methodology; writing-original draft; project administration; writing-review and editing.