From Single Variants to Protein Cascades

Understanding the role of genetics in disease has become a central part of medical research. Non-synonymous single nucleotide variants (nsSNVs) in coding regions of human genes frequently lead to pathological phenotypes. Beyond single variations, the individual combination of nsSNVs may add to pathogenic processes. We developed a multiscale pipeline to systematically analyze the existence of quantitative effects of multiple nsSNVs and gene combinations in single individuals on pathogenicity. Based on this pipeline, we detected in a data set of 842 nsSNVs discovered in 76 genes related to cardiomyopathies, associated nsSNV combinations in seven genes present in at least 70% of all 639 patient samples, but not in a control cohort of healthy humans. Structural analyses of these revealed primarily an influence on the protein stability. For amino acid substitutions located at the protein surface, we generally observed a proximity to putative binding pockets. To computationally analyze cumulative effects and their impact, pathogenicity methods are currently being developed. Our approach supports this process, as shown on the example of a cardiac phenotype but can be likewise applied to other diseases such as cancer.

Understanding the role of genetics in disease has become a central part of medical research. Non-synonymous single nucleotide variants (nsSNVs) in coding regions of human genes frequently lead to pathological phenotypes. Beyond single variations, the individual combination of nsSNVs may add to pathogenic processes. We developed a multiscale pipeline to systematically analyze the existence of quantitative effects of multiple nsSNVs and gene combinations in single individuals on pathogenicity. Based on this pipeline, we detected in a data set of 842 nsSNVs discovered in 76 genes related to cardiomyopathies, associated nsSNV combinations in seven genes present in at least 70% of all 639 patient samples, but not in a control cohort of healthy humans. Structural analyses of these revealed primarily an influence on the protein stability. For amino acid substitutions located at the protein surface, we generally observed a proximity to putative binding pockets. To computationally analyze cumulative effects and their impact, pathogenicity methods are currently being developed. Our approach supports this process, as shown on the example of a cardiac phenotype but can be likewise applied to other diseases such as cancer.
Genetic alterations such as non-synonymous variants play a critical role in human diseases (1). Non-synonymous single nucleotide variants (nsSNVs) 2 refer to single base changes in DNA coding regions altering the amino acid sequence of a protein. A pathogenic phenotype may arise when an amino acid substitution affects structurally important residues and sites relevant for function, such as residues in catalytic sites of enzymes.
In Mendelian disorders, where variation in a single gene is responsible for the phenotypic consequence, thousand causative genes have been identified already (2). Detecting the causative genes in common diseases such as hypertension or diabetes, however, still remains a challenge. In general, these diseases are caused by a varying number of genetic alterations and are influenced by environmental factors that modulate the severity and type of disease-related phenotypes (3). Several nsSNVs in a single gene may exert a quantitative effect on the phenotypic observation, whereas a phenotype can also result from the combined action of nsSNVs in many genes. Because the experimental analysis of the pathogenic potential of variants is laborious and time consuming, several methods to predict the biological impact of nsSNVs on the corresponding proteins and their function have been developed in recent years. Most of these prediction methods are based on evolutionary information and/or combine functional and structural parameters as well as information drived from multiple sequence alignments. According to the obtained features, nsSNVs are classified into benign or pathogenic using different machine learning methods such as neural networks, random forests, support vector machines or Bayesian methods and mathematical operations (4 -6). However, none of these methods actually considers the influence of several nsSNVs or mutual effects. In fact, a human individual usually has more than one nsSNV within interacting genes or even within one single gene. From a medical point of view, especially the individual combination of nsSNVs may play a crucial role in clinical diagnostics regarding personalized medicine, because genetic variations, for example, have been identified to influence selection, dosing, and adverse events of medical drugs (7).
Recent studies analyzed the existence of compensatory mutations canceling the damaging effect of deleterious mutations (8). A compensatory mutation occurs when the loss caused by one mutation is corrected by its epistatic interaction with a second mutation at a different site in the genome. In contrast, Westphal et al. (9) identified a mild polymorphism in dolichyl pyrophosphate Man9GlcNAc2 ␣1,3-glucosyltransferase (ALG6) that putatively exacerbates the phenotypic effect caused by dysfunction of the phospho-manno-mutase2 (PMM2).
The interaction of different genes is generally referred to epistasis, however, a current review claims this term to be confusingly and even conflictively used in available literature (10).
To not further add to this confusion, we speak of interacting genes or proteins, respectively. Due to the existence of compensatory mutations, there might also exist a cumulative effect turning benign mutations in an ensemble, regardless whether in one or multiple interacting genes, into a pathogenic or even more pathogenic phenotype.
Traditional approaches predict the influence of nsSNVs on pathogenicity for single variants (11). However, the impact of nsSNVs on the phenotype of the patient can arise from multiple factors (12). Accumulating nsSNVs may have a quantitative effect on the dysfunction of the gene, as well as nsSNVs in interacting genes may aggravate or attenuate a pathological effect. In consequence, a multiscale analysis of nsSNVs, outlined in Fig. 1, which includes a three-dimensional context, interaction information, and functional cascades, is required to capture the whole range of nsSNV impact.
To address this issue, we determined putative nsSNV associations in a high-quality clinical data set of cardiomyopathy patients and studied the possible accumulation events affecting the phenotypes of the patient. We further analyzed available protein three-dimensional structures within our data set and the impact of nsSNVs to gain insights in pathogenic molecular mechanisms. Because subcellular protein localization can also influence phenotypic behavior (13), we analyzed the subcellular localization of the proteins encoded by the identified genes featuring associated nsSNVs.

Materials and Methods
Data Sets-We analyzed a data set comprising 842 nsSNVs in 76 genes that are clinically relevant for dilated cardiomyopathy (DCM) (known causes and likely candidate genes for DCM), found by studying the genetics in 639 unrelated patients with sporadic (51%) or familial (49%) DCM (14). In the investigated region, about 99.1% of the targeted genomic region is covered at least 50-fold and each patient carried an average of 32 nsSNVs.
In addition, we generated a control set based on the general population of the 1000 genomes project, to be able to evaluate detected putatively DCM-related nsSNV patterns. In detail, we downloaded the BAM files from 445 samples and applied the same variant calling and filtering algorithms, as described for the analyzed DCM cohort (14). To match the European INHERITANCE cohort, we only considered individuals with a European descent: Utah residents with northern and western European ancestry (CEU), Finnish in Finland (FIN), British in England and Scotland (GBR), Iberian populations in Spain (IBS), and Toscani in Italy (TSI).
For the nsSNVs in the DCM data set, we collected available information deposited in the SwissProt databases (15), dbSNP (dbSNP build 138) (16) and HGMD (July 2014) (17). SwissProt provides a collection of human polymorphisms and disease mutations (HUMSAVAR) assigned according to literature reports on probable disease association (18). Tightly coupled with dbSNP, ClinVar accessions report human variations and interpretations of the relationship of these variations to human health (19). Entries are labeled according to clinical significance. Moreover, the HGMD collates known (published) gene lesions responsible for human inherited disease. Next, we built FIGURE 1. Scheme for multiscale analysis of nsSNVs. The traditional approach referring to pathogenicity prediction of single nsSNVs is extended by more complex levels such as rule mining to detect nsSNV sets in patients. On the top level, pathway analysis as well as subcellular localization, are applied to study the pathogenic impact of multiple nsSNVs. a test set from the DCM data including only nsSNVs with at least one annotation in these three databases as well as benign or disease-linked information. Although about 60% are deposited in dbSNP with an rs ID (reference SNP cluster ID), only 45% have pathogenicity information available. The neutrallabeled set comprises 192 nsSNVs and the disease-associated set 147. A total of 55% nsSNVs in the DCM data set have no available clinical significance information, and even about 40% have neither an rs ID nor other known identifiers and annotations.
Association Rule Learning-Frequently, human individuals carry more than one single nsSNV even within one gene. Beyond this, genes and their encoded proteins interact with each other.
To discover strong relationships between variables in large data sources, association rule learning is generally applied. Association rule learning uncovers hidden relationships within the tested data by formulating association rules, whereas using different measures of interest to quantify the quality of the generated rules (20). The statistical significance of an association rule is measured by its support and confidence, where support refers to a frequency constraint determining the quantitative applicability of a rule and confidence measures its reliability.
Via association rule learning, we studied whether there are frequent combinations of nsSNVs within our DCM data set and the healthy control cohort. In fact, we identified combinations of nsSNVs occurring within one single gene as well as combinations of nsSNVs in one gene accumulating with combinations in other genes.
We applied the R package a rules (21) using the implemented a priori algorithm (22). The confidence threshold was set to 0.8 and different levels of support, starting with at least 0.5, were tested.
Network Analysis-Due to the growing availability of high throughput biological data, the analysis of molecular networks gained significant interest. To determine the biological and functional connections of the detected associated genes with nsSNV combinations, we used several information sources: the STRING database (23), the UniProt-GOA database (24), and the KEGG PATHWAY database (25).
Besides the biological connections among the associated nsSNVs, we also investigated their topological characteristics within the STRING human interaction network. To detect putative interaction hubs, we determined betweenness and degree for each node in the human STRING network using the R package igraph. The degree of a node identifies the number of edges connected to the node, whereas the node betweenness is an indicator of the centrality of the node in the network.
Structural Location of Amino Acid Substitutions Introduced by nsSNVs-The effect of a nsSNV critically depends on the structural location of the mutated residue, especially if it is buried in the hydrophobic core or exposed on the protein surface (27). We selected proteins within our DCM data set, with a complete Protein Data Bank (PDB) (28) three-dimensional structure available. Next, we calculated solvent accessibilities using Naccess (29) to determine whether mutated residues tend to accumulate on the protein surface or are rather buried inside the protein. Because previous studies identified the majority of pathogenic nsSNVs to destabilize the structure of a protein (30), we also analyzed protein stability changes upon mutation based on I-Mutant2.0 predictions (31). The predicted free energy change in I-Mutant2.0 is calculated from the unfolded Gibbs free energy change of the mutated protein minus the unfolding free energy value of the native protein.
Predictions were based on default values for temperature (25) and pH (7).
We also predicted possible binding pockets via LIGSITEcsc (32). LIGSITEcsc automatically identifies binding pockets on the surface of a protein based on the Connolly surface and the degree of conservation. We used the default parameter settings of 1-Å grid space and a probe radius of 5 Å.
Subcellular Localization of Mutated Proteins-For visualizing and analyzing the localization of the proteins encoded by genes with nsSNVs, we used the CELLmicrocosmos 4.2 Pathway-Integration (CmPI) (33). CmPI is connected to DAWIS-MD, a data warehouse containing a number of databases (34). In the context of this work, the following databases were queried: BRENDA (35), GO (36), Reactome (37), and UniProt (18). Because each protein usually can obtain different localization entries from databases, CmPI applies a context-based localization prioritization. The cell component acquiring the most localization hits is assigned to the queried protein and verified by cellular topology analysis: connected proteins should not receive distant localizations (e.g. nucleus and extracellular matrix).

Results
Association Rule Learning and Network Analysis-To analyze whether nsSNVs occur accumulated within one gene or within interacting genes, we applied association rule learning. In the DCM data, we identified nsSNV combinations accumulating in the single genes MYPN (Myopalladin), CACNA1C (voltage-dependent L-type calcium channel subunit ␣-1C), DMD (Dystrophin), ADRB2 (␤2-adrenergic receptor), and RBM20 (RNA-binding protein 20) with high support and significant confidence values. Table 1 lists the detected nsSNVs significantly associated within one gene. The nsSNV combinations in RBM20 and CACNA1C are even found in at least 90% of all patients.
According to database entries in HGMD and SwissProt, RBM20 is already related to DCM and CACNA1C to the Timothy and Brugada Syndrome. MYPN, which incorporates the most associated nsSNV accumulations, is linked to different forms of cardiomyopathies (familial, hypertrophic, and dilated). ADRB2 participates in signal transduction and namely in the adrenergic signaling in cardiomyocytes. DMD is involved in several pathways relevant for cardiac diseases such as DCM, hypertrophic cardiomyopathy, arrhythmogenic right ventricular cardiomyopathy, and viral myocarditis. All identified nsSNV associations, however, are annotated as benign nsSNVs.
Furthermore, we detected nsSNVs in seven different genes (CACNA1C, SMYD2 (N-lysine methyltransferase SMYD2), PARVB (␤-parvin), KCNE1 (potassium voltage-gated channel subfamily E member 1), RBM20, KCNQ2 (potassium voltagegated channel subfamily KQT member 2), and JUP (Junction plakoglobin) significantly associated with each other in the DCM patients. In at least 70% of all DCM patients, these seven genes including the particular nsSNVs revealed strong associations to each other. Using the corresponding association rule setup, these specific nsSNV-gene combinations are not present in the control cohort. Only SMYD2, PARVB, KCNE1, and JUP revealed an association in healthy controls. According to a large-scale analysis of the human transcriptome in 2004, all of the associated genes revealed significant expression in the heart (38). The majority of detected genes with associated nsSNVs is already known in the context of diseases such as Brugada Syndrome, Long QT Syndrome, Naxos Disease, and different stages of DCM. In contrast, all association rule-detected nsSNVs within these genes are annotated as benign, except the N749T mutation in KCNQ2, which has no available annotations. Fig. 2 compares the information annotations of all genetic variants within the DCM data set with the association rule detected. Among the 639 DCM patients, we identified 26 without already known or annotated disease-associated nsSNVs. The 26 DCM patients mainly carry benign and not annotated variants. Interestingly, the intersection of their inherited nsSNVs revealed exactly the detected associated nsSNVs in CACNA1C, SMYD2, PARVB, KCNE1, RBM20, KCNQ2, and JUP. Table 2 lists the detailed nsSNV combinations.
We further analyzed functional interaction and biological intersection of the genes comprising the identified associated nsSNVs and the corresponding proteins including the introduced amino acid substitutions, respectively. Referring to the corresponding GO terms of the genes with associated nsSNV combinations, the majority participates in protein binding, voltage-gated ion channel activity, and transport. A mutation of residues involved in complex interaction networks can critically influence large interaction cascades by spreading the implemented loss across the network. To provide more insights into a putative relationship of the detected genes, we combined and visualized the extracted interaction information from the STRING human network with the available biological knowledge in Fig. 3. We highlighted GO overlaps within the resulting networks using Cytoscape (39). The edges in the network refer to available interactions between their nodes. CACNA1C, JUP, SMYD2, PARVB, and KCNQ2 are directly connected to large hubs within the human network. KCNE1, KCNQ2, and CACNA1C interact functionally with each other (40). KCNE1 attenuates the current amplitude of the KCNQ2 channel subunit and slows its gating kinetics (41). According to the KEGG PATHWAY database, KCNE1 is part of the adrenergic signaling in cardiomyocytes. A perturbation of its channel function by inherited mutations results in increased susceptibility to cardiac arrhythmias. KCNQ2 belongs to the cholinergic synapse and CACNA1C even takes part in both pathways. Interestingly, KCNE1, CACNA1C, and KCNQ2 are already targets of drugs against arrhythmia, atrial fibrillation, congestive heart failure, left ventricular hypertrophy, and isolated systolic hypertension (42). Furthermore, there is recent evidence that the post-transmembrane domain region of KCNE1 interacts with the KCNQ1 channel to modulate the I(K) current amplitude and gating kinetics (43).
Structural Location of Amino Acid Substitutions Introduced by nsSNVs-The protein structure reveals interactions between residues that are distant in primary sequence but close in threedimensional space. Solvent accessibility provides an intuitive and quantitatively reasonable idea of the complexity of the molecular interaction network involved in a residue (44). In consequence, we calculated solvent accessibilities using Naccess (29) for the 8 proteins (comprising 46 amino acid substitutions) in our data set with an available PDB (28) structure to  analyze whether disease-associated amino acid substitutions cluster on the protein surface or at buried sites. The results confirm the findings of Wang and Moult (45) for single nsSNVs. The majority (89%) of disease-linked mutations introduced by nsSNVs are located inside the protein probably affecting stability, whereas benign-annotated substitutions mainly cluster on the protein surface (67%). To analyze putative protein stability changes upon mutation, we calculated stability change predictions via I-Mutant2.0. The majority (81%) of substitutions are predicted to decrease protein stability, independent of their location in the three-dimensional protein structure (see Fig. 4). For five protein structures, we were also able to predict possible binding pockets using LIGSITEcsc. 8 of 10 mutations at the surface of the protein are found close to a predicted binding pocket. Interestingly, two mutations introduced by the detected significantly associated mutation nsSNV combinations, KCNE1 S38G and SMYD2 G165E, are also located close to a possible binding pocket (see Fig. 5) of the corresponding protein. SMYD2 lysine methylates the tumor suppressor TP53, leading to decreased DNA-binding activity and subsequent transcriptional regulation activity of TP53 (46). According to literature, the binding interface of TP53 and SMYD2 is located between the catalytic SET domain (residue 1-282) and the C-terminal domain (47). In addition, SMYD2 has been reported to map primarily to the cytoplasm indicating SMYD2 targets a small subset of histones at specific chromatin loci as well as non-histone substrates (48).
Subcellular Localization of Mutated Proteins-Based on the previously discussed methods, seven genes were identified  showing specific nsSNVs in more than 70% of all analyzed patients: CACNA1C, JUP, KCNE1, KCNQ2, PARVB, RBM20, and SMYD2. In particular, all of the detected genes have a significant expression in the heart. Using CmPI, Homo sapiensrelated potential localizations for these seven genes were acquired using cell component-gene association data from the aforementioned databases. The associated proteins of these genes show five potential localizations: nucleus, cytosol, cell membrane, lysosome, and the extracellular matrix. Moreover, five of them provide multiple potential localizations. An overview and distribution of these locations can be found in Fig. 6. Based on the localization data, the hypothesis can be formulated that these proteins are assembled in a potential cascade starting from the nucleus, through the cytosol, entering the cell membrane, and proceeding to the extracellular matrix, or vice versa. This theory is supported by the fact that RBM20 is exclusively localized at the nucleus and KCNQ2 at the cell membrane, whereas PARVB seems to travel between the extracellular matrix, the cell membrane, and the cytosol. We visualized the connections of these proteins including the assigned subcellular locations in Fig. 7. For the purpose of clarity, Fig. 7 condenses the detected interactions to the identified potential cascade. However, the in silico study requires experimental analysis to validate the formulated cascade hypothesis.

Discussion
A human individual usually has more than one nsSNV. From a medical point of view, the individual combination of nsSNVs may play a crucial role in clinical diagnostics regarding personalized medicine (49). Previous studies analyzed the occurrence and characteristics of compensatory mutations (8), although there might also be cumulative effects of mutations, packing single benign effects together to an observable disease phenotype. More precisely, benign-annotated nsSNVs in combination might be responsible for a pathological effect. Westphal et al. (9), for example, studied congenital disorders of glycosylation and identified a mild polymorphism in ALG6 putatively exacerbating an already severe pathogenic phenotype caused by PMM2 dysfunction.
In this study, we developed a multiscale pipeline to systematically analyze the existence of quantitative effects of nsSNV sets. We detected DCM patients without identified disease-associated nsSNVs, but inhering mainly benign-labeled variants. Via association rule learning, we detected associated combinations of nsSNVs within at least 70% of all cardiomyopathy patients in the data set. These specific combinations could not FIGURE 4. Information on nsSNV-introduced mutations in proteins with available structure. Localization within protein structure was calculated using Naccess. Pathogenicity information refers to annotations deposited in SwissProt, dbSNP, and the HGMD. Stability changes were predicted via I-Mutant2.0.  be identified as associated in the control cohort of healthy humans, which hints to disease relevance and, however, requires further analysis. Due to the lack of prediction tools able to assess a cumulative effect of nsSNVs, a pathogenicity prediction for the identified associated nsSNVs was not possible. Furthermore, a three-dimensional structure of the encoded protein was available for only two of the identified associated nsSNV genes, KCNE1 and SMYD2. For the remaining associated genes, even an adequate template for structural modeling was missing. Interestingly, the associated nsSNVs in KCNE1 (S38G) and SMYD2 (G165E) are located at the surface of the encoded protein close to the predicted binding pockets. Stability change predictions based on protein sequence, however, revealed that all of the associated amino acid substitutions decrease protein stability except SMYD2 G165E. In addition, we studied available pathway information including GO annotations and analyzed the interaction networks. Proteins might act at different stages of the same pathway contributing quantitatively to the progressive dysfunction of the pathway until a disease phenotype is observed. Both genes in the study by Westphal et al. (9), for example, encode enzymes involved in a different part of the post-translational modification process without a direct interaction. KCNE1 and CACNA1C as well as KCNQ2 and CACNA1C participate in the same pathways and mainly contribute to voltage-gated ion channel activity and transport. Ion channels are key components in a wide variety of biological processes, such as muscle contraction (e.g. cardiac muscle contraction), epithelial transport of nutrients and ions, or T-cell activation. A number of genetic disorders (e.g. Long QT syndrome, Brugada syndrome) are related to ion channel dysfunctions. JUP and PARVB are involved in cell junction organization and interact as well as SMYD2 directly with the top-ranked hubs within the human network. Cell junctions play a major role in communication between neighboring cells and cell stress reduction. The cellular function of the identified genes, however, is largely unclear despite their role in pathological processes. Although, for example, a 2-bp deletion mutation in the junction protein plakoglobin (JUP) has been found essential for the molecular genetics of arrhythmogenic right ventricular cardiomyopathy, the molecular basis of reduced JUP in the cell membrane of arrhythmogenic right ventricular cardiomyopathy remains unclear (50). It still has to be clarified whether desmosome protein mutations impair desmosome assembly and cause reduced incorporation of JUP into the cell membrane.
Here, we performed a subcellular localization enabling the formulation of the hypothesis that the seven identified proteins form a potential cascade: RBM20 is only found in the nucleus, SMYD2 travels between the nucleus and the cytosol, and the other five genes are mostly associated to the cell membrane, where PARVB shows potential localizations between the extracellular matrix, cell membrane, and cytosol. Further experimental studies are required to analyze and validate this cascade hypothesis.
Because the used data set only comprises genes clinically relevant for DCM due to known causes or the proposal of likely candidates, an association analysis might miss genes or nsSNVs not yet identified to have an influence on DCM. Previous studies, however, supposed the inclusion of known disease pathobiology and prior knowledge to improve analysis results (51). So far, to the best of our knowledge, the methods to analyze the effect of multiple nsSNVs in different genes are limited and suitable approaches have to be developed. In this study, we targeted already detected DCM candidate genes to establish proper analysis strategies.
Finally, all systematic analyses point to a connection of the detected genes featuring associated nsSNVs, in functionality as well as in their contribution to biological pathways. Moreover, the specific nsSNV combinations identified as significantly associated in DCM patients could not be detected associated within the healthy control data. In a next step, further association studies on even larger patient cohorts with cardiomyopathies are required to validate the identified nsSNVs. Additional studies on patients with phenotypes different from cardiomyopathies, in particular, can assess nsSNV specificity. The functional understanding of the identified genes, however, is in many cases limited to their role in physiological or pathological processes leaving many open questions about their cellular role. Without a sufficient understanding of these, the design of experiments, which could be used to address the cumulative effect of specific nsSNVs in several genes, is highly challenging and even near to impossible in many cases. The developed multiscale analysis pipeline has the potential to promote this process.

Conclusion
The impact of nsSNVs in coding genes on the cause and the severity of a disease has become a key task in human health care. A pathogenic phenotype can result from several nsSNVs in one single gene as well as from the combination of nsSNVs in many genes. Single gene dysfunctions linked to diseases have already undergone extensive studies. However, the majority of common diseases such as cardiomyopathy are probably caused by several genetic alterations and environmental influences. In this study, we identified associated nsSNVs in seven genes putatively contributing to cardiomyopathic phenotypes. Due to missing computational methods to analyze cumulative nsSNVs and to assess their impact on pathogenicity, the validation of the clinical relevance is limited. In fact, genetic testing enables predictive diagnosis and can enhance presymptomatic intervention. Future studies, however, focusing on translation of computational findings to applicable mechanisms in clinical routine and capturing diagnostic demands, are highly required.
Author Contributions-S. C. M. and C. B. contributed to data analysis. B. S. performed the cell localization studies. B. M. and J. H. were responsible for data generation. S. C. M., B. S., E. M., and A. K. wrote the manuscript. All authors read and approved the final manuscript.