Genome-wide association studies (GWAS) and whole-exome sequencing (WES) generate massive amounts of genomic variant information, and a major challenge is to identify which variations drive disease or contribute to phenotypic traits. Because the majority of known disease-causing mutations are exonic non-synonymous single nucleotide variations (nsSNVs), most studies focus on whether these nsSNVs affect protein function. Computational studies show that the impact of nsSNVs on protein function reflects sequence homology and structural information and predict the impact through statistical methods, machine learning techniques, or models of protein evolution. Here, we review impact prediction methods and discuss their underlying principles, their advantages and limitations, and how they compare to and complement one another. Finally, we present current applications and future directions for these methods in biological research and medical genetics.
The structure and function of proteins underlie most aspects of biology and their mutational perturbations often cause disease. To identify the molecular determinants of function as well as targets for drugs, it is central to characterize the important residues and how they cluster to form functional sites. The Evolutionary Trace (ET) achieves this by ranking the functional and structural importance of the protein sequence positions. ET uses evolutionary distances to estimate functional distances and correlates genotype variations with those in the fitness phenotype. Thus, ET ranks are worse for sequence positions that vary among evolutionarily closer homologs but better for positions that vary mostly among distant homologs. This approach identifies functional determinants, predicts function, guides the mutational redesign of functional and allosteric specificity, and interprets the action of coding sequence variations in proteins, people and populations. Now, the UET database offers pre-computed ET analyses for the protein structure databank, and on-the-fly analysis of any protein sequence. A web interface retrieves ET rankings of sequence positions and maps results to a structure to identify functionally important regions. This UET database integrates several ways of viewing the results on the protein sequence or structure and can be found at http://mammoth.bcm.tmc.edu/uet/.
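The ranking idea described above can be illustrated with a minimal sketch of an integer-valued trace. This is not the UET implementation. It assumes the phylogenetic tree has already been cut into successive groupings (from one group holding every sequence down to finer groups of close homologs), and it supplies those groupings by hand. The function name `et_integer_ranks`, the toy alignment, and the partitions are all hypothetical. A position's rank grows by one for every tree level at which it still varies inside some group, so positions that vary among close homologs get worse (larger) ranks.

```python
from typing import Dict, List, Sequence


def et_integer_ranks(
    alignment: Dict[str, str],
    partitions: Sequence[Sequence[Sequence[str]]],
) -> List[int]:
    """Return one rank per alignment column (1 = best, i.e. most important).

    `partitions` lists groupings of sequence names, ordered from the coarsest
    (a single group with all sequences) to the finest. These groupings stand
    in for cuts of a phylogenetic tree and are an assumption of this sketch.
    """
    length = len(next(iter(alignment.values())))
    ranks = []
    for pos in range(length):
        rank = 1
        for groups in partitions:
            # A column passes a level only if every group is invariant there.
            invariant_in_every_group = all(
                len({alignment[name][pos] for name in group}) == 1
                for group in groups
            )
            if not invariant_in_every_group:
                rank += 1
        ranks.append(rank)
    return ranks


# Hypothetical toy alignment: A1/A2 are close homologs, as are B1/B2.
toy_alignment = {
    "A1": "KGD",
    "A2": "KGE",
    "B1": "KSD",
    "B2": "KSD",
}

# Hypothetical tree cuts: 1 group, then 2 groups, then 3 groups.
toy_partitions = [
    [["A1", "A2", "B1", "B2"]],
    [["A1", "A2"], ["B1", "B2"]],
    [["A1"], ["A2"], ["B1", "B2"]],
]

toy_ranks = et_integer_ranks(toy_alignment, toy_partitions)
print(toy_ranks)  # [1, 2, 3]
```

In this toy case the fully conserved first column ranks best, the second column (which differs only between the two distant clades) ranks next, and the third column (which differs between the close homologs A1 and A2) ranks worst.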
Advances in cellular, molecular, and disease biology depend on the comprehensive characterization of gene interactions and pathways. Traditionally, these pathways are curated manually, limiting their efficient annotation and, potentially, reinforcing field-specific bias. Here, in order to test objective and automated identification of functionally cooperative genes, we compared a novel algorithm with three established methods to search for communities within gene interaction networks. Communities identified by the novel approach and by one of the established methods overlapped significantly (q < 0.1) with control pathways. With respect to disease, these communities were biased to genes with pathogenic variants in ClinVar (p ≪ 0.01), and often genes from the same community were co-expressed, including in breast cancers. The interesting subset of novel communities, defined by poor overlap with control pathways, also contained co-expressed genes, consistent with a possible functional role. This work shows that community detection based on topological features of networks suggests new, biologically meaningful groupings of genes that, in turn, point to health and disease relevant hypotheses.
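One common way to ask whether a detected community overlaps a control pathway more than chance would predict is a one-sided hypergeometric test. The abstract does not say which test the authors used, so the sketch below is only an assumption for illustration. The function name `hypergeom_overlap_pvalue` and the gene labels are hypothetical. Turning such p-values into q-values would additionally require a multiple-testing correction across all community and pathway pairs, which is not shown.

```python
from math import comb
from typing import Set


def hypergeom_overlap_pvalue(
    universe_size: int, community: Set[str], pathway: Set[str]
) -> float:
    """P(overlap >= observed) when the community is drawn at random.

    Assumes both gene sets are subsets of a universe of `universe_size` genes.
    """
    observed = len(community & pathway)
    n = len(community)  # number of genes drawn
    k_total = len(pathway)  # number of "success" genes in the universe
    upper = min(n, k_total)
    # Sum the hypergeometric probabilities from the observed overlap upward.
    tail = sum(
        comb(k_total, i) * comb(universe_size - k_total, n - i)
        for i in range(observed, upper + 1)
    )
    return tail / comb(universe_size, n)


# Hypothetical example: 20 genes in the network, 4 of 5 community genes
# also belong to a 5-gene control pathway.
community_genes = {"G1", "G2", "G3", "G4", "G9"}
pathway_genes = {"G1", "G2", "G3", "G4", "G5"}
p_value = hypergeom_overlap_pvalue(20, community_genes, pathway_genes)
print(f"{p_value:.6f}")
```

Exact integer arithmetic with `math.comb` avoids floating-point underflow for small examples like this one; for genome-scale universes a statistics library would be a more practical choice.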
The scientific literature is vast, growing, and increasingly specialized, making it difficult to connect disparate observations across subfields. To address this problem, we sought to develop automated hypothesis generation by networking at scale the MeSH terms curated by the National Library of Medicine. The result is a MeSH Term Objective Reasoning (MeTeOR) approach that tallies associations among genes, drugs and diseases from PubMed and predicts new ones. Comparisons to reference databases and algorithms show MeTeOR tends to be more reliable. We also show that many predictions based on the literature prior to 2014 were published subsequently. In a practical application, we validated experimentally a surprising new association found by MeTeOR between novel Epidermal Growth Factor Receptor (EGFR) associations and CDK2. We conclude that MeTeOR generates useful hypotheses from the literature (http://meteor.lichtargelab.org/).

AUTHOR SUMMARY
The large size and exponential expansion of the scientific literature forms a bottleneck to accessing and understanding published findings. Manual curation and Natural Language Processing (NLP) aim to address this bottleneck by summarizing and disseminating the knowledge within articles as key relationships (e.g. TP53 relates to Cancer). However, these methods compromise on either coverage or accuracy, respectively. To mitigate this compromise, we proposed using manually-assigned keywords (MeSH terms) to extract relationships from the publications and demonstrated a comparable coverage but higher accuracy than current NLP methods. Furthermore, we combined the extracted knowledge with semi-supervised machine learning to create hypotheses to guide future work and discovered a direct interaction between two important cancer genes.

INTRODUCTION
It is difficult to keep abreast of new publications. Currently, PubMed contains over 28 million papers (http://www.ncbi.nlm.nih.gov/pubmed), 3 million more than three years ago. This steady accumulation of findings gives rise to a large number of latent connections that Literature-Based Discovery (LBD) seeks to systematically recognize and integrate [1], such as Swanson's original finding linking fish oil to the treatment of Raynaud's disease [2]. Since this original analysis, LBD has been extensively replicated, automated and expanded [3-10], leading to new patterns of inference (e.g. locating opposing actions of a disease and a drug on given physiological functions [11]) and to new discoveries [12]. Successes include the automated discovery of protein functions [13, 14] and of the genetic bases of disease [15, 16], as well as the stratification of patient phenotypes [17] and outcomes [18].

A limitation of LBD, however, is its dependence on knowledge extraction. It either relies on human curation, which is not scalable, or on comprehensive text-mining, for which algorithms are less accurate [19, 20]. One of the largest curated m...