Abstract: Driver mutations propel oncogenesis and occur much less frequently than passenger mutations. The need for automatic and accurate identification of driver mutations has increased dramatically with the exponential growth of mutation data. Current computational solutions to identify driver mutations rely on sequence homology. Here we construct a machine learning-based framework that does not rely on sequence homology or domain knowledge to predict driver missense mutations. A windowing approach to represent the l…
“…Medical big and high-dimensional data may cause inefficiency and low accuracy. To overcome this issue, many researchers utilize feature extraction algorithms in healthcare informatics [Soliman et al 2015].…”
The explosive growth and widespread accessibility of digital health data have led to a surge of research activity in the healthcare and data sciences fields. The conventional approaches for health data management have achieved limited success as they are incapable of handling the huge amount of complex data with high volume, high velocity, and high variety. This article presents a comprehensive overview of the existing challenges, techniques, and future directions for computational health informatics in the big data age, with a structured analysis of the historical and state-of-the-art methods. We have summarized the challenges into four Vs (i.e., volume, velocity, variety, and veracity) and proposed a systematic data-processing pipeline for generic big data in health informatics, covering data capturing, storing, sharing, analyzing, searching, and decision support. Specifically, numerous techniques and algorithms in machine learning are categorized and compared. On the basis of this material, we identify and discuss the essential prospects lying ahead for computational health informatics in this big data age.
“…Regression-based methods appeared in 11 selected papers, most of which adopted logistic regression [26,29,31,32,37,56,57]. We also found papers using regularized regressions, including Ridge [49] and Lasso regression [23].…”
Section: Methods Based On Supervised Learning
mentioning
confidence: 95%
“…The first proposals by Carter et al [19] and Capriotti et al [20] were based on these algorithms. Among the SVM-based approaches, whereas most papers adopted the traditional SVM algorithm [20,22,24,27,31,32,39,55,56,57,58], we observed three papers using OneClass SVM [45,49,59] and one paper using Sequential Minimal Optimization (SMO) [28]. SVM is a popular and consolidated technique in the field, as it continues to be largely applied throughout the years since 2011.…”
Section: Methods Based On Supervised Learning
mentioning
confidence: 99%
“…Among these, six papers [23,29,36,53,54,57] aimed to distinguish oncogene and tumor suppressor gene (TSG) (i.e., the two subclasses of CDGs), whereas the others focused on classifying a given gene as CDG or not. Seven papers targeted predictions on mutation level [24,27,31,37,42,43,58], most of which restricted the analysis for missense mutations. We also found one paper aiming at identifying cancer modules to discover cancer driver genes [25] and other focusing on the prediction of false positive CDGs [55].…”
Section: Overview Of Selected Papers
mentioning
confidence: 99%
“…Finally, amino acids substitution scores were employed by several studies [19,22,24,27,28], most of which integrated distinct substitution scoring matrices. Tan et al [22], for instance, defined 51 features by integrating dozens of substitution scoring matrices from the AAIndex database, which was explored in other studies [28,31]. The evolution-based subcategory was employed in 14 studies, most of which computed evolutionary conservation scores using distinct strategies or tools [19,20,24,27,28,34,37,42,43,54,57].…”
Identifying the genes and mutations that drive the emergence of tumors is a major step to improve understanding of cancer and identify new directions for disease diagnosis and treatment. Despite the large volume of genomics data, the precise detection of driver mutations and their carrying genes, known as cancer driver genes, from the millions of possible somatic mutations remains a challenge. Computational methods play an increasingly important role in identifying genomic patterns associated with cancer drivers and developing models to predict driver events. Machine learning (ML) has been the engine behind many of these efforts and provides excellent opportunities for tackling remaining gaps in the field. Thus, this survey aims to perform a comprehensive analysis of ML-based computational approaches to identify cancer driver mutations and genes, providing an integrated, panoramic view of the broad data and algorithmic landscape within this scientific problem. We discuss how the interactions among data types and ML algorithms have been explored in previous solutions and outline current analytical limitations that deserve further attention from the scientific community. We hope that by helping readers become more familiar with significant developments in the field brought by ML, we may inspire new researchers to address open problems and advance our knowledge towards cancer driver discovery.
Identifying the genes and mutations that drive the emergence of tumors is a critical step to improving our understanding of cancer and identifying new directions for disease diagnosis and treatment. Despite the large volume of genomics data, the precise detection of driver mutations and their carrying genes, known as cancer driver genes, from the millions of possible somatic mutations remains a challenge. Computational methods play an increasingly important role in discovering genomic patterns associated with cancer drivers and developing predictive models to identify these elements. Machine learning (ML), including deep learning, has been the engine behind many of these efforts and provides excellent opportunities for tackling remaining gaps in the field. Thus, this survey aims to perform a comprehensive analysis of ML-based computational approaches to identify cancer driver mutations and genes, providing an integrated, panoramic view of the broad data and algorithmic landscape within this scientific problem. We discuss how the interactions among data types and ML algorithms have been explored in previous solutions and outline current analytical limitations that deserve further attention from the scientific community. We hope that by helping readers become more familiar with significant developments in the field brought by ML, we may inspire new researchers to address open problems and advance our knowledge towards cancer driver discovery.
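The citing excerpt above describes features built by integrating several amino acid substitution scoring matrices. As a rough illustration only, the sketch below turns one missense substitution into a feature vector by looking up the same residue pair in more than one matrix. The two matrices, their scores, and the function names `lookup` and `substitution_features` are hypothetical placeholders invented for this example; they are not taken from AAIndex or from any of the cited studies.

```python
from typing import Dict, List, Tuple

# A substitution matrix maps a (reference, alternate) residue pair to a score.
Matrix = Dict[Tuple[str, str], float]

# Hypothetical toy matrices with made-up scores (NOT real AAIndex or BLOSUM values).
TOY_MATRIX_A: Matrix = {("A", "V"): 0.0, ("R", "K"): 2.0, ("G", "W"): -2.0}
TOY_MATRIX_B: Matrix = {("A", "V"): 1.0, ("R", "K"): 3.0, ("G", "W"): -4.0}


def lookup(matrix: Matrix, ref: str, alt: str) -> float:
    """Return the score for ref->alt, treating the matrix as symmetric."""
    if (ref, alt) in matrix:
        return matrix[(ref, alt)]
    if (alt, ref) in matrix:
        return matrix[(alt, ref)]
    raise KeyError(f"no score for substitution {ref}->{alt}")


def substitution_features(ref: str, alt: str, matrices: List[Matrix]) -> List[float]:
    """Build one feature per matrix for a single missense substitution."""
    return [lookup(m, ref, alt) for m in matrices]


if __name__ == "__main__":
    # One feature vector per mutation; each position comes from a different matrix.
    print(substitution_features("A", "V", [TOY_MATRIX_A, TOY_MATRIX_B]))
    print(substitution_features("K", "R", [TOY_MATRIX_A, TOY_MATRIX_B]))
```

Each matrix contributes one column, so integrating dozens of matrices, as the excerpt describes, would simply lengthen the list passed as `matrices`. The symmetric lookup is a simplifying assumption of this sketch, since not every published matrix is symmetric.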