Predicting disease status for a complex human disease using genomic data is an important, yet challenging, step in personalized medicine. Among many challenges, the so-called curse of dimensionality results in unsatisfactory performance for many state-of-the-art machine learning algorithms. A major recent advance in machine learning is the rapid development of deep learning algorithms that can efficiently extract meaningful features from high-dimensional and complex datasets through a stacked and hierarchical learning process. Deep learning has shown breakthrough performance in several areas, including image recognition, natural language processing, and speech recognition. However, the performance of deep learning in predicting disease status using genomic datasets is still not well studied. In this article, we review the four relevant articles that we found through a thorough literature search. All four articles used autoencoders to project high-dimensional genomic data to a low-dimensional space and then applied state-of-the-art machine learning algorithms to predict disease status based on the low-dimensional representations. This deep learning approach outperformed existing prediction approaches, such as prediction based on probe-wise screening and prediction based on principal component analysis. The limitations of the current deep learning approach and possible improvements are also discussed.
PeerJ Preprints | https://doi.org/10.7287/peerj.preprints
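The encode-then-classify idea described in this abstract can be illustrated with a deliberately tiny sketch. It assumes a one-unit linear autoencoder with tied weights trained on synthetic three-feature data; the reviewed articles used deeper autoencoders on real genomic data, and every name here (`train_tied_autoencoder`, `encode`, `samples`) is hypothetical rather than taken from those articles.

```python
import random


def encode(w, x):
    """Project one sample onto the single hidden unit: z = w . x"""
    return sum(wi * xi for wi, xi in zip(w, x))


def reconstruction_error(w, data):
    """Mean squared error between each sample and its reconstruction w * z."""
    total = 0.0
    for x in data:
        z = encode(w, x)
        total += sum((xi - wi * z) ** 2 for xi, wi in zip(x, w))
    return total / len(data)


def train_tied_autoencoder(data, steps=500, lr=0.1):
    """Fit the tied-weight linear autoencoder by batch gradient descent."""
    dim = len(data[0])
    w = [0.1] * dim  # small, non-zero starting weights
    for _ in range(steps):
        grad = [0.0] * dim
        for x in data:
            z = encode(w, x)
            r = [xi - wi * z for xi, wi in zip(x, w)]  # reconstruction residual
            rw = sum(ri * wi for ri, wi in zip(r, w))
            for j in range(dim):
                # Gradient of ||x - w (w . x)||^2 with respect to w[j]
                grad[j] += -2.0 * (z * r[j] + rw * x[j]) / len(data)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


# Synthetic data (an assumption for illustration): samples lie near one direction.
rng = random.Random(0)
direction = [1 / 3, 2 / 3, 2 / 3]
samples = []
for _ in range(200):
    t = rng.uniform(-1.0, 1.0)
    samples.append([t * d + rng.gauss(0.0, 0.05) for d in direction])

initial_error = reconstruction_error([0.1, 0.1, 0.1], samples)
weights = train_tied_autoencoder(samples)
final_error = reconstruction_error(weights, samples)

# One-dimensional codes that a downstream classifier would take as input.
codes = [encode(weights, x) for x in samples]
```

A linear, tied-weight autoencoder like this one recovers roughly the same subspace as principal component analysis, so any advantage over PCA would have to come from the nonlinear, stacked layers that this sketch omits.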
Most predictive models based on gene expression data do not leverage information related to gene splicing, despite the fact that splicing is a fundamental feature of eukaryotic gene expression. Cigarette smoking is an important environmental risk factor for many diseases, and it has profound effects on gene expression. Using smoking status as a prediction target, we developed deep neural network predictive models using gene-, exon-, and isoform-level quantifications from RNA sequencing data in 2,557 subjects in the COPDGene Study. We observed that models using exon and isoform quantifications clearly outperformed gene-level models when using data from 5 genes from a previously published prediction model. Whereas the test set performance of the previously published model was 0.82 in the original publication, our exon-based models including an exon-to-isoform mapping layer achieved a test set AUC (area under the receiver operating characteristic curve) of 0.88, which improved to an AUC of 0.94 using exon quantifications from a larger set of genes. Isoform variability is an important source of latent information in RNA-seq data that can be used to improve clinical prediction models.
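The abstract does not say how the exon-to-isoform mapping layer is built. One plausible reading, sketched below as an assumption, is a masked linear layer in which each isoform unit receives input only from the exons annotated as part of that isoform. The exon and isoform names, the weights, and the ReLU activation are all hypothetical.

```python
def build_mask(exon_ids, isoform_to_exons):
    """Return a 0/1 matrix with one row per isoform and one column per exon."""
    column = {exon: j for j, exon in enumerate(exon_ids)}
    mask = []
    for exons in isoform_to_exons.values():
        row = [0.0] * len(exon_ids)
        for exon in exons:
            row[column[exon]] = 1.0
        mask.append(row)
    return mask


def mapping_layer(exon_values, weights, mask):
    """Masked linear layer followed by ReLU: one output per isoform."""
    outputs = []
    for w_row, m_row in zip(weights, mask):
        # Masked-out weights contribute nothing, whatever their value.
        total = sum(w * m * x for w, m, x in zip(w_row, m_row, exon_values))
        outputs.append(max(0.0, total))
    return outputs


# Hypothetical annotation: two isoforms of one gene sharing exon e1.
exon_ids = ["e1", "e2", "e3", "e4"]
isoform_to_exons = {"iso_A": ["e1", "e2", "e3"], "iso_B": ["e1", "e4"]}
mask = build_mask(exon_ids, isoform_to_exons)

weights = [[0.5, 0.5, 0.5, 0.5], [1.0, 1.0, 1.0, 1.0]]
out = mapping_layer([2.0, 4.0, 6.0, 8.0], weights, mask)

# Changing e4 alone should leave iso_A untouched, because e4 is not in iso_A.
changed = mapping_layer([2.0, 4.0, 6.0, 100.0], weights, mask)
```

In a trainable version the mask would also be applied to the gradients, so that connections between an isoform and exons outside it stay at zero throughout training.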
Rationale: Emphysema is a key component of COPD with important prognostic implications. Identifying blood-based biomarkers of emphysema will facilitate early diagnosis and possible development of targeted therapies. Objectives: Discover blood transcriptomic and proteomic biomarkers for chest computed tomography-quantified emphysema in smokers and develop predictive biomarker panels. Methods: Emphysema blood biomarker discovery was performed using differential gene expression, alternative splicing, and protein association analyses in a training set of 2,370 COPDGene participants with available whole blood RNA sequencing, plasma SomaScan proteomics, and clinical data. Validation was conducted in a testing set of 1,016 COPDGene subjects. Since body mass index (BMI) and emphysema often co-occur, we performed a mediation analysis to quantify the effect of BMI on gene and protein associations with emphysema. Predictive models were also developed using elastic net to predict quantitative emphysema from cell blood count, RNA sequencing, and proteomic biomarkers. Model accuracy was assessed by the area under the receiver operating characteristic curve (AUROC) for subjects stratified into tertiles of emphysema severity. Measurements and Main Results: 4,913 genes, 1,478 isoforms, 386 exons, and 881 proteins were significantly associated with emphysema (FDR 10%). 75% of genes and 77% of proteins were mediated by BMI. The significantly enriched biological pathways were involved in inflammation and cell differentiation, differing between the most and least BMI-mediated genes. The cell blood count plus protein model achieved the highest performance with an AUROC of 0.89. Conclusions: Blood transcriptome and proteome-wide analyses reveal key biological pathways of emphysema and enhance the prediction of emphysema.
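The abstract does not specify how AUROC was computed across tertiles. The sketch below assumes one common choice: rank subjects by observed emphysema, keep the lowest and highest tertiles, and ask how often the model scores a high-tertile subject above a low-tertile one. The function names and the nine toy values are hypothetical and are not data from the study.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which matches the usual rank-based definition.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def top_vs_bottom_tertile_auroc(observed, predicted):
    """AUROC for separating the highest from the lowest tertile of `observed`."""
    order = sorted(range(len(observed)), key=lambda i: observed[i])
    third = len(order) // 3
    low, high = order[:third], order[-third:]
    scores = [predicted[i] for i in low + high]
    labels = [0] * len(low) + [1] * len(high)
    return auroc(scores, labels)


# Toy values only: observed emphysema measure and model predictions.
observed = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 30.0, 31.0, 32.0]
predicted = [0.2, 0.1, 0.6, 0.4, 0.5, 0.3, 0.7, 0.5, 0.9]
result = top_vs_bottom_tertile_auroc(observed, predicted)
```

The pairwise loop is quadratic in the number of subjects, which is fine for illustration; a rank-sum formulation gives the same value more efficiently on cohorts of realistic size.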