Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., the genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advances in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to perform integrative analysis of biomedical data acquired from diverse modalities effectively and efficiently. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
Genome-wide analysis of transcription in the malaria parasite Plasmodium falciparum has revealed robust variation in steady-state mRNA abundance throughout the 48-h intraerythrocytic developmental cycle (IDC), suggesting that this process is highly dynamic and tightly regulated. Here, we utilize rapid 4-thiouracil (4-TU) incorporation via pyrimidine salvage to specifically label, capture, and quantify newly-synthesized RNA transcripts at every hour throughout the IDC. This high-resolution global analysis of the transcriptome captures the timing and rate of transcription for each newly synthesized mRNA in vivo, revealing active transcription throughout all IDC stages. Using a statistical model to predict the mRNA dynamics contributing to the total mRNA abundance at each timepoint, we find varying degrees of transcription and stabilization for each mRNA corresponding to developmental transitions. Finally, our results provide new insight into co-regulation of mRNAs throughout the IDC through regulatory DNA sequence motifs, thereby expanding our understanding of P. falciparum mRNA dynamics.
Motivation: There are a number of well-established methods, such as principal component analysis (PCA), for automatically capturing systematic variation due to latent variables in large-scale genomic data. PCA and related methods may directly provide a quantitative characterization of a complex biological variable that is otherwise difficult to precisely define or model. An unsolved problem in this context is how to systematically identify the genomic variables that drive the systematic variation captured by PCA. Because principal components (PCs) (and other estimates of systematic variation) are constructed directly from the genomic variables themselves, conventional measures of statistical significance are artificially inflated by over-fitting. Results: We introduce a new approach, called the jackstraw, that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of PCs. The proposed method can greatly simplify complex significance testing problems encountered in genomics and can be used to identify the genomic variables significantly associated with latent variables. Using simulation, we demonstrate that our method attains accurate measures of statistical significance over a range of relevant scenarios. We consider yeast cell-cycle gene expression data and show that the proposed method can straightforwardly identify genes that are cell-cycle regulated, with an accurate measure of statistical significance. We also analyze gene expression data from post-trauma patients, allowing the gene expression data to provide a molecularly driven phenotype. Using our method, we find a greater enrichment for inflammation-related gene sets compared with the original analysis, which used a clinically defined, although likely imprecise, phenotype.
The proposed method provides a useful bridge between large-scale quantifications of systematic variation and gene-level significance analyses. Availability and implementation: An R software package, called jackstraw, is available on CRAN. Contact: jstorey@princeton.edu
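The jackstraw resampling idea behind the R package can be illustrated compactly. The sketch below is our own numpy rendering of the core procedure, not the package's implementation: permute a small number s of rows, recompute the PC from the partially permuted matrix, and use the association statistics of only the permuted rows as the null distribution, so the null respects the fact that the PC is estimated from the data itself. Function names and defaults are illustrative.

```python
import numpy as np

def jackstraw_pvalues(Y, n_null=100, s=10, rng=None):
    """Empirical p-values for association between rows of Y (variables x samples)
    and the first principal component, via the jackstraw resampling scheme."""
    rng = np.random.default_rng(rng)
    m, n = Y.shape

    def pc1(M):
        Mc = M - M.mean(axis=1, keepdims=True)
        # first right singular vector = first PC across samples
        return np.linalg.svd(Mc, full_matrices=False)[2][0]

    def stats(M, v):
        # squared Pearson correlation of each row with the PC
        Mc = M - M.mean(axis=1, keepdims=True)
        vc = v - v.mean()
        r = Mc @ vc / (np.linalg.norm(Mc, axis=1) * np.linalg.norm(vc) + 1e-12)
        return r ** 2

    obs = stats(Y, pc1(Y))
    null = []
    for _ in range(n_null):
        Yb = Y.copy()
        idx = rng.choice(m, size=s, replace=False)
        for i in idx:
            Yb[i] = rng.permutation(Yb[i])    # break association for s rows only
        null.append(stats(Yb, pc1(Yb))[idx])  # null stats from permuted rows
    null = np.concatenate(null)
    # one-sided empirical p-value with pseudo-count
    return (1 + (null[None, :] >= obs[:, None]).sum(axis=1)) / (1 + null.size)
```

Because only s of m rows are permuted per iteration, the recomputed PC still reflects the systematic variation in the remaining rows, which is what makes the null distribution valid.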
Background: A survey of the presences and absences of specific species across multiple biogeographic units (or bioregions) is used in a broad range of biological studies, from ecology to microbiology. Using binary presence-absence data, we evaluate species co-occurrences that help elucidate relationships among organisms and environments. To summarize the similarity between occurrences of species, the Jaccard/Tanimoto coefficient, the ratio of their intersection to their union, is routinely used. It is natural, then, to identify statistically significant Jaccard/Tanimoto coefficients, which suggest non-random co-occurrence of species. However, statistical hypothesis testing with this similarity coefficient has seldom been used or studied. Results: We introduce a hypothesis test for similarity in biological presence-absence data based on the Jaccard/Tanimoto coefficient. Several key improvements are presented, including unbiased estimation of the expectation and centered Jaccard/Tanimoto coefficients that account for occurrence probabilities. The exact and asymptotic solutions are derived. To overcome the computational burden of high dimensionality, we propose bootstrap and measurement-concentration algorithms to efficiently estimate the statistical significance of binary similarity. Comprehensive simulation studies demonstrate that our proposed methods produce accurate p-values and false discovery rates. The proposed estimation methods are orders of magnitude faster than the exact solution, particularly as dimensionality increases. We showcase their applications in evaluating co-occurrences of bird species on 28 islands of Vanuatu and fish species in 3347 freshwater habitats in France.
The proposed methods are implemented in an open-source R package called jaccard (https://cran.r-project.org/package=jaccard). Conclusion: We introduce a suite of statistical methods for the Jaccard/Tanimoto similarity coefficient for binary data that enables straightforward incorporation of probabilistic measures into analyses of species co-occurrence. Owing to their generality, the proposed methods and implementations are applicable to a wide range of binary data arising from genomics, biochemistry, and other areas of science.
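The basic quantities involved can be sketched in a few lines. The code below is a hedged illustration, not the package's exact, asymptotic, or bootstrap solutions: it computes the Jaccard/Tanimoto coefficient and assesses its significance with a simple permutation null, which also yields an empirical null expectation for centering. Function names and defaults are ours.

```python
import numpy as np

def jaccard(x, y):
    """Jaccard/Tanimoto coefficient for two binary vectors:
    |intersection| / |union|."""
    inter = np.sum(x & y)
    union = np.sum(x | y)
    return inter / union if union else 0.0

def jaccard_test(x, y, n_perm=999, rng=None):
    """One-sided (greater) permutation p-value for non-random co-occurrence.
    Returns (observed coefficient, empirical null expectation, p-value)."""
    rng = np.random.default_rng(rng)
    obs = jaccard(x, y)
    null = np.array([jaccard(x, rng.permutation(y)) for _ in range(n_perm)])
    p = (1 + np.sum(null >= obs)) / (1 + n_perm)
    return obs, null.mean(), p
```

Subtracting the returned null expectation from the observed coefficient gives a crude centered coefficient, analogous in spirit to the centering the paper derives analytically from occurrence probabilities.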
Over the last decade, the field of Artificial Intelligence (AI) has evolved extensively. Modern radiation oncology relies on advanced computational methods aimed at personalization and high diagnostic and therapeutic precision. The quantity of available imaging data and rapid developments in Machine Learning (ML), particularly Deep Learning (DL), have triggered research on uncovering "hidden" biomarkers and quantitative features from anatomical and functional medical images. Deep Neural Networks (DNNs) have achieved outstanding performance and broad adoption in image-processing tasks. Lately, DNNs have been considered for radiomics, and their potential for explainable AI (XAI) may aid classification and prediction in clinical practice. However, most studies use limited datasets and lack generalized applicability. In this study we review the basics of radiomics feature extraction, DNNs in image analysis, and the major interpretability methods that enable explainable AI. Furthermore, we discuss the crucial requirement for multicenter recruitment of large datasets, which increases biomarker variability, in order to establish the clinical value of radiomics and to develop robust explainable AI models.
Background: Genetic risk scores have been developed for coronary artery disease and atherosclerosis, but they are not predictive of adverse cardiovascular events. We asked whether peripheral blood expression profiles may be predictive of acute myocardial infarction (AMI) and/or cardiovascular death. Methods: Peripheral blood samples from 338 subjects aged 62 ± 11 years with coronary artery disease (CAD) were analyzed in two phases (discovery, N = 175; replication, N = 163) and followed for a mean of 2.4 years for cardiovascular death. Gene expression was measured on Illumina HT-12 microarrays with two different normalization procedures to control for technical and biological covariates. Whole-genome genotyping was used to support comparative genome-wide association studies of gene expression. Analysis of variance was combined with receiver operating characteristic (ROC) curve and survival analyses to define a transcriptional signature of cardiovascular death. Results: In both phases, there was significant differential expression between healthy and AMI groups, with overall down-regulation of genes involved in T-lymphocyte signaling and up-regulation of inflammatory genes. Expression quantitative trait loci analysis provided evidence for altered local genetic regulation of transcript abundance in AMI samples. On follow-up there were 31 cardiovascular deaths. A principal component (PC1) score capturing the covariance of 238 genes that were differentially expressed between deceased subjects and survivors in the discovery phase significantly predicted the risk of cardiovascular death in the replication and combined samples (hazard ratio = 8.5, P < 0.0001) and improved the C-statistic (area under the curve 0.82 to 0.91, P = 0.03) after adjustment for traditional covariates. Conclusions: A specific blood gene expression profile is associated with a significant risk of death in Caucasian subjects with CAD. It comprises a subset of transcripts whose expression is also altered during acute myocardial infarction.
This document aims to establish a broad, international strategy for the implementation of open scholarship that meets the needs of different national and regional communities but works globally. Scholarly research can be idealised as an inspirational process for advancing our collective knowledge to the benefit of all humankind. However, current research practices often struggle with a range of tensions, in part because this collective (or "commons") ideal conflicts with the competitive system in which most scholars work, and in part because much of the infrastructure of the scholarly world is becoming largely digital. What is broadly termed Open Scholarship is an attempt to realign modern research practices with this ideal. We do not propose a definition of Open Scholarship, but recognise that it is a holistic term encompassing many disciplines, practices, and principles, sometimes also referred to as Open Science or Open Research. We choose the term Open Scholarship to be more inclusive of these other terms. When we refer to science in this document, we do so historically and as shorthand for more general scholarship. The purpose of this document is to provide a concise analysis of where the global Open Scholarship movement currently stands: what the common threads and strengths are, where the greatest opportunities and challenges lie, and how we can more effectively work together as a global community to recognise and address the top strategic priorities. This document was inspired by the Foundations for OER Strategy Development and work in the FORCE11 Scholarly Commons Working Group, and was developed by an open contribution working group. Our hope is that this document will serve as a foundational resource for continuing discussions and initiatives about implementing effective strategies to help streamline the integration of Open Scholarship practices into a modern, digital research culture.
Through this, we hope to extend the reach and impact of Open Scholarship into a global context, making sure that it is truly open for all. We also hope that this document will evolve as the conversations around Open Scholarship progress, and help to provide useful insight for both global co-ordination and local action. We believe this is a step forward in making Open Scholarship the norm. Ultimately, we expect the impact of widespread adoption of Open Scholarship to be diverse. We expect novel research practices to accelerate the pace of innovation and thereby stimulate critical industries around the world. We could also expect to see an increase in public trust in science and scholarship as transparency becomes more normative. As such, we expect interest in Open Scholarship to increase at multiple levels, owing to its inherent influence on society and global economics.
Integration of multi-omics data in cardiovascular diseases (CVDs) holds great potential for translational discoveries. By analyzing abundance levels of heterogeneous molecules over time, we may uncover biological interactions and networks that were previously unidentifiable. However, to effectively perform integrative analysis of temporal multi-omics data, computational methods must account for the heterogeneity and complexity of the data. To this end, we performed unsupervised classification of proteins and metabolites in mice during cardiac remodeling using two innovative deep learning (DL) approaches. First, a long short-term memory (LSTM)-based variational autoencoder (LSTM-VAE) was trained on time-series numeric data, and the low-dimensional embeddings extracted from the LSTM-VAE were then used for clustering. Second, deep convolutional embedded clustering (DCEC) was applied to images of temporal trends. Instead of a two-step procedure, DCEC performs a joint optimization of image reconstruction and cluster assignment. Additionally, we performed K-means clustering, partitioning around medoids (PAM), and hierarchical clustering. Pathway enrichment analysis using the Reactome knowledgebase demonstrated that the DL methods yielded larger numbers of significant biological pathways than the conventional clustering algorithms. In particular, DCEC resulted in the highest number of enriched pathways, suggesting the strength of its unified framework based on visual similarities. Overall, unsupervised DL is shown to be a promising analytical approach for integrative analysis of temporal multi-omics data.
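The two-step embed-then-cluster strategy used with the LSTM-VAE can be sketched in miniature. The code below is a numpy-only stand-in, with PCA substituting for the learned LSTM-VAE embedding and a minimal Lloyd's k-means replacing the clustering step; it illustrates the pipeline shape, not the study's actual models, and all names are ours.

```python
import numpy as np

def embed(X, k=2):
    """Step 1: low-dimensional embedding of temporal profiles (rows of X).
    PCA here is a simple stand-in for a trained LSTM-VAE latent space."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def kmeans(Z, n_clusters, n_iter=50):
    """Step 2: cluster the embeddings with Lloyd's algorithm,
    seeded deterministically by farthest-point selection."""
    centers = [Z[0]]
    for _ in range(n_clusters - 1):
        d = np.min([np.linalg.norm(Z - c, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])  # next seed: farthest from current seeds
    centers = np.array(centers)
    for _ in range(n_iter):
        d = np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = Z[labels == c].mean(axis=0)
    return labels
```

DCEC, by contrast, would fold both steps into a single loss (reconstruction plus cluster assignment), which this two-step sketch deliberately does not do.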