DNA copy number aberrations (CNAs) are genetic alterations common in cancer cells. Their transcriptional consequences are still poorly understood. Based on the fact that DNA copy number (CN) is highly correlated with the genomic position, we have applied a segmentation algorithm to gene expression (GE) to explore its relation with CN. We have found a strong correlation between segmented CN (sCN) and segmented GE (sGE), corroborating that CNAs have clear effects on genome-wide expression. We have found out that most of the recurrent regions of sGE are common to those obtained from sCN analysis. Results for two cancer datasets confirm the known targets of aberrations and provide new candidates to study. The suggested methodology allows to find recurrent aberrations specific to sGE, revealing loci where the expression of the genes is independent from their CNs. R code and additional files are available as supplementary material.
Motivation
Patient and sample diversity is one of the main challenges when dealing with clinical cohorts in biomedical genomics studies. During last decade, several methods have been developed to identify biomarkers assigned to specific individuals or subtypes of samples. However, current methods still fail to discover markers in complex scenarios where heterogeneity or hidden phenotypical factors are present. Here, we propose a method to analyze and understand heterogeneous data avoiding classical normalization approaches of reducing or removing variation.
Results
DEcomposing heterogeneous Cohorts using Omic data profiling (DECO) is a method to find significant association among biological features (biomarkers) and samples (individuals) analyzing large-scale omic data. The method identifies and categorizes biomarkers of specific phenotypic conditions based on a recurrent differential analysis integrated with a non-symmetrical correspondence analysis. DECO integrates both omic data dispersion and predictor–response relationship from non-symmetrical correspondence analysis in a unique statistic (called h-statistic), allowing the identification of closely related sample categories within complex cohorts. The performance is demonstrated using simulated data and five experimental transcriptomic datasets, and comparing to seven other methods. We show DECO greatly enhances the discovery and subtle identification of biomarkers, making it especially suited for deep and accurate patient stratification.
Availability and implementation
DECO is freely available as an R package (including a practical vignette) at Bioconductor repository (http://bioconductor.org/packages/deco/).
Supplementary information
Supplementary data are available at Bioinformatics online.
The method is available on CRAN (http://cran.r-project.org/) in the open-source R package calmate, which also includes an add-on to the Aroma Project framework (http://www.aroma-project.org/).
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.