AIControl: Replacing matched control experiments with machine learning improves ChIP-seq peak identification

Hiranuma, Naozumi; Lundberg, Scott; Lee, Su-In

doi:10.1101/278762

Cited by 2 publications

(2 citation statements)

References 52 publications

“…But plausibly, some kinds of peaks or some genomic regions are more likely to be false positives than others. Indeed, other ongoing work in ChIP-seq analysis aims at uncovering and removing local biases in ChIP-seq signals that can unduly influence peak calling (Hiranuma et al., 2016, 2018; Ramachandran et al., 2015). This suggests that peak-specific P-value corrections might be desirable, although it is unclear how this can best be done.…”

Section: Discussion (mentioning)

confidence: 99%

RECAP reveals the true statistical significance of ChIP-seq peak calls

Chitpin

Perkins

2019

Bioinformatics

Motivation: Chromatin Immunoprecipitation (ChIP)-seq is used extensively to identify sites of transcription factor binding or regions of epigenetic modifications to the genome. A key step in ChIP-seq analysis is peak calling, where genomic regions enriched for ChIP versus control reads are identified. Many programs have been designed to solve this task, but nearly all fall into the statistical trap of using the data twice: once to determine candidate enriched regions, and again to assess enrichment by classical statistical hypothesis testing. This double use of the data invalidates the statistical significance assigned to enriched regions, thus the true significance or reliability of peak calls remains unknown. Results: Using simulated and real ChIP-seq data, we show that three well-known peak callers, MACS, SICER and diffReps, output biased P-values and false discovery rate estimates that can be many orders of magnitude too optimistic. We propose a wrapper algorithm, RECAP, that uses resampling of ChIP-seq and control data to estimate a monotone transform correcting for biases built into peak calling algorithms. When applied to null hypothesis data, where there is no enrichment between ChIP-seq and control, P-values recalibrated by RECAP are approximately uniformly distributed. On data where there is genuine enrichment, RECAP P-values give a better estimate of the true statistical significance of candidate peaks and better false discovery rate estimates, which correlate better with empirical reproducibility. RECAP is a powerful new tool for assessing the true statistical significance of ChIP-seq peak calls. Availability and implementation: The RECAP software is available through www.perkinslab.ca or on GitHub at https://github.com/theodorejperkins/RECAP. Supplementary information: Supplementary data are available at Bioinformatics online.

“…For example, one could use generative adversarial networks (GANs) to generate data with the properties of real data and then use the created data to normalize the real data. Future approaches may include integrated strategies, where normalization is intrinsic to a specific type of analysis (e.g., [343]), and generic tools, which normalize the data that can then be used as input to any downstream analysis (e.g., [344,345,346]).…”

Section: Combining Mixed-technology Data (mentioning)

confidence: 99%

Machine learning for integrating data in biology and medicine: Principles, practice, and opportunities

Žitnik

Nguyen

Wang

et al. 2019

Information Fusion

New technologies have enabled the investigation of biology and human health at an unprecedented scale and in multiple dimensions. These dimensions include a myriad of properties describing genome, epigenome, transcriptome, microbiome, phenotype, and lifestyle. No single data type, however, can capture the complexity of all the factors relevant to understanding a phenomenon such as a disease. Integrative methods that combine data from multiple technologies have thus emerged as critical statistical and computational approaches. The key challenge in developing such approaches is the identification of effective models to provide a comprehensive and relevant systems view. An ideal method can answer a biological or medical question, identifying important features and predicting outcomes, by harnessing heterogeneous data across several dimensions of biological variation. In this Review, we describe the principles of data integration and discuss current methods and available implementations. We provide examples of successful data integration in biology and medicine. Finally, we discuss current challenges in biomedical integrative methods and our perspective on the future development of the field.

RECAP reveals the true statistical significance of ChIP-seq peak calls

Chitpin

Perkins

2018

Preprint

Motivation: ChIP-seq is used extensively to identify sites of transcription factor binding or regions of epigenetic modifications to the genome. The fundamental bioinformatics problem is to take ChIP-seq read data and data representing some kind of control, and determine genomic regions that are enriched in the ChIP-seq versus the control, also called "peak calling." While many programs have been designed to solve this task, nearly all fall into the statistical trap of using the data twice: once to determine candidate enriched regions, and a second time to assess enrichment by methods of classical statistical hypothesis testing. This double use of the data has the potential to invalidate the statistical significance assigned to enriched regions, or "peaks", and as a consequence, to invalidate false discovery rate estimates. Thus, the true significance or reliability of peak calls remains unknown. Results: We show, through extensive simulation studies of null hypothesis data, that three well-known peak callers, MACS, SICER and diffReps, output optimistically biased p-values, and therefore optimistic false discovery rate estimates, in some cases by orders of magnitude. We also propose a new wrapper algorithm called RECAP that uses resampling of ChIP-seq and control data to estimate and correct for biases built into peak calling algorithms. RECAP also enables, for the first time, local false discovery rate analysis, so that the likelihood of individual peaks being true positives or false positives can be estimated based on their re-calibrated p-values. RECAP is a powerful new tool for assessing the true statistical significance of ChIP-seq peak calls. Availability: The RECAP software is available at www.perkinslab.ca.
