DeepC: Predicting chromatin interactions using megabase scaled deep neural networks and transfer learning

Gosden, M; Downes, Damien J.; Brown, Richard C.; Telenius, Jelena; Teh, Yee Whye; Lunter, Gerton; Hughes, Jim R.

doi:10.1101/724005

Cited by 12 publications

(13 citation statements)

References 57 publications


“…We also did not find much predictive gain in integrating local features from the two data sources, perhaps because local sequences were not informative enough for a higher prediction accuracy. We emphasize that, although our findings suggest that local DNA sequence data may not be sufficient to well predict EPIs, a new study has shown some promising results of using mega-base scale sequence data incorporating large-scale genomic context [39]; this is in agreement with improved prediction performance of including not only local epigenomic features of an enhancer and a promoter, but also the window region between them [40]. More studies are warranted.…”

Section: Discussion (supporting)

confidence: 79%

Local Epigenomic Data are more Informative than Local Genome Sequence Data in Predicting Enhancer-Promoter Interactions Using Neural Networks

Xiao, Zhuang, Pan. 2019. Genes


Enhancer-promoter interactions (EPIs) are crucial for transcriptional regulation. Mapping such interactions proves useful for understanding disease regulations and discovering risk genes in genome-wide association studies. Some previous studies showed that machine learning methods, as computational alternatives to costly experimental approaches, performed well in predicting EPIs from local sequence and/or local epigenomic data. In particular, deep learning methods were demonstrated to outperform traditional machine learning methods, and using DNA sequence data alone could perform either better than or almost as well as only utilizing epigenomic data. However, most, if not all, of these previous studies were based on randomly splitting enhancer-promoter pairs as training, tuning, and test data, which has recently been pointed out to be problematic; due to multiple and duplicating/overlapping enhancers (and promoters) in enhancer-promoter pairs in EPI data, such random splitting does not lead to independent training, tuning, and test data, thus resulting in model over-fitting and over-estimating predictive performance. Here, after correcting this design issue, we extensively studied the performance of various deep learning models with local sequence and epigenomic data around enhancer-promoter pairs. Our results confirmed much lower performance using either sequence or epigenomic data alone, or both, than reported previously. We also demonstrated that local epigenomic features were more informative than local sequence data. Our results were based on an extensive exploration of many convolutional neural network (CNN) and feed-forward neural network (FNN) structures, and of gradient boosting as a representative of traditional machine learning.

Genes 2020, 11, 41

Experimental methods based on chromosome conformation capture (3C, 4C, and Hi-C) or extensions that incorporate ChIP-sequencing such as paired-end tag sequencing (ChIA-PET) are, however, costly, and the results are only available for a few cell types [4][5][6][7]. Computational tools offer an alternative by utilizing various DNA sequence and/or epigenomic annotation data to predict EPIs with machine learning models built from experimentally obtained EPI data [8][9][10][11].

Whalen, et al. [11] reported that a gradient boosting method, called TargetFinder, accurately distinguished between interacting and non-interacting enhancer-promoter pairs based on epigenomic profiles. They included histone modifications and transcription factor binding (based on ChIP-seq), and DNase I hypersensitive sites (DNase-seq) with a focus on distal interaction (>10 kb) in high resolution. The idea was further extended to predict EPIs solely from local DNA sequence data and achieved high prediction accuracy [12][13][14]. In particular, convolutional neural networks (CNNs), known for capturing stationary patterns in data with successful applications in image and text recognition [15,16], were shown to perform well in predicting EPIs based on DNA sequence alone. A natural questi...
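The splitting problem described in the abstract above can be illustrated with a small sketch. The abstract does not say which corrected splitting scheme the authors used; the version below assumes, purely for illustration, that each pair carries a chromosome label and that whole chromosomes are held out, which is one common way to keep an enhancer from appearing in both training and test data. The function name `split_by_chromosome` and the dictionary keys are hypothetical.

```python
import random
from collections import defaultdict


def split_by_chromosome(pairs, test_fraction=0.2, seed=0):
    """Split enhancer-promoter pairs so that no chromosome is shared.

    `pairs` is a list of dicts with the hypothetical keys
    'chrom', 'enhancer', and 'promoter'.
    """
    # Group pairs by chromosome so a whole chromosome moves as one unit.
    by_chrom = defaultdict(list)
    for pair in pairs:
        by_chrom[pair["chrom"]].append(pair)

    # Shuffle chromosome names reproducibly, then hold out a fraction of them.
    chroms = sorted(by_chrom)
    random.Random(seed).shuffle(chroms)
    n_test = max(1, round(len(chroms) * test_fraction))
    test_chroms = set(chroms[:n_test])

    train = [p for c in chroms if c not in test_chroms for p in by_chrom[c]]
    test = [p for c in chroms if c in test_chroms for p in by_chrom[c]]
    return train, test


# Toy data (invented): enhancers e1 and e3 each take part in two pairs,
# so a random split over pairs could place them on both sides.
pairs = [
    {"chrom": "chr1", "enhancer": "e1", "promoter": "p1"},
    {"chrom": "chr1", "enhancer": "e1", "promoter": "p2"},
    {"chrom": "chr2", "enhancer": "e2", "promoter": "p3"},
    {"chrom": "chr3", "enhancer": "e3", "promoter": "p4"},
    {"chrom": "chr3", "enhancer": "e3", "promoter": "p5"},
]
train, test = split_by_chromosome(pairs)
```

Because an enhancer lies on a single chromosome, holding out whole chromosomes guarantees that no enhancer or promoter is shared between the two sets, at the cost of less evenly sized splits.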


“…Variants within open chromatin were assessed for potential damage to transcription factor binding footprints using Sasquatch 20 (7-mer, WIMM Fibach Erythroid, Exhaustive). Variants within open chromatin were further classified based on their predicted effect on chromatin accessibility using a deep convolutional neural net 21 (deepHaem). Model architecture and data encoding were adapted from DeepSEA 36 with the following modifications.…”

Section: Methods (mentioning)

confidence: 99%

“…The platform addresses the fact that both causal and non-causal variants may lie in open chromatin. Using DNaseI footprinting and a machine learning approach the platform prioritises variants predicted to directly affect the binding of transcription factors or alter chromatin accessibility 20,21. Having prioritised putative regulatory causal variants, the platform then links the regulatory elements in which they occur to genes using NG Capture-C, the highest resolution chromatin conformation capture (3C) method currently available for targeting numerous loci 22,23.…”

Section: Introduction (mentioning)

confidence: 99%

An integrated platform to systematically identify causal variants and genes for polygenic human traits

Downes, Hill, Nußbaum, et al. 2019. Preprint. Self Cite


Abstract: Genome-wide association studies (GWAS) have identified over 150,000 links between common genetic variants and human traits or complex diseases. Over 80% of these associations map to polymorphisms in non-coding DNA. Therefore, the challenge is to identify disease-causing variants, the genes they affect, and the cells in which these effects occur. We have developed a platform using ATAC-seq, DNaseI footprints, NG Capture-C and machine learning to address this challenge. Applying this approach to red blood cell traits identifies a significant proportion of known causative variants and their effector genes, which we show can be validated by direct in vivo modelling.

Identification of the variation of the genome that determines the risk of common chronic and infectious diseases informs on their primary causes, which leads to preventative or therapeutic approaches and insights. Whilst genome-wide association studies (GWASs) have identified thousands of chromosome regions [1], the identification of the causal genes, variants and cell types remains a major bottleneck. This is due to three major features of the genome and its complex association with disease susceptibility. Trait-associated variants are often tightly associated, through linkage disequilibrium (LD), with tens or hundreds of other variants, mostly single-nucleotide polymorphisms (SNPs), any one or more of which could be causal; the majority (>85%) of the variants identified in GWAS lie within the non-coding genome [2]. Although non-coding regions are increasingly well annotated, many variants do not correspond to known regulatory elements, and even when they do, it is rarely known which genes these elements control, and in which cell types. New technical approaches to link variants to the genes they control are rapidly improving but are often limited by their sensitivity and resolution [3][4][5][6]; and because so few causal variants have been unequivocally linked to the genes they affect, the mechanisms by which non-coding variants alter gene expression remain unknown in all but a few cases; and, third, the complexity of gene regulation and cell/cell interactions means that knowing when in development, in which cell type, in which activation state, and within which pathway(s) a causal variant exerts its effect is usually impossible to predict. Although significant progress is being made, currently, none of these problems has been adequately solved.

Here, we have developed an integrated platform of experimental and computational methods to prioritise likely causal variants, link them to the genes they regulate, and determine the mechanism by which they alter gene function. To illustrate the approach we have initially focussed on a single haematopoietic lineage: the development of mature red blood cells (RBC), for which all stages of lineage specification and differentiation from a haematopoietic stem cell to a RBC are known, and can be r...


“…In Schwessinger et al 30 , the authors report successful predictions of Hi-C maps at 10kb resolution using a similar deep convolutional neural network approach, deepC. While deepC has a similar 'trunk' to Akita, it differs greatly in the architecture of the 'head', data preprocessing, and training schemes.…”

Section: Supplemental Note 2: Differences With deepC (mentioning)

confidence: 99%

“…The architectures and layers that might best reflect the process of loop extrusion, believed to organize mammalian interphase chromosomes, 29 or other mechanisms of genome organization remain open questions. The near future promises exciting progress: recently, a similar CNN model, deepC, was posted to bioRxiv 30 . While deepC has a similar 'trunk' to Akita, it differs greatly in the architecture of the 'head', data pre-processing, and training schemes (Supplemental Note 2).…”

mentioning

confidence: 99%

Predicting 3D genome folding from DNA sequence

Fudenberg, Kelley, Pollard. 2019. Preprint


In interphase, the human genome sequence folds in three dimensions into a rich variety of locus-specific contact patterns. Here we present a deep convolutional neural network, Akita, that accurately predicts genome folding from DNA sequence alone. Representations learned by Akita underscore the importance of CTCF and reveal a complex grammar underlying genome folding. Akita enables rapid in silico predictions for sequence mutagenesis, genome folding across species, and genetic variants.

Main text: Recent research has advanced our understanding of the proteins driving and the sequences underpinning 3D genome folding in mammalian interphase, including the interplay between CTCF and cohesin [1], and their roles in development and disease [2]. Still, while disruptions of single bases can alter genome folding, in other cases genome folding is surprisingly resilient to large-scale deletions and structural variants [3,4]. As follows, predicting the consequences of perturbing any individual CTCF site, or other regulatory element, on local genome folding remains a challenge.

Previous machine learning approaches have either: (1) relied on epigenomic information as inputs [5-7], which does not readily allow for predicting effects of DNA variants, or (2) predicted derived features of genome folding (e.g. peaks [8,9]), which depend heavily on minor algorithmic differences [10]. Making quantitative predictions from sequence poses a substantial challenge: base pair information must be propagated to megabase scales where locus-specific patterns become salient in chromosome contact maps.

Convolutional neural networks (CNNs) have emerged as powerful tools for modelling genomic data as a function of DNA sequence, directly learning DNA sequence features from the data. CNNs now make state-of-the-art predictions for transcription factor binding, DNA accessibility, transcription, and RNA-binding [11][12][13][14]. DNA sequence features learned by CNNs can be subsequently post-processed into interpretable forms [15]. Recently, Basenji [16] demonstrated that CNNs can process very long sequences (~131kb) to learn distal regulatory element influences, suggesting that genome folding could be tractable with CNNs.

Here we present Akita, a deep CNN to transform input DNA sequence into predicted locus-specific genome folding. Akita takes in ~1Mb (2^20 bp) of DNA sequence and predicts contact
