“…Transfer learning has been shown to dramatically reduce the amount of training needed for related classification tasks and improves the overall predictive performance compared to training from scratch 28 . In the pre-training step, we trained a CNN on 4,863,024 1 kb sequences annotated with a total of 919 ChIP-seq and DNase-seq profiles collected from ENCODE 26 and the Epigenomics Roadmap Project 29 across dozens of cell types ( Methods ).…”
Section: Predicting Binding Status Of Transcription Factor Motif Occu
Transcription factors (TFs) bind DNA by recognizing highly specific DNA sequence motifs, typically 6-12 bp long. A TF motif can occur tens of thousands of times in the human genome, but only a small fraction of those sites are actually bound. Despite the availability of genome-wide TF binding maps for hundreds of TFs, predicting whether a given motif occurrence is bound and identifying the influential context features remain challenging. Here we present a machine learning framework that leverages existing convolutional neural network architectures and state-of-the-art model interpretation techniques to identify, visualize, and interpret the context features most important for determining binding activity for a particular TF. We apply our framework to predict binding at motifs for 38 TFs in a lymphoblastoid cell line and achieve superior classification performance compared to existing frameworks. We compute importance scores for context regions at single-base-pair resolution and uncover known and novel determinants of TF binding. Finally, we demonstrate that important context bases are under increased purifying selection compared to nearby bases and are enriched in disease-associated variants identified by genome-wide association studies.
“…The mutation map allows assessing the relative importance of variants compared with other possible variants in the vicinity. The MMSplice implementation followed the Kipoi API (version 0.65), a programmatic standard for predictive models in genomics (Avsec et al, ). In particular, it is compatible with the Kipoi variant effect prediction plugin allowing the generation of mutation maps.…”
Pathogenic genetic variants often primarily affect splicing. However, it remains difficult to quantitatively predict whether and how genetic variants affect splicing. In 2018, the fifth edition of the Critical Assessment of Genome Interpretation proposed two splicing prediction challenges based on experimental perturbation assays: Vex‐seq, assessing exon skipping, and MaPSy, assessing splicing efficiency. We developed a modular modeling framework, MMSplice, the performance of which was among the best on both challenges. Here we provide insights into the modeling assumptions of MMSplice and its individual modules. We furthermore illustrate how MMSplice can be applied in practice for individual genome interpretation, using the MMSplice VEP plugin and the Kipoi variant interpretation plugin, which are directly applicable to VCF files.
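The mutation map mentioned in the citing statement can be illustrated with a small in-silico mutagenesis sketch. This is not the MMSplice or Kipoi implementation: `mutation_map` and `toy_score` are hypothetical names introduced here, and the toy scorer (which simply counts GT dinucleotides) stands in for whatever predictive model would be used in practice. The sketch only shows the general idea of scoring every possible single-base substitution relative to the reference sequence.

```python
from typing import Callable, Dict, List

BASES = "ACGT"


def mutation_map(seq: str, score: Callable[[str], float]) -> List[Dict[str, float]]:
    """Return, for each position, the score change caused by each base.

    Entry [pos][alt] is score(mutated sequence) - score(reference sequence);
    the reference base at each position gets 0.0 by construction.
    """
    ref_score = score(seq)
    rows: List[Dict[str, float]] = []
    for pos, ref_base in enumerate(seq):
        row: Dict[str, float] = {}
        for alt in BASES:
            if alt == ref_base:
                row[alt] = 0.0
            else:
                mutated = seq[:pos] + alt + seq[pos + 1:]
                row[alt] = score(mutated) - ref_score
        rows.append(row)
    return rows


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a model: counts GT dinucleotides."""
    return float(seq.count("GT"))


if __name__ == "__main__":
    for pos, row in enumerate(mutation_map("AGTA", toy_score)):
        print(pos, row)
```

Comparing the entries within a window shows which substitutions the scorer is most sensitive to, which is the sense in which a mutation map ranks a variant against other possible variants nearby.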
“…These models have been integrated into the Kipoi API [30], allowing them to be applied with very little overhead to a VCF file containing human variant data (see also Figure 6). As a result the models are easy to use and straightforward to integrate into existing variant annotation pipelines.…”
Section: Modelling 5'UTR Of Any Length Using Frame Pooling
The 5' untranslated region (5'UTR) plays a key role in regulating mRNA translation and consequently protein abundance. Accurate modeling of 5'UTR regulatory sequences should therefore provide insights into translational control mechanisms and help interpret genetic variants. Recently, a model was trained on a massively parallel reporter assay to predict mean ribosome load (MRL), a proxy for translation rate, directly from 5'UTR sequence with a high degree of accuracy. However, this model is restricted to the sequence lengths investigated in the reporter assay and therefore cannot be applied to the majority of human sequences without a substantial loss of information. Here, we introduce frame pooling, a novel neural network operation that enabled the development of an MRL prediction model for 5'UTRs of any length. Our model shows state-of-the-art performance on fixed-length randomized sequences, while offering better generalization performance on longer sequences and on a variety of translation-related genome-wide datasets. Variant interpretation is demonstrated on a 5'UTR variant of the gene HBB associated with beta-thalassemia. Frame pooling could find applications in other bioinformatics predictive tasks. Moreover, our model, released open source, could help pinpoint pathogenic genetic variants. Recently a massively parallel reporter assay (MPRA) has been developed which provided a
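The abstract does not spell out how frame pooling works, so the following is only a rough sketch of one plausible reading: per-position feature vectors (for example, convolutional activations) are split by reading frame, and each frame is summarized by a global max and a global mean, giving a fixed-size output for any input length. The function name `frame_pool`, the choice to count frames from the 3' end of the sequence, and the zero fill for empty frames are all assumptions made for illustration, not details confirmed by the source.

```python
from typing import List


def frame_pool(features: List[List[float]]) -> List[float]:
    """Pool per-position features separately for each of three reading frames.

    features[i][c] is the value of channel c at sequence position i.
    Returns 3 frames x (max, mean) x channels values, so the output length
    does not depend on the sequence length.
    """
    if not features:
        raise ValueError("features must contain at least one position")
    n_positions = len(features)
    n_channels = len(features[0])
    pooled: List[float] = []
    for frame in range(3):
        # Assumption: frames are counted from the 3' end of the sequence.
        rows = [
            features[i]
            for i in range(n_positions)
            if (n_positions - 1 - i) % 3 == frame
        ]
        if not rows:
            # Assumption: frames with no positions contribute zeros.
            pooled.extend([0.0] * (2 * n_channels))
            continue
        maxima = [max(r[c] for r in rows) for c in range(n_channels)]
        means = [sum(r[c] for r in rows) / len(rows) for c in range(n_channels)]
        pooled.extend(maxima + means)
    return pooled


if __name__ == "__main__":
    short = [[1.0], [2.0], [3.0], [4.0]]
    longer = [[float(i)] for i in range(50)]
    print(frame_pool(short))
    print(len(frame_pool(short)), len(frame_pool(longer)))
```

Because the pooled vector has the same size for a 4-position and a 50-position input, a downstream dense layer can be applied to sequences of any length, which is the property the abstract attributes to the operation.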