GkmExplain: fast and accurate interpretation of nonlinear gapped k-mer SVMs

Shrikumar, Avanti; Prakash, Eva; Kundaje, Anshul

doi:10.1093/bioinformatics/btz322

Cited by 49 publications

(55 citation statements)

References 10 publications


“…Next, we used three complementary approaches, GkmExplain [37], in silico mutagenesis [38], and deltaSVM [39], to predict the allelic impact of 1677 candidate SNPs on chromatin accessibility in each cluster by providing the sequences corresponding to both alleles of each SNP to the models for each of the 24 clusters. All three approaches showed high concordance of predicted allelic effects across all candidate SNPs (Supplementary Fig.…”

Section: Results (mentioning)

confidence: 99%

“…For each SNP in a peak in each of the clusters, we computed GkmExplain [37] importance scores for each position in each of the 1000 bp effect and non-effect allele sequences using each of the 10 gkm-SVM [36] models for the respective cluster. GkmExplain is a method to infer the importance or predictive contribution of every base in an input sequence to its corresponding output prediction from a gkm-SVM model.…”

Section: Methods (mentioning)

confidence: 99%
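The two statements above describe scoring both alleles of a SNP and attributing importance to each base. As a minimal sketch of the simplest of the three approaches named, in silico mutagenesis, the code below perturbs each position and records the change in model output. It is not GkmExplain or deltaSVM. The `score` callable, the `toy_score` stand-in, and the "GATA" motif are hypothetical placeholders for a trained gkm-SVM's decision function.

```python
from typing import Callable, Dict, List

BASES = "ACGT"


def in_silico_mutagenesis(
    seq: str, score: Callable[[str], float]
) -> List[Dict[str, float]]:
    """Return, for each position, score(mutant) - score(reference) per base."""
    ref_score = score(seq)
    effects: List[Dict[str, float]] = []
    for i, ref_base in enumerate(seq):
        row: Dict[str, float] = {}
        for base in BASES:
            if base == ref_base:
                row[base] = 0.0  # the reference base has no effect by definition
            else:
                mutant = seq[:i] + base + seq[i + 1:]
                row[base] = score(mutant) - ref_score
        effects.append(row)
    return effects


def allelic_effect(
    ref_seq: str, alt_seq: str, score: Callable[[str], float]
) -> float:
    """Difference in model output between the two allele sequences."""
    return score(alt_seq) - score(ref_seq)


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a model: counts a made-up motif."""
    return float(seq.count("GATA"))


effects = in_silico_mutagenesis("CGATAC", toy_score)
delta = allelic_effect("CGATAC", "CGTTAC", toy_score)
```

With a real model, `score` would wrap the model's prediction for a full-length input (for example the 1000 bp allele sequences mentioned above), and the per-position effects could be compared against attribution scores from other methods.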


Single-cell epigenomic identification of inherited risk loci in Alzheimer’s and Parkinson’s disease

Corces, Shcherbina, Kundu et al. 2020

Preprint

Self Cite


Genome-wide association studies (GWAS) have identified thousands of variants associated with disease phenotypes. However, the majority of these variants do not alter coding sequences, making it difficult to assign their function. To this end, we present a multi-omic epigenetic atlas of the adult human brain through profiling of the chromatin accessibility landscapes and three-dimensional chromatin interactions of seven brain regions across a cohort of 39 cognitively healthy individuals. Single-cell chromatin accessibility profiling of 70,631 cells from six of these brain regions identifies 24 distinct cell clusters and 359,022 cell type-specific regulatory elements, capturing the regulatory diversity of the adult brain. We develop a machine learning classifier to integrate this multi-omic framework and predict dozens of functional single nucleotide polymorphisms (SNPs), nominating gene and cellular targets for previously orphaned GWAS loci. These predictions both inform well-studied disease-relevant genes, such as BIN1 in microglia for Alzheimer's disease (AD) and reveal novel gene-disease associations, such as STAB1 in microglia and MAL in oligodendrocytes for Parkinson's disease (PD). Moreover, we dissect the complex inverted haplotype of the MAPT (encoding tau) PD risk locus, identifying ectopic enhancer-gene contacts in neurons that increase MAPT expression and may mediate this disease association. This work greatly expands our understanding of inherited variation in AD and PD and provides a roadmap for the epigenomic dissection of noncoding regulatory variation in disease.

Alzheimer's disease (AD) and Parkinson's disease (PD) affect ~50 and ~10 million individuals world-wide, as two of the most common neurodegenerative disorders. Several large consortia have assembled genome-wide association studies (GWAS) that associate genetic variants with clinical diagnoses of probable AD dementia [1-4] or probable PD [5-7], or with their characteristic pathologic features. These efforts have led to the identification of dozens of potential risk loci for these prevalent neurodegenerative diseases. One goal of these studies was to build more precise molecular biomarkers of AD or PD, efforts that are beginning to yield encouraging results with polygenic risk scores [8]. The other major goal was to gain deeper insight into the molecular pathogenesis of disease and thereby inform novel therapeutic targets. Some of the risk loci contain coding variants and so have credibility as putative disease mediators. However, most risk loci are in noncoding regions and so it remains unclear if the nominated (often nearest) gene is the functional disease-relevant gene, or if some other gene is involved [9]. Furthermore, even if the nominated gene is a true positive, the noncoding risk locus might regulate additional genes. These challenges remain a fundamental gap in interpreting the etiology of neurodegenerative diseases and d...


“…While CNNs have the potential to outperform these simpler models, they require careful attention to the selection of adequate architectures and hyperparameter optimization. While not a focus of this work, models may be further interpreted with respect to their sequence features learned [41,42], in order to shed more light upon the sequence encoding of gene regulation.…”

Section: Discussion (mentioning)

confidence: 99%

The impact of different negative training data on regulatory sequence predictions

Krützfeldt, Schubach, Kircher 2020

Preprint


Regulatory regions, like promoters and enhancers, cover an estimated 5-15% of the human genome. Changes to these sequences are thought to underlie much of human phenotypic variation and a substantial proportion of genetic causes of disease. However, our understanding of their functional encoding in DNA is still very limited. Applying machine or deep learning methods can shed light on this encoding, and gapped k-mer support vector machines (gkm-SVMs) or convolutional neural networks (CNNs) are commonly trained on putative regulatory sequences. Here, we investigate the impact of negative sequence selection on model performance. By training gkm-SVM and CNN models on open chromatin data and corresponding negative training datasets, both learners and two approaches for negative training data are compared. Negative sets use either genomic background sequences or sequence shuffles of the positive sequences. Model performance was evaluated on three different tasks: predicting elements active in a cell-type, predicting cell-type specific elements, and predicting elements’ relative activity as measured from independent experimental data. Our results indicate strong effects of the negative training data, with genomic backgrounds showing overall best results. Specifically, models trained on highly shuffled sequences perform worse on the complex tasks of tissue-specific activity and quantitative activity prediction, and seem to learn features of artificial sequences rather than regulatory activity. Further, we observe that insufficient matching of genomic background sequences results in model biases. While CNNs achieved and exceeded the performance of gkm-SVMs for larger training datasets, gkm-SVMs gave robust and best results for typical training dataset sizes without the need of hyperparameter optimization.
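The abstract above contrasts genomic-background negatives with negatives made by shuffling the positive sequences. A minimal sketch of the shuffling idea follows, assuming a full per-base shuffle that preserves only nucleotide composition. The abstract does not specify the shuffling schemes the authors used, so the function names and the choice of shuffle here are illustrative assumptions.

```python
import random
from typing import List


def shuffle_sequence(seq: str, rng: random.Random) -> str:
    """Permute the bases of one sequence, keeping its base composition."""
    chars = list(seq)
    rng.shuffle(chars)
    return "".join(chars)


def shuffled_negatives(positives: List[str], seed: int = 0) -> List[str]:
    """Build one shuffled negative per positive sequence (seeded for repeatability)."""
    rng = random.Random(seed)
    return [shuffle_sequence(s, rng) for s in positives]


positives = ["ACGTGATAAC", "TTGACGCATG"]
negatives = shuffled_negatives(positives)
```

A per-base shuffle destroys all k-mer structure, which is one plausible reason a model could learn to separate "artificial" from real sequence rather than regulatory activity, as the abstract reports for highly shuffled negatives.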


“…We compare the predictive and attribution performance of SVM to CNN models. Details of training and attribution with mutagenesis and GkmExplain, an integrated gradient method [31], can be found in the supplementary material.…”

Section: K-mer-based Methods (mentioning)

confidence: 99%

Uncovering tissue-specific binding features from differential deep learning

Phuycharoen, Zarrineh, Bridoux et al. 2019

Preprint


Motivation: Transcription factors (TFs) can bind DNA in a cooperative manner, enabling a mutual increase in occupancy. Through this type of interaction, alternative binding sites can be preferentially bound in different tissues to regulate tissue-specific expression programmes. Recently, deep learning models have become state-of-the-art in various pattern analysis tasks, including applications in the field of genomics. We therefore investigate the application of convolutional neural network (CNN) models to the discovery of sequence features determining cooperative and differential TF binding across tissues. Results: We analyse ChIP-seq data from MEIS, TFs which are broadly expressed across mouse branchial arches, and HOXA2, which is expressed in the second and more posterior branchial arches. By developing models predictive of MEIS differential binding in all three tissues we are able to accurately predict HOXA2 co-binding sites. We evaluate transfer-like and multitask approaches to regularising the high-dimensional classification task with a larger regression dataset, allowing for creation of deeper and more accurate models. We test the performance of perturbation and gradient-based attribution methods in identifying the HOXA2 sites from differential MEIS data. Our results show that deep regularised models significantly outperform shallow CNNs as well as k-mer methods in the discovery of tissue-specific sites bound in vivo. Availability: For implementation and models please visit https://doi.org/10.5281/zenodo.2635463.
