Evaluation of methods for modeling transcription factor sequence specificity

Weirauch, Matthew T.; Coté, Atina G.; Norel, Raquel; Annala, Matti; Zhao, Yue; Riley, Todd; Sáez-Rodríguez, Julio; Cokelaer, Thomas; Vedenko, Anastasia; Talukder, Shaheynoor; Bussemaker, Harmen J.; Morris, Quaid; Bulyk, Martha L.; Stolovitzky, Gustavo; Hughes, Timothy R.

doi:10.1038/nbt.2486

Cited by 347 publications

(505 citation statements)

References 55 publications

Supporting

Mentioning

486

Contrasting

Unclassified


“…2). The best method reported in the original evaluation (Team_D, a k-mer-based model) and the best reported in the revised evaluation (FeatureREDUCE, a hybrid PWM/k-mer model) both had reasonable, but not the best, performance on in vivo data, which might be due to overfitting to PBM noise 17.…”

Section: Training DeepBind and Scoring Sequences (mentioning)

confidence: 93%

“…Ascertaining DNA sequence specificities. To evaluate DeepBind's ability to characterize DNA-binding protein specificity, we used PBM data from the revised DREAM5 TF-DNA Motif Recognition Challenge by Weirauch et al. 17. The PBM data represent 86 different mouse transcription factors, each measured using two independent array designs.…”

Section: Training DeepBind and Scoring Sequences (mentioning)

confidence: 99%

“…Weirauch et al 17 20 and Seed-and-Wobble 21 . For each individual algorithm, they optimized the data preprocessing steps to attain best test performance.…”

Section: Training DeepBind and Scoring Sequences (mentioning)

confidence: 99%

“…For each individual algorithm, they optimized the data preprocessing steps to attain best test performance. Methods were evaluated using the Pearson correlation between the predicted and actual probe intensities, and values from the area under the receiver operating characteristic (ROC) curve (AUC) computed by setting high-intensity probes as positives and the remaining probes as negatives 17. To the best of our knowledge, this is the largest independent evaluation of this type.…”

Section: Training DeepBind and Scoring Sequences (mentioning)

confidence: 99%
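The two metrics named in the citation statement above (Pearson correlation between predicted and measured probe intensities, and AUC with high-intensity probes as positives) can be illustrated with a minimal sketch. This is not the evaluation code used by Weirauch et al. or by DeepBind. The function names, the `top_fraction` cutoff for calling a probe "high-intensity", and the toy intensity values are all assumptions made for illustration only.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def label_high_intensity(intensities, top_fraction=0.25):
    """Mark the brightest probes as positives.

    top_fraction is a hypothetical cutoff chosen for this sketch; the
    quoted text does not state the threshold that was actually used.
    """
    k = max(1, int(len(intensities) * top_fraction))
    cutoff = sorted(intensities, reverse=True)[k - 1]
    return [value >= cutoff for value in intensities]

def auc(scores, labels):
    """ROC AUC as the probability that a randomly chosen positive
    outscores a randomly chosen negative (ties count as one half)."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy data (invented): measured probe intensities and model predictions.
measured = [9.1, 7.4, 3.2, 2.8, 2.5, 2.1, 1.9, 1.2]
predicted = [0.95, 0.60, 0.70, 0.20, 0.30, 0.10, 0.25, 0.05]

labels = label_high_intensity(measured, top_fraction=0.25)
r = pearson(predicted, measured)
a = auc(predicted, labels)
print(f"Pearson r = {r:.3f}, AUC = {a:.3f}")
```

The pairwise AUC used here is quadratic in the number of probes, which is fine for a toy example but would be replaced by a rank-based computation on a full array.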

“…To assess the ability of DeepBind models trained using in vitro PBM data to predict sequence specificities measured using in vivo ChIP-seq data, we followed the method described by Weirauch et al. 17. Predicting transcription factor binding in vivo is more difficult because it is affected by other proteins, the chromatin state and the physical accessibility of the binding site.…”

Section: Training DeepBind and Scoring Sequences (mentioning)

confidence: 99%


Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning

et al. 2015

Self Cite


Knowing the sequence specificities of DNA- and RNA-binding proteins is essential for developing models of the regulatory processes in biological systems and for identifying causal disease variants. Here we show that sequence specificities can be ascertained from experimental data with 'deep learning' techniques, which offer a scalable, flexible and unified computational approach for pattern discovery. Using a diverse array of experimental data and evaluation metrics, we find that deep learning outperforms other state-of-the-art methods, even when training on in vitro data and testing on in vivo data. We call this approach DeepBind and have built a stand-alone software tool that is fully automatic and handles millions of sequences per experiment. Specificities determined by DeepBind are readily visualized as a weighted ensemble of position weight matrices or as a 'mutation map' that indicates how variations affect binding within a specific sequence.

DNA- and RNA-binding proteins play a central role in gene regulation, including transcription and alternative splicing. The sequence specificities of a protein are most commonly characterized using position weight matrices 1 (PWMs), which are easy to interpret and can be scanned over a genomic sequence to detect potential binding sites. However, growing evidence indicates that sequence specificities can be more accurately captured by more complex techniques 2-5. Recently, 'deep learning' has achieved record-breaking performance in a variety of information technology applications 6,7. We adapted deep learning methods to the task of predicting sequence specificities and found that they compete favorably with the state of the art. Our approach, called DeepBind, is based on deep convolutional neural networks and can discover new patterns even when the locations of patterns within sequences are unknown, a task for which traditional neural networks require an exorbitant amount of training data.

There are several challenging aspects in learning models of sequence specificity using modern high-throughput technologies. First, the data come in qualitatively different forms. Protein binding microarrays (PBMs) 8 and RNAcompete assays 9 provide a specificity coefficient for each probe sequence, whereas chromatin immunoprecipitation (ChIP)-seq 10 provides a ranked list of putatively bound sequences of varying length, and HT-SELEX 11 generates a set of very high affinity sequences. Second, the quantity of data is large. A typical high-throughput experiment measures between 10,000 and 100,000 sequences, and it is computationally demanding to incorporate them all. Third, each data acquisition technology has its own artifacts, biases and limitations, and we must discover the pertinent specificities despite these unwanted effects. For example, ChIP-seq reads often localize to "hyper-ChIPable" regions of the genome near highly expressed genes 12.

DeepBind (Fig. 1) addresses the above challenges. (i) It can be applied to both microarray and sequencing data; (ii) it can learn from millions of...


DNA Motif Databases and Their Uses

Stormo

2015

Current Protocols in Bioinformatics


Transcription factors (TFs) recognize and bind to specific DNA sequences. The specificity of a TF is usually represented as a position weight matrix (PWM). Several databases of DNA motifs exist and are used in biological research to address important biological questions. This overview describes PWMs and some of the most commonly used motif databases, as well as a few of their common applications. © 2015 by John Wiley & Sons, Inc.
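As a rough illustration of the PWM representation described in the abstract above, the sketch below converts a small count matrix into log-odds scores and slides it along a DNA sequence to find the best-scoring window. It is a generic textbook-style example, not code from any of the motif databases the overview covers. The count matrix (a made-up TATA-like motif), the uniform background of 0.25, the pseudocount of 1.0, and all function names are assumptions for illustration.

```python
import math

BASES = "ACGT"

def counts_to_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2 odds against a
    uniform background. Pseudocount and background are assumed values."""
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def scan(pwm, sequence):
    """Score every window of the sequence (forward strand only)."""
    width = len(pwm)
    return [
        sum(pwm[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    ]

# Hypothetical count matrix for a 4-bp motif resembling TATA.
counts = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
]

pwm = counts_to_pwm(counts)
seq = "GCGTATAGC"
scores = scan(pwm, seq)
best = max(range(len(scores)), key=scores.__getitem__)
print(f"best window starts at {best}: {seq[best:best + len(pwm)]}")
```

A practical scanner would also score the reverse complement and apply a score threshold; both are left out here to keep the sketch short.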
