STEME: efficient EM to find motifs in large data sets

Motif finding is a difficult problem that has been studied for over 20 years. Some older popular motif finders are not suitable for analysis of the large data sets generated by next-generation sequencing. We recently published an efficient approximation (STEME) to the EM algorithm that is at the core of many motif finders such as MEME. This approximation allows the EM algorithm to be applied to large data sets. In this work we describe several efficient extensions to STEME that are based on the MEME algorithm. Together with the original STEME EM approximation, these extensions make STEME a fully-fledged motif finder with similar properties to MEME. We discuss the difficulty of objectively comparing motif finders. We show that STEME performs comparably to existing prominent discriminative motif finders, DREME and Trawler, on 13 sets of transcription factor binding data in mouse ES cells. We demonstrate the ability of STEME to find long degenerate motifs which these discriminative motif finders do not find. As part of our method, we extend an earlier method due to Nagarajan et al. for the efficient calculation of motif E-values. STEME's source code is available under an open source license and STEME is available via a web interface.

“…STEME's EM algorithm implementation has been described in some detail in the original STEME paper [14], so we do not repeat it here.…”

Section: Methods (mentioning)

confidence: 99%

STEME: A Robust, Accurate Motif Finder for Large Data Sets

Reid

Wernisch

2014

PLoS ONE

Self Cite

“…Both the suffix tree model [43,44] and projection [45,46] model are used as data structures to look for high quality motifs. Meanwhile, the GA (Genetic Algorithm) [47,48] is applied in the PWM model to help it improve performance and Reid et al [49] have combined the GA algorithm with the suffix tree model to make optimizations and advancements. Based on gradually developed Phylogenetic footprinting [50][51][52][53][54] and ChIP-seq techniques [55][56][57] lots of related methods for discovering TFBSs have been put forward.…”

Section: Various Algorithms and Techniques Applying For TFBSs Prediction (mentioning)

confidence: 99%

Noncoding Variants Functional Prioritization Methods Based on Predicted Regulatory Factor Binding Sites

Yang

Zhang

2017

Background: In the post-genomic era, research into the genetic mechanisms of disease increasingly depends on studies of genes, gene networks, and gene-protein interaction networks. To explore gene expression and regulation, researchers have carried out many studies on transcription factors and their binding sites (TFBSs). Building on the large number of TFBS values predicted by deep learning models, further computation and analysis have been done to reveal the relationship between gene mutation and the occurrence of disease. Deep learning methods have been shown to outperform conventional methods in predicting the functions of noncoding variants. Research on predicting the functions of single nucleotide polymorphisms (SNPs) is expected to uncover how gene mutations affect human traits and diseases. Results: We reviewed the conventional TFBS identification methods from different perspectives. For the deep learning methods that predict TFBSs, we discussed related problems such as raw data preprocessing, the structural design of the deep convolutional neural network (CNN), and the measurement of model performance. We then summarized the techniques commonly used to find functional noncoding variants from de novo sequence. Conclusion: With the rapid development of high-throughput assays, more sample data and chromatin features should help improve the prediction accuracy of deep convolutional neural networks for TFBS identification. Meanwhile, gaining more insight into the deep CNN framework itself has proved useful both for improving model performance and for developing designs better suited to the sample data. Based on the feature values predicted by the deep CNN model, a prioritization model for functional noncoding variants would help reveal the effect of gene mutation on disease.

“…Given a set of DNA sequences, these programs search for characteristic motifs using two major approaches: profile-based and consensus-based methods. In particular, the recent increase in data size due to the ChIP-Seq technique led to the development of methods that can accept greater than thousands of DNA sequences (Sharov and Ko, 2009; Li, 2008; Heinz et al, 2010; Kulakovskiy et al, 2010; Bailey, 2011; Machanick and Bailey, 2011; Reid and Wernisch, 2011; Ma et al, 2012; Hartmann et al, 2013). Most of these software programs focused on reductions in computational time, for example, by subsampling input data (e.g., MEME-ChIP (Machanick and Bailey, 2011)), accelerating expectation-maximization steps in profile optimization (e.g., ChIPMunk (Kulakovskiy et al, 2010) and STEME (Reid and Wernisch, 2011), which use a greedy approach and suffix array, respectively), or using enriched sequences as starting points for the motif search (e.g., DREME (Bailey, 2011), cERMIT (Georgiev et al, 2010), and HOMER (Heinz et al, 2010)).…”

Section: Introduction (mentioning)

confidence: 99%

“…Whereas existing methods can partly represent motif ambiguity, these methods are unable to directly answer whether a given TF binds to a specific sequence pattern. For example, the most popular software program MEME (Bailey and Elkan, 1994) and other recently developed software programs (Reid and Wernisch, 2011;Zhang et al, 2013), which adopt the expectation-maximization algorithm, iteratively enrich DNA sequences that contain possible DNA-binding motifs and often converges to a local optimum. Given the nature of this algorithm, discovered DNA-binding motifs can miss non-canonical but significant motifs that were removed from the enriched dataset during the computation.…”

Section: Introduction (mentioning)

confidence: 99%