Constraint-based mining techniques on sequence databases have been studied extensively over the last few years, and efficient algorithms make it possible to compute complete collections of patterns (e.g., sequences) that satisfy conjunctions of monotonic and/or anti-monotonic constraints. Studying new applications of these techniques, we believe that a primitive constraint that enforces sufficient similarity w.r.t. a given reference sequence would be extremely useful and should benefit from this recent algorithmic breakthrough. A nontrivial similarity constraint is, however, neither monotonic nor anti-monotonic. Therefore, we have studied its definition as a conjunction of two constraints that satisfy the desired monotonicity properties: a pattern is called similar to a reference pattern x when its longest common subsequence (LCS) with x is long enough (the monotonic part) and when the number of deletions needed to reduce it to that LCS is small enough (the anti-monotonic part). We provide an experimental validation that confirms the added value of this approach on a biological database. Classical issues such as scalability and pruning efficiency are discussed.
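The two-part similarity constraint described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `lcs_length` and `is_similar` and the thresholds `min_lcs` and `max_deletions` are hypothetical, and the number of deletions is assumed to be the pattern length minus the LCS length.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                # Extend the best common subsequence of the two prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one symbol from either sequence.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def is_similar(pattern, reference, min_lcs, max_deletions):
    """Conjunction of the two constraints (thresholds are assumed parameters)."""
    common = lcs_length(pattern, reference)
    # Assumption: deletions = symbols of the pattern that are not part of the LCS.
    deletions = len(pattern) - common
    # common >= min_lcs is the monotonic part: extending the pattern cannot shrink the LCS.
    # deletions <= max_deletions is the anti-monotonic part: removing a symbol from the
    # pattern lowers its length by 1 and the LCS by at most 1, so deletions cannot grow.
    return common >= min_lcs and deletions <= max_deletions
```

For example, with reference `"AGT"`, the pattern `"ACGT"` has an LCS of length 3 and needs one deletion, so it passes with `min_lcs=3, max_deletions=1` but fails when no deletion is allowed.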
In many application domains (e.g., WWW usage mining, telecommunication data analysis, molecular biology), large sequence databases are available and yet under-exploited. The inductive database framework assumes that both such databases and the various patterns holding within them can be queried. In this setting, queries that return patterns are called inductive queries, and solving them is one of the main topics in database mining research. Indeed, constraint-based mining techniques on sequence databases have been studied extensively over the last few years, and efficient algorithms make it possible to compute complete collections of patterns (e.g., sequences) that satisfy conjunctions of monotonic and/or anti-monotonic constraints (e.g., minimal and maximal frequency constraints) in potentially large sequence databases. Studying new applications of these techniques, we consider fault tolerance and softness to be extremely important issues for tackling real-life data analysis. In this paper, we address some of the open problems that arise when computing soft occurrences of patterns within database sequences instead of the classical exact-matching ones. Such an extension is not trivial, since it prevents the clever use of monotonicity for pruning the search space. We describe our proposal and provide an experimental validation on real-life clickstream data that confirms the added value of this approach.
There is a critical need for new and efficient computational methods aimed at discovering putative transcription factor binding sites (TFBSs) in promoter sequences. Among the existing methods, two families can be distinguished: statistical or stochastic approaches, and combinatorial approaches. Here we focus on a complete approach that combines exhaustive combinatorial motif extraction with a statistical Twilight Zone Indicator (TZI), applied to two datasets, a positive set and a negative one, which represent the result of a classical differential expression experiment. Our approach relies on the existence of prior biological information in the form of two sets of promoters of differentially expressed genes. We describe the complete procedure used for extracting either exact or degenerate motifs, ranking these motifs, and finding their known related TFBSs. We exemplify this approach using two different sets of promoters. The first set consists of promoters of genes either repressed or not by the transforming form of the v-erbA oncogene. The second set consists of genes whose expression varies between self-renewing and differentiating progenitors. The biological meaning of the TFBSs found is discussed and, for one TF, its biological involvement is demonstrated. This study therefore illustrates the power of using relevant biological information in the form of a set of differentially expressed genes, which is a classical outcome of most transcriptomics studies. This makes it possible to drastically reduce the search space and to design an adapted statistical indicator. Taken together, this allows the biologist to concentrate on a small number of putatively interesting TFs.