Abstract: Time series motif discovery is the task of extracting previously unknown recurrent patterns from time series data. It is an important problem within applications that range from finance to health. Many algorithms have been proposed for the task of efficiently finding motifs. Surprisingly, most of these proposals do not focus on how to evaluate the discovered motifs. They are typically evaluated by human experts. This is unfeasible even for moderately sized datasets, since the number of discovered motifs tends …
“…Instead, we have assumed the same length for the pair of segments forming a motif pair. This assumption is well motivated, as practically all existing motif discovery algorithms operate under such constraint (e.g., Lin et al, 2002; Chiu et al, 2003; Tanaka et al, 2005; Mueen et al, 2009; Castro & Azevedo, 2011; Mueen, 2013; Yingchareonthawornchai et al, 2013). It is also motivated for the case where we are interested in pairs of segments of different length, as the most common way to compute the dissimilarity between such segments is by re-sampling them to have the same length.…”
The detection of very similar patterns in a time series, commonly called motifs, has received continuous and increasing attention from diverse scientific communities. In particular, recent approaches for discovering similar motifs of different lengths have been proposed. In this work, we show that such variable-length similarity-based motifs cannot be directly compared, and hence ranked, by their normalized dissimilarities. Specifically, we find that length-normalized motif dissimilarities still have intrinsic dependencies on the motif length, and that the lowest dissimilarities are particularly affected by this dependency. Moreover, we find that such dependencies are generally non-linear and change with the considered data set and dissimilarity measure. Based on these findings, we propose a solution to rank those motifs and measure their significance. This solution relies on a compact but accurate model of the dissimilarity space, using a beta distribution with three parameters that depend on the motif length in a non-linear way. We believe the incomparability of variable-length dissimilarities could go beyond the field of time series, and that modeling strategies similar to the one used here could be of help in a broader context.
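The re-sampling step mentioned in the citation statement above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `resample` and `normalized_dissimilarity` are hypothetical, linear interpolation is an assumed re-sampling scheme, and dividing the squared Euclidean distance by the common length is just one possible length normalization.

```python
import math

def resample(segment, new_len):
    """Linearly interpolate `segment` onto `new_len` evenly spaced points.

    Assumption: linear interpolation; other re-sampling schemes are possible.
    """
    if new_len < 2 or len(segment) < 2:
        raise ValueError("need at least two points")
    scale = (len(segment) - 1) / (new_len - 1)
    out = []
    for i in range(new_len):
        pos = i * scale
        # Clamp so that the last point interpolates inside the final interval.
        lo = min(int(math.floor(pos)), len(segment) - 2)
        frac = pos - lo
        out.append(segment[lo] * (1 - frac) + segment[lo + 1] * frac)
    return out

def normalized_dissimilarity(a, b):
    """Root-mean-square difference after re-sampling both segments
    to the longer of the two lengths (an assumed normalization)."""
    n = max(len(a), len(b))
    ra, rb = resample(a, n), resample(b, n)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ra, rb)) / n)
```

Note that, as the abstract argues, a normalization of this kind does not by itself make dissimilarities of motifs with different lengths directly comparable.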
“…So far, this had been done by meticulous visual inspection, which is bounded by the complexity of the data and the inherent biases of our perception. Relying on our time series representation, these explorations could be done using de-novo motif discovery algorithms, in which a sequence dataset is searched for statistically overrepresented segments in a fast, systematic, and unbiased manner [53, 54]. Such modular decomposition approaches proved to be transformative in dealing with large volumes of data from sequencing and structural studies of DNA, RNA, and proteins [55–57].…”
Tissue morphogenesis relies on repeated use of dynamic behaviors at the levels of intracellular structures, individual cells, and cell groups. Rapidly accumulating live imaging datasets make it increasingly important to formalize and automate the task of mapping recurrent dynamic behaviors (motifs), as is done in speech recognition and other data mining applications. Here, we present a "template-based search" approach for accurate mapping of sub- to multi-cellular morphogenetic motifs using a time series data mining framework. We formulated the task of motif mapping as a subsequence matching problem and solved it using dynamic time warping, while relying on high throughput graph-theoretic algorithms for efficient exploration of the search space. This formulation allows our algorithm to accurately identify the complete duration of each instance and automatically label different stages throughout its progress, such as cell cycle phases during cell division. To illustrate our approach, we mapped cell intercalations during germband extension in the early Drosophila embryo. Our framework enabled statistical analysis of intercalary cell behaviors in wild-type and mutant embryos, comparison of temporal dynamics in contracting and growing junctions in different genotypes, and the identification of a novel mode of iterative cell intercalation. Our formulation of tissue morphogenesis using time series opens new avenues for systematic decomposition of tissue morphogenesis.
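Subsequence matching with dynamic time warping, as described in the abstract above, can be sketched in a few lines. This is a generic textbook formulation under stated assumptions, not the authors' pipeline: the function name `subsequence_dtw` is hypothetical, the local cost is an assumed absolute difference on one-dimensional values, and the graph-theoretic search and stage labelling from the paper are omitted.

```python
def subsequence_dtw(template, series):
    """Find where `template` best matches inside `series` under DTW.

    Returns (cost, end_index): the accumulated cost of the best alignment
    and the index in `series` where that alignment ends.
    Assumption: absolute difference as local cost, 1-D inputs.
    """
    n, m = len(template), len(series)
    inf = float("inf")
    # acc[i][j]: cheapest alignment of template[:i+1] that ends at series[j].
    acc = [[inf] * m for _ in range(n)]
    # Free start: the first template point may align with any series point.
    for j in range(m):
        acc[0][j] = abs(template[0] - series[j])
    for i in range(1, n):
        acc[i][0] = acc[i - 1][0] + abs(template[i] - series[0])
        for j in range(1, m):
            step = min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = abs(template[i] - series[j]) + step
    # Free end: the match may finish at any series position.
    end = min(range(m), key=lambda j: acc[n - 1][j])
    return acc[n - 1][end], end
```

For example, `subsequence_dtw([1, 2, 3], [0, 1, 1, 2, 2, 3, 0])` finds a zero-cost match ending at index 5, since warping absorbs the repeated values. Recovering the start of the match would require backtracking through `acc`, which this sketch leaves out.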
“…Approximate fixed-length motif discovery is largely based upon random projection (CK Algorithm [14]) and Symbolic Aggregate Approximation or SAX [2,15] techniques (discussed further in Section 2.1.1). Of note is the use of iSAX in the MrMotif [16,17] algorithm that derives a set of top-K motifs for a fixed length through increasing SAX resolutions.…”
As big datasets become more widely available, the importance of motif (or repeated pattern) identification and analysis increases. To date, the majority of motif identification algorithms that permit flexibility of sub-sequence length do so over a given range, with the restriction that both sides of an identified sub-sequence pair are of equal length. In this article, motivated by a better localised representation of variations in time series, a novel approach to the identification of motifs is discussed, which allows for some flexibility in side-length. The advantages of this flexibility include improved recognition of localised similar behaviour (manifested as motif shape) over varying timescales, as well as improved interpretation of localised volatility patterns and a visual comparison of relative volatility levels of series at a globalised level. The process described extends and modifies established techniques, namely SAX, MDL and the Matrix Profile, allowing advantageous properties of leading algorithms for data analysis and dimensionality reduction to be incorporated and future-proofed. Although this technique is potentially applicable to any time series analysis, the focus here is financial and energy sector applications, where real-world examples examining S&P500 and Open Power System Data are also provided for illustration.
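For readers unfamiliar with SAX, one of the established techniques named above, the following is a minimal sketch of the classic fixed-length transform (z-normalization, piecewise aggregate approximation, then discretization with equiprobable Gaussian breakpoints). It does not reproduce the variable side-length method of the article; the function name `sax` and the requirement that the series length be divisible by the number of segments are simplifying assumptions.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev

def sax(series, n_segments, alphabet_size):
    """Convert a numeric series into a SAX word of length `n_segments`.

    Assumptions: len(series) is divisible by n_segments, and
    2 <= alphabet_size <= 26.
    """
    if len(series) % n_segments != 0:
        raise ValueError("series length must be divisible by n_segments")
    if not 2 <= alphabet_size <= 26:
        raise ValueError("alphabet_size must be between 2 and 26")
    # Step 1: z-normalize (a constant series maps to all zeros).
    mu, sigma = mean(series), pstdev(series)
    z = [0.0 if sigma == 0 else (x - mu) / sigma for x in series]
    # Step 2: piecewise aggregate approximation (mean of each equal-width frame).
    width = len(series) // n_segments
    paa = [mean(z[k * width:(k + 1) * width]) for k in range(n_segments)]
    # Step 3: breakpoints that split the standard normal into equal-probability bins.
    standard_normal = NormalDist()
    breakpoints = [standard_normal.inv_cdf(k / alphabet_size)
                   for k in range(1, alphabet_size)]
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(letters[bisect_right(breakpoints, v)] for v in paa)
```

With these choices, `sax([1, 1, 2, 2, 3, 3, 4, 4], 4, 4)` yields the word `abcd`, one letter per frame of the steadily rising series.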