We address the problem of detecting changes in clustering structure in multivariate time series. Based on the minimum description length (MDL) principle, we propose an algorithm that tracks changes in clustering structure so that the sum of the code-length for the data and the code-length for the clustering changes is minimized. We employ a Gaussian mixture model (GMM) to represent the clustering and compute the code-length for data sequences using normalized maximum likelihood (NML) coding. The proposed algorithm handles clustering dynamics, including merging, splitting, emergence, and disappearance of clusters, from the unified viewpoint of the MDL principle. Using artificial data sets, we empirically demonstrate that the proposed method detects cluster changes significantly more accurately than an existing method based on statistical tests and than AIC/BIC-based methods. We further use real customer transaction data sets to demonstrate the validity of our algorithm in market analysis, showing that it detects changes in customer groups that correspond to changes in the real market environment.
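The objective described above, a data code-length plus a code-length for changes in the clustering, can be minimized over a sequence of cluster counts by dynamic programming. The following is a minimal sketch under stated assumptions, not the authors' algorithm: the table `data_cost` stands in for per-window NML code-lengths (the numbers are made up for illustration), and `change_cost`, with its parameter `alpha`, is a hypothetical transition code-length. The code-length of the initial cluster count is treated as a constant and omitted.

```python
import math

def change_cost(k_prev, k_next, alpha=0.1, n_candidates=3):
    """Hypothetical code-length (in nats) for moving between candidates.

    Assumption: the structure stays the same with probability 1 - alpha and
    otherwise moves to one of the other candidates uniformly at random.
    """
    if k_prev == k_next:
        return -math.log(1.0 - alpha)
    return -math.log(alpha / (n_candidates - 1))

def min_total_code_length(data_cost, change_cost):
    """Viterbi-style search over candidate indices.

    data_cost[t][k] is the code-length of window t under candidate k.
    Returns (path, total): the candidate index chosen at each time step
    and the minimized sum of data and change code-lengths.
    """
    n_candidates = len(data_cost[0])
    best = list(data_cost[0])  # best[k]: cheapest total ending in candidate k
    back = []                  # back-pointers, one list per step after the first
    for t in range(1, len(data_cost)):
        new_best, pointers = [], []
        for k in range(n_candidates):
            prev = min(range(n_candidates),
                       key=lambda j: best[j] + change_cost(j, k))
            new_best.append(best[prev] + change_cost(prev, k) + data_cost[t][k])
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    k = min(range(n_candidates), key=lambda j: best[j])
    total = best[k]
    path = [k]
    for pointers in reversed(back):  # walk the back-pointers to recover the path
        k = pointers[k]
        path.append(k)
    path.reverse()
    return path, total

# Illustrative (made-up) code-lengths for 4 windows and candidates 1, 2, 3 clusters.
candidates = [1, 2, 3]
data_cost = [
    [10.0, 5.0, 9.0],
    [10.0, 5.0, 9.0],
    [9.0, 8.0, 4.0],
    [10.0, 9.0, 4.0],
]
path, total = min_total_code_length(data_cost, change_cost)
print([candidates[k] for k in path], round(total, 3))
```

In this toy input the cheapest sequence switches from two to three clusters at the third window. The change penalty is what keeps the estimate from switching whenever a single window slightly favors another cluster count.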
This paper addresses the problem of estimating, from a given data sequence, the number of mixture components in a Gaussian mixture model. Our approach is to compute the normalized maximum likelihood (NML) code-length for the data sequence relative to a Gaussian mixture model, and then to find the mixture size that minimizes the NML code-length. Minimization of the NML code-length follows Rissanen's minimum description length (MDL) principle. For discrete domains, Kontkanen and Myllymäki proposed a method for efficiently computing the NML code-length for specific models. For continuous domains, however, how to compute the NML code-length efficiently has remained an open problem. We propose a method for efficiently computing the NML code-length for Gaussian mixture models. We develop it using an approximation of the NML code-length under a restriction of the data domain, together with a generating-function technique. We apply it to the problem of determining the optimal number of clusters in clustering with a Gaussian mixture model, where the mixture size is the number of clusters. Using artificial data sets and benchmark data sets, we empirically demonstrate that our estimate of the mixture size converges to the true one significantly faster than the estimates obtained with AIC and BIC.
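To make the notion of an NML code-length concrete, the sketch below computes it for the discrete case mentioned above, a multinomial model, using the recurrence C(K+2, n) = C(K+1, n) + (n/K) C(K, n) that is usually attributed to Kontkanen and Myllymäki. It only illustrates the discrete-domain idea and is not the Gaussian mixture method of this paper; the function names are chosen here for illustration.

```python
import math

def multinomial_nml_normalizer(n_categories, n):
    """Normalizer C(K, n): the sum, over all length-n sequences on K symbols,
    of the maximized likelihood of each sequence."""
    if n == 0:
        return 1.0
    c_prev = 1.0  # C(1, n): a single category always has likelihood 1
    if n_categories == 1:
        return c_prev
    # C(2, n) by direct summation over the count h of the first symbol.
    c_curr = 0.0
    for h in range(n + 1):
        c_curr += math.comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
    # Recurrence: C(k + 2, n) = C(k + 1, n) + (n / k) * C(k, n).
    for k in range(1, n_categories - 1):
        c_prev, c_curr = c_curr, c_curr + (n / k) * c_prev
    return c_curr

def multinomial_nml_code_length(counts):
    """NML code-length in nats: -log(max likelihood) + log(normalizer)."""
    n = sum(counts)
    log_max_likelihood = sum(c * math.log(c / n) for c in counts if c > 0)
    return -log_max_likelihood + math.log(
        multinomial_nml_normalizer(len(counts), n))

print(multinomial_nml_normalizer(3, 2))
print(multinomial_nml_code_length([1, 1]))
```

The second term, the logarithm of the normalizer, is the model-complexity penalty. For very large n the direct sum for C(2, n) may lose precision or overflow in floating point, so this form is meant for small illustrative inputs only.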