This paper proposes a novel model estimation method, which uses nested Gibbs sampling to develop a mixture-of-mixture model.
I. INTRODUCTION

Real-world data often comprise a set of component features, such as images made of a set of pixels and speech comprising a set of frames. These data have a hierarchical structure, as illustrated in Fig. 1. We describe data such as images and speech in terms of higher- and lower-level observations. For example, in speech data obtained from a multi-party conversation, higher-level observations correspond to each speaker's utterances, whose variation is caused by differences between speakers. Lower-level observations correspond to the frame-wise observations comprising each utterance, whose variation is caused by differences in the content of the speech. To cluster utterances by speaker, we need to derive a suitable mathematical representation of an utterance that captures each speaker's characteristics independently of the content of their speech [1].

An effective approach to representing higher-level observations is to model them as stochastic distributions. Thus, we assume that each higher-level observation follows a unique distribution, which represents each speaker's characteristics. Members of exponential families of distributions are widely employed to model higher-level observations because of their usefulness and analytical tractability. However, the underlying assumption of unimodality in these distributions is sometimes too restrictive. For example, frame-wise observations, short-time fast Fourier transforms of acoustic signals, and filter responses in images are known to follow multimodal distributions, which cannot be represented by unimodal distributions [2][3][4].

1 Department of Communications and Computer Engineering, Waseda University, Tokyo, Japan
2 Mitsubishi Electric Research Laboratories, MA, USA
Corresponding author: N. Tawara
Email: tawara@pcl.cs.waseda.ac.jp
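The two-level structure described above can be made concrete with a minimal sketch (the variable names and the feature dimension are illustrative assumptions, not taken from the paper): a corpus is a list of higher-level observations (utterances), each of which is an array of lower-level frame-wise observations.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 13  # e.g. an MFCC-like feature dimension (an assumption for illustration)

# Three utterances of different lengths; each row of an array is one
# frame-wise (lower-level) observation vector.
corpus = [rng.normal(size=(n_frames, dim)) for n_frames in (120, 80, 200)]

# Higher-level variation is across utterances (speakers); lower-level
# variation is across the frames within a single utterance (speech content).
print(len(corpus), corpus[0].shape)
```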
Mixture models are reasonable approximations for representing these multimodal distributions [5,6], and various distributions have been used as components of mixture models, such as the t-distribution [7] and the von Mises-Fisher distribution [8,9]. In particular, Gaussian distributions are widely used as a reasonable approximation for a wide class of probability distributions [10]. By using a mixture distribution to represent each cluster, the whole speaker space is modeled as a mixture of these mixture distributions. We refer to this as a mixture-of-mixture model. The optimal assignment of higher-level observations to clusters can be obtained by evaluating the posterior probability of assigning each observation to each cluster's mixture distribution. Thus, the clustering problem reduces to the problem of estimating this mixture-of-mixture model.

The concept of mixture-of-mixture modeling was introduced to analyze multi-modal data sample observations
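The posterior assignment step described above can be sketched as follows. This is a simplified illustration, not the paper's estimation method: each cluster is represented by a diagonal-covariance Gaussian mixture (assumed parameters), and a higher-level observation (a set of frames) is assigned by comparing its total log-likelihood under each cluster's mixture, weighted by the cluster prior.

```python
import numpy as np
from scipy.special import logsumexp

def log_gauss(x, mean, var):
    # log N(x; mean, diag(var)) for each row of x, shape (T, D) -> (T,)
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def utterance_log_lik(frames, gmm):
    # gmm: dict with "weights" (K,), "means" (K, D), "vars" (K, D).
    # Per-frame mixture log-likelihood, summed over frames (frames assumed i.i.d.).
    per_comp = np.stack([np.log(w) + log_gauss(frames, m, v)
                         for w, m, v in zip(gmm["weights"], gmm["means"], gmm["vars"])])
    return logsumexp(per_comp, axis=0).sum()

def cluster_posteriors(frames, cluster_gmms, cluster_priors):
    # Posterior probability of assigning one higher-level observation
    # (an utterance) to each cluster's mixture distribution.
    log_post = np.log(cluster_priors) + np.array(
        [utterance_log_lik(frames, g) for g in cluster_gmms])
    return np.exp(log_post - logsumexp(log_post))

# Toy usage: two well-separated two-component clusters; frames drawn near
# cluster 0 should be assigned to it with high posterior probability.
rng = np.random.default_rng(0)
g0 = {"weights": np.array([0.5, 0.5]),
      "means": np.array([[0.0, 0.0], [1.0, 1.0]]), "vars": np.ones((2, 2))}
g1 = {"weights": np.array([0.5, 0.5]),
      "means": np.array([[10.0, 10.0], [11.0, 11.0]]), "vars": np.ones((2, 2))}
frames = rng.normal(0.5, 1.0, size=(20, 2))
posteriors = cluster_posteriors(frames, [g0, g1], np.array([0.5, 0.5]))
```

In the full model the mixture parameters themselves must be estimated jointly with the assignments, which is the role of the nested Gibbs sampler proposed in the paper.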