This paper proposes a novel model estimation method, which uses nested Gibbs sampling to develop a mixture-of-mixture model.
I. INTRODUCTION

Real-world data often comprise a set of component features, such as images made of a set of pixels and speech comprising a set of frames. These data have a hierarchical structure, as illustrated in Fig. 1. We describe data such as images and speech in terms of higher- and lower-level observations. For example, in speech data obtained from a multi-party conversation, higher-level observations correspond to each speaker's utterances, whose variation is caused by differences between speakers. Lower-level observations correspond to the frame-wise observations comprising each utterance, whose variation is caused by differences in the content of the speech. To cluster utterances by speaker, we need to derive a suitable mathematical representation of an utterance that captures each speaker's characteristics independently of the content of their speech [1].

An effective approach to representing higher-level observations is to model them as stochastic distributions. Thus, we assume that each higher-level observation follows a unique distribution, which represents each speaker's characteristics. Members of exponential families of distributions are widely employed to model higher-level observations because of their usefulness and analytical tractability. However, the underlying assumption of unimodality in these distributions is sometimes too restrictive. For example, frame-wise observations, short-time fast Fourier transforms of acoustic signals, and filter responses in images are known to follow multimodal distributions, which cannot be represented by unimodal distributions [2][3][4].

1 Department of Communications and Computer Engineering, Waseda University, Tokyo, Japan
2 Mitsubishi Electric Research Laboratories, MA, USA
Corresponding author: N. Tawara
Email: tawara@pcl.cs.waseda.ac.jp
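The two-level structure described above can be made concrete with a minimal sketch (the variable names and the feature dimension are illustrative assumptions, not taken from the paper): a corpus is a list of higher-level observations (utterances), each of which is an array of lower-level frame-wise observations.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 13  # e.g. an MFCC-like feature dimension (an assumption for illustration)

# Three utterances of different lengths; each row of an array is one
# frame-wise (lower-level) observation vector.
corpus = [rng.normal(size=(n_frames, dim)) for n_frames in (120, 80, 200)]

# Higher-level variation is across utterances (speakers); lower-level
# variation is across the frames within a single utterance (speech content).
print(len(corpus), corpus[0].shape)
```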
Mixture models are reasonable approximations for representing these multimodal distributions [5,6], and various distributions have been used as components of mixture models, such as the t-distribution [7] and the von Mises-Fisher distribution [8,9]. In particular, Gaussian distributions are widely used as a reasonable approximation for a wide class of probability distributions [10]. By using a mixture distribution to represent each cluster, the whole speaker space is modeled as a mixture of these mixture distributions. We refer to this as a mixture-of-mixture model. The optimal assignment of higher-level observations to clusters can be obtained by evaluating the posterior probability of assigning each observation to each cluster's mixture distribution. Thus, the clustering problem reduces to the problem of estimating this mixture-of-mixture model.

The concept of mixture-of-mixture modeling was introduced to analyze multi-modal data sample observations
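The posterior assignment step described above can be sketched as follows. This is a simplified illustration, not the paper's estimation method: each cluster is represented by a diagonal-covariance Gaussian mixture (assumed parameters), and a higher-level observation (a set of frames) is assigned by comparing its total log-likelihood under each cluster's mixture, weighted by the cluster prior.

```python
import numpy as np
from scipy.special import logsumexp

def log_gauss(x, mean, var):
    # log N(x; mean, diag(var)) for each row of x, shape (T, D) -> (T,)
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def utterance_log_lik(frames, gmm):
    # gmm: dict with "weights" (K,), "means" (K, D), "vars" (K, D).
    # Per-frame mixture log-likelihood, summed over frames (frames assumed i.i.d.).
    per_comp = np.stack([np.log(w) + log_gauss(frames, m, v)
                         for w, m, v in zip(gmm["weights"], gmm["means"], gmm["vars"])])
    return logsumexp(per_comp, axis=0).sum()

def cluster_posteriors(frames, cluster_gmms, cluster_priors):
    # Posterior probability of assigning one higher-level observation
    # (an utterance) to each cluster's mixture distribution.
    log_post = np.log(cluster_priors) + np.array(
        [utterance_log_lik(frames, g) for g in cluster_gmms])
    return np.exp(log_post - logsumexp(log_post))

# Toy usage: two well-separated two-component clusters; frames drawn near
# cluster 0 should be assigned to it with high posterior probability.
rng = np.random.default_rng(0)
g0 = {"weights": np.array([0.5, 0.5]),
      "means": np.array([[0.0, 0.0], [1.0, 1.0]]), "vars": np.ones((2, 2))}
g1 = {"weights": np.array([0.5, 0.5]),
      "means": np.array([[10.0, 10.0], [11.0, 11.0]]), "vars": np.ones((2, 2))}
frames = rng.normal(0.5, 1.0, size=(20, 2))
posteriors = cluster_posteriors(frames, [g0, g1], np.array([0.5, 0.5]))
```

In the full model the mixture parameters themselves must be estimated jointly with the assignments, which is the role of the nested Gibbs sampler proposed in the paper.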