Using aggregation to improve the performance of mixture Gaussian acoustic models

Hazen, Timothy J.; Halberstadt, A.K.

doi:10.1109/icassp.1998.675349

Cited by 10 publications

(5 citation statements)

References 8 publications


“…Since the EM algorithm is only guaranteed to converge to a local maximum, the final model parameters are highly dependent on the initial conditions obtained from the K-means clustering. To improve the performance and robustness of the mixture models, we used a technique called aggregation (Hazen and Halberstadt 1998), which is described in Section 4.2.…”

Section: Speech Recognition System (mentioning)

confidence: 99%
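The aggregation idea referenced in the statement above can be sketched in a few lines. This is a minimal illustration, assuming that aggregation means averaging the densities of several mixture models trained from different initializations; with equal weighting, that is the same as pooling all components into one larger mixture and scaling each weight by 1/N. The function names `gmm_pdf` and `aggregate`, the model layout, and the two toy models are hypothetical and are not taken from the cited paper.

```python
import math

def gmm_pdf(x, model):
    """Density of a diagonal-covariance GMM at one feature vector x.

    model is a list of (weight, mean, variance) components, where mean and
    variance are sequences with the same length as x.
    """
    total = 0.0
    for weight, mean, var in model:
        log_p = 0.0
        for xi, mu, v in zip(x, mean, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * v) + (xi - mu) ** 2 / v)
        total += weight * math.exp(log_p)
    return total

def aggregate(models):
    """Merge N GMMs into one: pool all components, scale each weight by 1/N."""
    n = len(models)
    return [(w / n, mean, var) for model in models for (w, mean, var) in model]

# Two hypothetical 1-D models standing in for two separate training trials
# (for example, EM runs started from different K-means initializations).
model_a = [(0.6, [0.0], [1.0]), (0.4, [3.0], [0.5])]
model_b = [(0.5, [0.2], [1.2]), (0.5, [2.8], [0.7])]
combined = aggregate([model_a, model_b])
```

Evaluating `gmm_pdf(x, combined)` gives the average of the two individual model densities at `x`, so the combined model is less sensitive to any single initialization, at the cost of scoring N times as many components.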

Subword-based approaches for spoken document retrieval

Zue

2000

Speech Communication


This thesis explores approaches to the problem of spoken document retrieval (SDR), which is the task of automatically indexing and then retrieving relevant items from a large collection of recorded speech messages in response to a user-specified natural language text query. We investigate the use of subword unit representations for SDR as an alternative to words generated by either keyword spotting or continuous speech recognition. Our investigation is motivated by the observation that word-based retrieval approaches face the problem of either having to know the keywords to search for a priori, or requiring a very large recognition vocabulary in order to cover the contents of growing and diverse message collections. The use of subword units in the recognizer constrains the size of the vocabulary needed to cover the language, and the use of subword units as indexing terms allows for the detection of new user-specified query terms during retrieval.

Four research issues are addressed. First, what are suitable subword units and how well can they perform? Second, how can these units be reliably extracted from the speech signal? Third, what is the behavior of the subword units when there are speech recognition errors and how well do they perform? And fourth, how can the indexing and retrieval methods be modified to take into account the fact that the speech recognition output will be errorful?

We first explore a range of subword units of varying complexity derived from error-free phonetic transcriptions and measure their ability to effectively index and retrieve speech messages. We find that many subword units capture enough information to perform effective retrieval and that it is possible to achieve performance comparable to that of text-based word units. Next, we develop a phonetic speech recognizer and process the spoken document collection to generate phonetic transcriptions. We then measure the ability of subword units derived from these transcriptions to perform spoken document retrieval and examine the effects of recognition errors on retrieval performance. Retrieval performance degrades for all subword units (to 60% of the clean reference), but remains reasonable for some subword units even without the use of any error compensation techniques. We then investigate a number of robust methods that take into account the characteristics of the recognition errors and try to compensate for them in an effort to improve spoken document retrieval performance when there are speech recognition errors. We study the methods individually and explore the effects of combining them. Using these robust methods improves retrieval performance by 23%. We also propose a novel approach to SDR where the speech recognition and information retrieval components are more tightly integrated. This is accomplished by developing new recognizer and retrieval models where the interface between the two components is better matched and the goals of the two components are consistent with each other and with the overall goal of the combine...



“…3) A k-means procedure is applied to cluster Gaussian mixture components into each node. In each iteration, when KL divergence is used, the mean and variance can be updated either by the ML approach [(10), (11)] or by the KL approach [(14), (15)]. Similarly, the ML or BH approach [(16), (17)] can be applied when Bhattacharyya distance is chosen.…”

Section: Tree Construction (mentioning)

confidence: 99%
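The two distance measures named in the statement above have well-known closed forms for Gaussians. The sketch below shows them for the diagonal-covariance case, as one might use when assigning mixture components to cluster nodes. These are the standard textbook expressions and are not necessarily identical to equations (10) to (17) of the citing paper, which are not reproduced here; the function names are hypothetical.

```python
import math

def kl_divergence(mean_p, var_p, mean_q, var_q):
    """KL(p || q) between two diagonal-covariance Gaussians."""
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        total += 0.5 * (math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0)
    return total

def bhattacharyya(mean_p, var_p, mean_q, var_q):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        v = 0.5 * (vp + vq)  # per-dimension average of the two variances
        total += 0.125 * (mp - mq) ** 2 / v
        total += 0.5 * math.log(v / math.sqrt(vp * vq))
    return total
```

KL divergence is asymmetric, so a clustering procedure has to fix which argument is the component and which is the node centroid; the Bhattacharyya distance is symmetric and avoids that choice.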

“…In [9] and [10], approaches based on tree-structured Gaussian densities were proposed to achieve computational efficiency in speech recognition. A tree structure with bottom-up clustering was also proposed in [11] for purpose of pruning the aggregated Gaussian models. In [12], a decision-tree technique was proposed to partition the feature space hierarchically.…”

mentioning

confidence: 99%

Efficient text-independent speaker verification with structural Gaussian mixture models and neural network

Xiang

Berger

2003

IEEE Trans. Speech Audio Process.


We present an integrated system with structural Gaussian mixture models (SGMMs) and a neural network for purposes of achieving both computational efficiency and high accuracy in text-independent speaker verification. A structural background model (SBM) is constructed first by hierarchically clustering all Gaussian mixture components in a universal background model (UBM). In this way the acoustic space is partitioned into multiple regions in different levels of resolution. For each target speaker, an SGMM can be generated through multilevel maximum a posteriori (MAP) adaptation from the SBM. During test, only a small subset of Gaussian mixture components are scored for each feature vector in order to reduce the computational cost significantly. Furthermore, the scores obtained in different layers of the tree-structured models are combined via a neural network for final decision. Different configurations are compared in the experiments conducted on the telephony speech data used in the NIST speaker verification evaluation. The experimental results show that computational reduction by a factor of 17 can be achieved with 5% relative reduction in equal error rate (EER) compared with the baseline. The SGMM-SBM also shows some advantages over the recently proposed hash GMM, including higher speed and better verification performance.

EDICS: 1-SPEA

Index Terms: Gaussian clustering, neural network, speaker verification, structural Gaussian mixture model.

I. INTRODUCTION

Research on speaker recognition [1], including identification and verification, has been an active area for several decades. The goal is to have a machine automatically identify a particular person or verify a person's claimed identity from his/her voice. As one of the techniques in biometrics, speaker recognition can be used in many access control applications, such as network security, phone transactions, room access, etc. The speakers are divided into two groups, the enrolled target speakers and the nontarget speakers or background speakers. Both identification and verification can be classified into text-independent and text-dependent applications based on whether or...


“…A5 consistently outperforms measurements A1-A4. Compared to the baseline, A5 exhibits improvement that is statistically sig- We have also implemented 4-fold model aggregation [16] for A5 obtaining 22.9% error rate on the Core Test set. We then combined this classifier with 8 other classifiers defined over 8 segmental features described in [14] obtaining an error rate of 18.5% on the same set, which is an improvement over the 18.7% obtained without the wavelet-based feature.…”

Section: Results (mentioning)

confidence: 99%

A Wavelet and Filter Bank Framework For Phonetic Classification

Choueiter

Glass

Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.


In this paper, we present a wavelet and filter bank framework for context-independent phonetic classification with the aim of extending the work towards a full speech recognition system. The framework addresses the limitations of the Fourier analysis stage commonly used for short-time spectral representation of speech signals. Also, previous research pertaining to wavelet analysis for speech processing mostly makes use of off-the-shelf wavelets and dyadic-based signal decomposition. Our framework provides more flexibility by taking advantage of the relationship between wavelet transforms and filter banks, and using two filter design techniques as well as 'rational' wavelets. On the standard 39-phone TIMIT classification task, we achieve 22.9% error rate on the Core Test set using rational filter banks and 4-fold aggregation. This is improved to 18.5% when combined with multiple classifiers defined over non-wavelet acoustic measurements.

