2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221)
DOI: 10.1109/icassp.2001.940871
Investigating lightly supervised acoustic model training

Cited by 39 publications (31 citation statements)
References 6 publications
“…Initial Unlabeled Data These self-training methods have been studied in GMM-based acoustic models [76,77,48,146,105,153]. In recent studies [136,45,60,82], self-training methods are also used in DNN-based acoustic model training.…”
Section: Initial Labeled Data
confidence: 99%
“…Following the conventional self training [76,77,48,146,105,153] approach, we first train an initial DNN-HMM system using the training data, and decode on the test data. For data selection, we pick utterances with the highest average per-frame decoding likelihood and add them to the training data.…”
Section: Comparison To Self Training
confidence: 99%
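The self-training recipe quoted above selects utterances whose decoding hypotheses have the highest average per-frame likelihood and adds them to the training pool. A minimal sketch of that selection step, assuming decoded utterances are available as (utterance ID, total log-likelihood, frame count) tuples; all names and the selection fraction are illustrative, not from the cited papers:

```python
# Hypothetical sketch of confidence-based data selection for self-training:
# rank decoded utterances by average per-frame log-likelihood and keep the
# top fraction as additional (automatically transcribed) training data.

def select_utterances(decoded, top_fraction=0.5):
    """decoded: list of (utt_id, total_log_likelihood, num_frames) tuples.

    Returns the IDs of the most confidently decoded utterances, ranked by
    average per-frame log-likelihood (higher average = more confident).
    """
    scored = [(total_ll / max(num_frames, 1), utt_id)
              for utt_id, total_ll, num_frames in decoded]
    scored.sort(reverse=True)  # highest average likelihood first
    k = max(1, int(len(scored) * top_fraction))
    return [utt_id for _, utt_id in scored[:k]]

decoded = [
    ("utt1", -450.0, 100),   # average -4.5 per frame
    ("utt2", -1200.0, 200),  # average -6.0 per frame
    ("utt3", -300.0, 100),   # average -3.0 per frame
]
print(select_utterances(decoded, top_fraction=0.5))  # prints ['utt3']
```

Normalizing by frame count matters: without it, long utterances with mediocre per-frame scores would be ranked below short ones purely because of length.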
“…A fair amount of past research has been devoted to improving the acoustic models from un-transcribed speech [5,6,7,8,9], and to adapt language models trained from out-of-domain text to the task at hand. Such methods of improving the LVCSR performance, which subsequently improve KWS performance, are not a focus of this paper.…”
Section: Low-Resource Search
confidence: 99%
“…The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA, IARPA, DoD/ARL or the U.S. Government. …an LVCSR system - such as 10 hours of transcribed speech corresponding to about 100K words of transcribed text, and a pronunciation lexicon that covers the words in the training data - but accuracy is sufficiently low that considerable improvement in KWS performance is necessary before the system is usable for searching a speech collection. A fair amount of past research has been devoted to improving the acoustic models from un-transcribed speech [5,6,7,8,9], and to adapt language models trained from out-of-domain text to the task at hand. Such methods of improving the LVCSR performance, which subsequently improve KWS performance, are not a focus of this paper.…”
confidence: 99%
“…Several experiments have shown that it is possible to achieve reasonable performance using data with erroneous transcriptions [45,46,47]. But no significant work has been done to analyze why the training algorithms are robust to mislabeled transcriptions.…”
Section: Thesis Objective and Organization
confidence: 99%