Generating exact lattices in the WFST framework

Povey, Daniel; Hannemann, Mirko; Boulianne, Gilles; Burget, Lukáš; Ghoshal, Arnab; Janda, Miloš; Karafiát, Martin; Kombrink, Stefan; Motlíček, Petr; Ye, Qian; Riedhammer, Korbinian; Veselý, Karel; Vu, Ngoc Thang

doi:10.1109/icassp.2012.6288848

Cited by 119 publications

(115 citation statements)

References 3 publications

(7 reference statements)

Citation types: Supporting, Mentioning (110), Contrasting


“…An overview about acoustic models based on deep neural networks can be found in [57,55]. However, in this thesis we employ the traditional HMM acoustic models with more recent techniques that Kaldi [105] speech recognition toolkit provides.…”

Section: Acoustic Model (mentioning)

confidence: 99%

“…The baseline ASR system is built by using the Kaldi [105] speech recognition toolkit. The language model that the baseline system uses is the baseline tri-gram back-off model for 20K open vocabulary for non-verbalized punctuation that is also available in the corpus.…”

Section: ASR Baseline (mentioning)

confidence: 99%

“…The ASR baseline for LUNA HH is constructed by using the Kaldi [105] speech recognition toolkit. The ASR uses mel-frequency cepstral coefficients (MFCC) that are transformed by linear discriminant analysis (LDA) and maximum likelihood linear transform (MLLT).…”

Section: ASR Baseline (mentioning)

confidence: 99%


Semantic language models with deep neural networks

Bayer, Riccardi. 2016. Computer Speech & Language.


Spoken language systems (SLS) communicate with users in natural language through speech. There are two main problems related to processing the spoken input in SLS. The first one is automatic speech recognition (ASR), which recognizes what the user says. The second one is spoken language understanding (SLU), which understands what the user means. We focus on the language model (LM) component of SLS. LMs constrain the search space that is used in the search for the best hypothesis. Therefore, they play a crucial role in the performance of SLS. It has long been discussed that an improvement in the recognition performance does not necessarily yield a better understanding performance. Therefore, optimization of LMs for the understanding performance is crucial. In addition, long-range dependencies in languages are hard to handle with statistical language models. These two problems are addressed in this thesis. We investigate two different LM structures. The first LM that we investigate enables SLS to understand better what they recognize by searching the ASR hypotheses for the best understanding performance. We refer to these models as joint LMs. They use lexical and semantic units jointly in the LM. The second LM structure uses the semantic context of an utterance, which can also be described as "what the system understands", to search for a better hypothesis that improves the recognition and the understanding performance. We refer to these models as semantic LMs (SELMs). SELMs use features that are based on a well-established theory of lexical semantics, namely the theory of frame semantics. They incorporate the semantic features which are extracted from the ASR hypothesis into the LM and handle long-range dependencies by using the semantic relationships between words and semantic context. ASR noise is propagated to the semantic features; to suppress this noise, we introduce the use of deep semantic encodings for semantic feature extraction. In this way, SELMs optimize both the recognition and the understanding performance.


“…Added noise sources are typically non-stationary (e.g., other speakers' utterances, home noises, or music). We used Kaldi toolkit [30] for the experiments.…”

Section: Task Description (mentioning)

confidence: 99%

Prior-based Binary Masking and Discriminative Methods for Reverberant and Noisy Speech Recognition Using Distant Stereo Microphones

Tachioka, Watanabe, Roux, et al. 2017. Journal of Information Processing.


Reverberant and noisy automatic speech recognition (ASR) using distant stereo microphones is a very challenging but desirable scenario for home-environment speech applications. This scenario can often provide prior knowledge such as physical information about the sound sources and the environment in advance, which may then be used to reduce the influence of the interference. We propose a method to enhance the binary masking algorithm by using prior distributions of the time difference of arrival. This paper also validates state-of-the-art ASR techniques that include various discriminative training and feature transformation methods. Furthermore, we develop an efficient method to combine discriminative language modeling and minimum Bayes risk decoding in the ASR post-processing stage. We also investigate the effectiveness of this method when used for reverberated and noisy ASR of deep neural networks (DNNs) as well as when used in systems that combine multiple DNNs using different features. Experiments on the medium vocabulary sub-task of the second CHiME challenge show that the system submitted to the challenge achieved a 26.86% word error rate (WER); moreover, the DNN system with the discriminative training, speaker adaptation and system combination achieves a 20.40% WER.


“…The Kaldi decoder generates word lattices [17] for the eval data using the GMM+SAT, SGMM and SGMM+BMMI models. The decoding lexicon is varied systematically, from the low resource lexicon of 5.7K words (8.9K pronunciations), through automatically augmented lexicons of three different sizes, to the full Babel reference lexicon of 23K words (35K pronunciations).…”

Section: Kaldi-based LVCSR System Description (mentioning)

confidence: 99%

Quantifying the value of pronunciation lexicons for keyword search in low resource languages

Chen, Khudanpur, Povey, et al. 2013. 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

Self Cite


This paper quantifies the value of pronunciation lexicons in large vocabulary continuous speech recognition (LVCSR) systems that support keyword search (KWS) in low resource languages. State-of-the-art LVCSR and KWS systems are developed for conversational telephone speech in Tagalog, and the baseline lexicon is augmented via three different grapheme-to-phoneme models that yield increasing coverage of a large Tagalog word-list. It is demonstrated that while the increased lexical coverage - or reduced out-of-vocabulary (OOV) rate - leads to only modest (ca 1%-4%) improvements in word error rate, the concomitant improvements in actual term weighted value are as much as 60%. It is also shown that incorporating the augmented lexicons into the LVCSR system before indexing speech is superior to using them post facto, e.g., for approximate phonetic matching of OOV keywords in pre-indexed lattices. These results underscore the disproportionate importance of automatic lexicon augmentation for KWS in morphologically rich languages, and advocate for using them early in the LVCSR stage.

Index Terms: Speech Recognition, Keyword Search, Information Retrieval, Morphology, Speech Synthesis

LOW-RESOURCE KEYWORD SEARCH

Thanks in part to the falling costs of storage and transmission, large volumes of speech such as oral history archives [1, 2] and on-line lectures [3, 4] are now easily accessible by large user populations via the world wide web. Unlike the text-web, however, searching speech using keywords continues to be a challenging problem. Manually transcribing the speech is often prohibitively expensive. Automatic keyword search (KWS) systems are able to address the problem in some cases, but not in others, because high performance KWS systems, in turn, rely on underlying large vocabulary continuous speech recognition (LVCSR) systems that are also expensive to develop. Good LVCSR systems utilize statistical acoustic and language models trained from large quantities of transcribed speech and "conversational" text in the search domain, and manually crafted pronunciation lexicons with good coverage of the collection.

We are interested in improving KWS performance in a low resource setting, i.e. where some resources are available to develop an LVCSR system - such as 10 hours of transcribed speech corresponding to about 100K words of transcribed text, and a pronunciation lexicon that covers the words in the training data - but accuracy is sufficiently low that considerable improvement in K...

The authors, listed here in alphabetical order, were supported by DARPA BOLT contract No. HR0011-12-C-0015, and IARPA BABEL contract No. W911NF-12-C-0015. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA, IARPA, DoD/ARL or the U.S. Government.
