2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) 2015
DOI: 10.1109/asru.2015.7404851
Acoustic modelling with CD-CTC-sMBR LSTM RNNs

Cited by 36 publications
(6 citation statements)
References 8 publications
“…The traditional hybrid DNN/Hidden Markov Model (HMM) approach uses a neural network to produce a posterior distribution over tied HMM states [14,15] for each acoustic frame, usually followed by sequence-discriminative training to boost performance [16]. CTC [17] has become an alternative criterion to frame-level cross-entropy (CE) training or sequence-level lattice-free MMI (LF-MMI) training in recent years and has shown promising results [18][19][20][21][22]. Inspired by the rise of end-to-end training in machine translation, the encoder-decoder architecture was also introduced for ASR, e.g.…”
Section: Introduction
Mentioning confidence: 99%
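The CTC criterion mentioned in the excerpt above sums the probability of every blank-augmented frame alignment that collapses to the target label sequence. A minimal toy sketch of that forward recursion is below (a hypothetical illustration in plain Python, not code from the cited papers; `ctc_prob` and its arguments are names invented here):

```python
# Toy sketch of the CTC forward algorithm: computes P(labels | frames)
# by summing over all blank-augmented alignments with the standard
# alpha recursion. Purely illustrative; real systems work in log space.

def ctc_prob(frame_probs, labels, blank=0):
    """frame_probs: T x V per-frame output distributions.
    labels: target label ids without blanks."""
    # Extend the label sequence with blanks: b, l1, b, l2, b, ...
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S, T = len(ext), len(frame_probs)

    # alpha[s] = total probability of all partial alignments that
    # end at extended position s after the current frame.
    alpha = [0.0] * S
    alpha[0] = frame_probs[0][ext[0]]
    if S > 1:
        alpha[1] = frame_probs[0][ext[1]]

    for t in range(1, T):
        new = [0.0] * S
        for s in range(S):
            total = alpha[s]              # stay on the same symbol
            if s > 0:
                total += alpha[s - 1]     # advance by one position
            # Skip a blank, allowed only between distinct labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * frame_probs[t][ext[s]]
        alpha = new

    # Valid alignments end on the final label or the trailing blank.
    return alpha[S - 1] + (alpha[S - 2] if S > 1 else 0.0)
```

For two frames with a uniform distribution over {blank, "a"}, three of the four alignments ("a a", "a -", "- a") collapse to "a", so `ctc_prob([[0.5, 0.5], [0.5, 0.5]], [1])` returns 0.75, matching direct enumeration.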
“…As for end-to-end models for child ASR, Andrew et al (2015) show improvement on child speech with a CTC-based system jointly trained on very large quantities of mixed adult and child speech data. The use of seq2seq models for child speech recognition is a new research subject, as shown by the very recent technical reports on this matter (Ng et al, 2020; Chen et al, 2020).…”
Section: Related Work
Mentioning confidence: 99%
“…They do not necessarily lead to a minimized recognition error rate in LVCSR tasks. Therefore, many discriminative training methods such as MCE [25], maximum mutual information (MMI) [26,27], minimum phone error (MPE) [28], state-level minimum Bayes risk (sMBR) [29,30] and boosted MMI [31] have been proposed to further refine DNN [32] and LSTM [33] acoustic models. For the keyword spotting task based on LVCSR, our goal is to minimize the recognition error on the set of keywords, whereas the aforementioned methods focus on minimizing the recognition error rate over all possible words, which makes them unsuitable for the keyword spotting task.…”
Section: Non-uniform BMCE Training of Deep BLSTM Acoustic Model for K…
Mentioning confidence: 99%
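The sMBR-style criteria listed in the excerpt above all share one shape: instead of maximizing per-frame likelihood, they minimize the posterior-weighted expected error over competing hypotheses. A tiny sketch of that objective (a hypothetical illustration with invented names, not the cited papers' recipes, which operate on lattices of HMM states):

```python
# Toy sketch of a sequence-level expected-risk objective in the spirit
# of sMBR/MPE: given posterior weights over competing hypotheses and
# each hypothesis's error count against the reference, the training
# objective is the posterior-weighted expected error.

def expected_risk(hyp_posteriors, hyp_errors):
    """hyp_posteriors: model posteriors over N-best hypotheses (sum to 1).
    hyp_errors: error count of each hypothesis vs. the reference."""
    assert abs(sum(hyp_posteriors) - 1.0) < 1e-9, "posteriors must sum to 1"
    return sum(p * e for p, e in zip(hyp_posteriors, hyp_errors))
```

With posteriors [0.7, 0.2, 0.1] over three hypotheses with 0, 2, and 3 errors, the expected risk is 0.7; training lowers this value by shifting posterior mass toward low-error hypotheses, which is exactly the behavior the excerpt's authors restrict to a keyword set.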