This paper advances the design of CTC-based all-neural (or end-to-end) speech recognizers. We propose a novel symbol inventory and a novel iterated-CTC method in which a second system is used to transform a noisy initial output into a cleaner version. We present a number of stabilization and initialization methods we have found useful in training these networks. We evaluate our system on the commonly used NIST 2000 conversational telephony test set, and significantly exceed the previously published performance of similar systems, both with and without the use of an external language model and decoding technology.
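The abstract gives no implementation details of iterated CTC; the following is a minimal sketch of the two-pass idea under stated assumptions: greedy decoding between passes, a blank index of 0, and hypothetical `first_model` / `second_model` modules (the second taking token IDs as input).

```python
import torch

BLANK = 0  # assumed index of the CTC blank symbol

def greedy_ctc_decode(logits: torch.Tensor) -> list[int]:
    """Standard CTC greedy decoding: take the argmax path over frames,
    collapse repeated symbols, and drop blanks."""
    path = logits.argmax(dim=-1).tolist()  # best symbol per frame, shape (T,)
    out, prev = [], BLANK
    for sym in path:
        if sym != BLANK and sym != prev:
            out.append(sym)
        prev = sym
    return out

def iterated_ctc(features: torch.Tensor, first_model, second_model) -> list[int]:
    """Two-pass sketch: the first CTC model emits a noisy hypothesis,
    which a second CTC-trained network maps to a cleaner transcript."""
    noisy = greedy_ctc_decode(first_model(features))
    refined_logits = second_model(torch.tensor(noisy, dtype=torch.long))
    return greedy_ctc_decode(refined_logits)
```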
In this work, we propose two improvements to attention-based sequence-to-sequence models for end-to-end speech recognition systems. For the first improvement, we propose an input-feeding architecture, which feeds not only the previous context vector but also the previous decoder hidden state as inputs to the decoder. The second improvement is a better hypothesis generation scheme for sequential minimum Bayes risk (MBR) training of sequence-to-sequence models, in which we introduce softmax smoothing into N-best generation during MBR training. We conduct experiments on both the Switchboard-300hrs and Switchboard+Fisher-2000hrs datasets and observe significant gains from both proposed improvements. Together with other training strategies such as dropout and scheduled sampling, our best model achieves WERs of 8.3%/15.5% on the Switchboard/CallHome subsets of Eval2000 without any external language model, which is highly competitive among state-of-the-art English conversational speech recognition systems.
Index Terms: attention-based sequence-to-sequence models, end-to-end speech recognition, sequential minimum Bayes risk training, MBR
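Softmax smoothing flattens the output distribution so that beam search surfaces more diverse N-best hypotheses for MBR training. A minimal sketch of the scoring change, assuming a smoothing factor applied to the logits (the function name and the value 0.8 are illustrative, not taken from the paper):

```python
import torch

def smoothed_log_probs(logits: torch.Tensor, gamma: float = 0.8) -> torch.Tensor:
    """Softmax smoothing for N-best generation: scale logits by gamma < 1
    before normalizing, flattening the distribution so beam search yields
    more diverse hypotheses during MBR training."""
    return torch.log_softmax(gamma * logits, dim=-1)
```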
In this work, we propose minimum Bayes risk (MBR) training of the RNN-Transducer (RNN-T) for end-to-end speech recognition. Specifically, initialized with an RNN-T-trained model, MBR training is conducted by minimizing the expected edit distance between the reference label sequence and on-the-fly generated N-best hypotheses. We also introduce a heuristic to incorporate an external neural network language model (NNLM) in RNN-T beam search decoding and explore MBR training with the external NNLM. Experimental results demonstrate that an MBR-trained model substantially outperforms an RNN-T-trained model, and that further improvements can be achieved when training with an external NNLM. Our best MBR-trained system achieves absolute character error rate (CER) reductions of 1.2% and 0.5% on read and spontaneous Mandarin speech, respectively, over a strong convolution- and transformer-based RNN-T baseline trained on ~21,000 hours of speech.
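The core objective described here is an expected edit distance over an N-best list. A minimal sketch under a common MBR approximation, assuming posteriors are renormalized over the N-best hypotheses and using the `editdistance` package for the Levenshtein risk (function and argument names are illustrative):

```python
import torch
import editdistance  # pip install editdistance

def mbr_loss(nbest_log_probs: torch.Tensor,
             nbest_tokens: list[list[int]],
             ref_tokens: list[int]) -> torch.Tensor:
    """Expected edit distance over an N-best list, with the model's
    sequence scores renormalized over the list."""
    posteriors = torch.softmax(nbest_log_probs, dim=0)  # shape (N,)
    # Risk of each hypothesis: edit distance to the reference labels.
    risks = torch.tensor(
        [editdistance.eval(hyp, ref_tokens) for hyp in nbest_tokens],
        dtype=posteriors.dtype,
    )
    # Expected risk under the renormalized posterior; minimized by training.
    return (posteriors * risks).sum()
```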