Ching-Feng Yeh scite author profile

Domain Adversarial Training for Accented Speech Recognition

In this paper, we propose a domain adversarial training (DAT) algorithm to alleviate the accented speech recognition problem. To reduce the mismatch between labeled source-domain data ("standard" accent) and unlabeled target-domain data (with heavy accents), we augment the learning objective for a Kaldi TDNN network with a DAT objective that encourages the model to learn accent-invariant features. In experiments with three Mandarin accents, we show that DAT yields up to 7.45% relative character error rate reduction when we do not have transcriptions of the accented speech, compared with the baseline trained on standard-accent data only. We also find a benefit from DAT when it is combined with training on automatic transcriptions of the accented data. Furthermore, we find that DAT is superior to multi-task learning for accented speech recognition.

Index Terms: domain adaptation, accent-robust speech recognition, domain adversarial training

* Work performed as an intern at Mobvoi AI Lab and University of Washington.
† Lei Xie is the corresponding author.
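For readers unfamiliar with domain adversarial training, the objective is commonly written in the gradient-reversal form sketched below. This is the generic formulation from the domain-adaptation literature, not necessarily the exact loss used in this paper. All symbols are assumptions introduced here: \theta_f are the shared feature-extractor parameters, \theta_y the recognition-output parameters, \theta_d the domain (accent) classifier parameters, L_y the recognition loss on labeled source data, L_d the domain classification loss, and \lambda a trade-off weight.

```latex
E(\theta_f, \theta_y, \theta_d) = L_y(\theta_f, \theta_y) - \lambda \, L_d(\theta_f, \theta_d)

(\hat{\theta}_f, \hat{\theta}_y) = \arg\min_{\theta_f, \theta_y} E(\theta_f, \theta_y, \hat{\theta}_d),
\qquad
\hat{\theta}_d = \arg\max_{\theta_d} E(\hat{\theta}_f, \hat{\theta}_y, \theta_d)
```

Under this formulation the domain classifier learns to tell the accents apart, while the shared features are pushed to make that task harder, which is one way to obtain accent-invariant representations.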


Transformer-Transducer: End-to-End Speech Recognition with Self-Attention

Yeh, Mahadeokar, Kalgaonkar et al. 2019

Preprint


Alignment Restricted Streaming Recurrent Neural Network Transducer

Mahadeokar, Shangguan et al. 2021


Emformer: Efficient Memory Transformer Based Acoustic Model for Low Latency Streaming Speech Recognition

Shi, Wang et al. 2021


This paper proposes Emformer, an efficient memory transformer for low-latency streaming speech recognition. In Emformer, the long-range history context is distilled into an augmented memory bank to reduce the computational complexity of self-attention. A cache mechanism saves the key and value computation in self-attention for the left context. Emformer applies parallelized block processing in training to support low-latency models. We carry out experiments on the LibriSpeech benchmark. With an average latency of 960 ms, Emformer achieves a WER of 2.50% on test-clean and 5.62% on test-other. Compared with a strong baseline, the augmented memory transformer (AM-TRF), Emformer achieves a 4.6-fold training speedup and an 18% relative real-time factor (RTF) reduction in decoding, with relative WER reductions of 17% on test-clean and 9% on test-other. For a low-latency scenario with an average latency of 80 ms, Emformer achieves a WER of 3.01% on test-clean and 7.09% on test-other. Compared with the LSTM baseline with the same latency and model size, Emformer achieves relative WER reductions of 9% and 16% on test-clean and test-other, respectively.
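The abstract reports its gains as relative reductions, which are easy to confuse with absolute differences. The short sketch below shows how a relative reduction is usually computed. The function name and the example numbers are hypothetical and are not taken from the paper.

```python
def relative_reduction(baseline, improved):
    """Return the relative reduction of `improved` with respect to `baseline`.

    Both arguments are error rates (or any lower-is-better metric) in the
    same unit, e.g. WER in percent.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Hypothetical numbers for illustration only (not from the paper):
# a baseline WER of 10.0% improved to 9.0% is a 1.0-point absolute
# reduction but a 10% relative reduction.
print(f"{relative_reduction(10.0, 9.0):.1%}")  # prints "10.0%"
```

The same formula applies to the relative RTF reduction quoted in the abstract, since RTF is also a lower-is-better quantity.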


Streaming Transformer-Based Acoustic Models Using Self-Attention with Augmented Memory

Wang, Shi et al. 2020


Transformer-based acoustic modeling has achieved great success for both hybrid and sequence-to-sequence speech recognition. However, it requires access to the full sequence, and its computational cost grows quadratically with the input sequence length. These factors limit its adoption for streaming applications. In this work, we propose a novel augmented memory self-attention, which attends to a short segment of the input sequence and a bank of memories. The memory bank stores the embedding information for all the processed segments. On the LibriSpeech benchmark, our proposed method outperforms all existing streamable transformer methods by a large margin and achieves over 15% relative error reduction compared with the widely used LC-BLSTM baseline. Our findings are also confirmed on some large internal datasets.
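The idea of attending to a short segment plus a memory bank can be illustrated with a minimal pure-Python sketch. It makes several simplifying assumptions that are not from the paper: there are no learned query/key/value projections, each segment is summarized by a plain average of its frames, and there is no left or right context. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def mean_vector(frames):
    """Average the frames of a segment (an assumed, simplistic summary)."""
    count = len(frames)
    return [sum(f[i] for f in frames) / count for i in range(len(frames[0]))]


def augmented_memory_attention(frames, segment_len):
    """Process frames segment by segment.

    Each frame attends only to its own segment plus one memory vector per
    previously processed segment, instead of to the full sequence.
    """
    memory_bank = []
    outputs = []
    for start in range(0, len(frames), segment_len):
        segment = frames[start:start + segment_len]
        context = memory_bank + segment  # memories first, then current frames
        for frame in segment:
            outputs.append(attend(frame, context, context))
        memory_bank.append(mean_vector(segment))  # store the segment summary
    return outputs, memory_bank


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
outputs, memory_bank = augmented_memory_attention(frames, segment_len=2)
print(len(outputs), len(memory_bank))
```

In this sketch the attention context per frame is the segment length plus the number of stored memories, rather than the whole sequence, which is what makes segment-wise streaming possible.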


