2022
DOI: 10.1109/jstsp.2022.3182537
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

Abstract: We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR an…


Cited by 79 publications (31 citation statements)
References 139 publications
“…The largest system, P7, achieved the WERs of 13.7% and 15.5% for the development and evaluation sets, respectively. To the best of our knowledge, these results represent the SOTA WERs for the AMI distant microphone setting by significantly outperforming previously reported results [10,25,36] while retaining the streaming inference capability.…”
Section: Evaluation Results (supporting)
confidence: 60%
“…The linguistic characteristics are also complex due to frequent turn-takings. Given these difficulties, most studies on DCSR have been conducted based on strong prerequisites such as the availability of utterance-level ground-truth segmentations (e.g., [9,10]) or offline inference (e.g., [11,12,13]). To advance the DCSR, innovations in both front-end signal processing and back-end ASR, as well as their efficient integration, would be needed.…”
Section: Introduction (mentioning)
confidence: 99%
“…Instead of proposing a new method, investigating the impact of data augmentation is worth studying. For example, reference [3] found that the key to building their self-supervised learning automatic speech recognition system was extremely large and diverse datasets. However, for speech emotion recognition, the available datasets are not as large as speech recognition datasets.…”
Section: Introduction (mentioning)
confidence: 99%
“…In this paper, we explored knowledge distillation for the RNN-T [7] model. RNN-T is widely used in large-scale ASR systems [8,9,10] and achieves state-of-the-art results on the LibriSpeech dataset [11,12,13]. NST training of RNN-T models was first studied in [6] using hard target distillation [4,14], where the student model is trained using pseudo labels generated by a teacher model.…”
Section: Introduction (mentioning)
confidence: 99%