2022
DOI: 10.1109/jstsp.2022.3182537
BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

Abstract: We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR an…


Cited by 79 publications (31 citation statements)
References 139 publications
“…The largest system, P7, achieved the WERs of 13.7% and 15.5% for the development and evaluation sets, respectively. To the best of our knowledge, these results represent the SOTA WERs for the AMI distant microphone setting by significantly outperforming previously reported results [10,25,36] while retaining the streaming inference capability.…”
Section: Evaluation Results (supporting)
confidence: 60%
“…The linguistic characteristics are also complex due to frequent turn-takings. Given these difficulties, most studies on DCSR have been conducted based on strong prerequisites such as the availability of utterance-level ground-truth segmentations (e.g., [9,10]) or offline inference (e.g., [11,12,13]). To advance the DCSR, innovations in both front-end signal processing and back-end ASR, as well as their efficient integration, would be needed.…”
Section: Introduction (mentioning)
confidence: 99%
“…Instead of proposing a new method, investigating the impact of data augmentation is worth studying. For example, reference [3] found that the key to building their self-supervised learning automatic speech recognition system was extremely large and diverse datasets. However, for speech emotion recognition, the available datasets are not as large as speech recognition datasets.…”
Section: Introduction (mentioning)
confidence: 99%
“…In this paper, we explored knowledge distillation for the RNN-T [7] model. RNN-T is widely used in large-scale ASR systems [8,9,10] and achieves state-of-the-art results on the LibriSpeech dataset [11,12,13]. NST training of RNN-T models was first studied in [6] using hard target distillation [4,14], where the student model is trained using pseudo labels generated by a teacher model.…”
Section: Introduction (mentioning)
confidence: 99%