ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054224

Effectiveness of Self-Supervised Pre-Training for ASR

Abstract: We present pre-training approaches for self-supervised representation learning of speech data. A BERT-style masked language model loss on discrete features is compared with an InfoNCE-based contrastive loss on continuous speech features. The pre-trained models are then fine-tuned with a Connectionist Temporal Classification (CTC) loss to predict target character sequences. To study the impact of stacking multiple feature learning modules trained using different self-supervised loss functions, we test the discrete and …
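The abstract contrasts a BERT-style masked prediction loss on discrete units with an InfoNCE-based contrastive loss on continuous features. The snippet below is a minimal sketch of an InfoNCE-style objective, assuming PyTorch; the function name, the distractor-sampling interface, and the temperature value are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(context, targets, distractor_ids, temperature=0.1):
    """Contrastive (InfoNCE) loss sketch over continuous speech features.

    context:        (B, T, D) outputs of a context network at each time step.
    targets:        (B, T, D) latent features the model must identify.
    distractor_ids: (B, T, K) time indices of negatives sampled from the same
                    utterance (hypothetical sampling interface).
    """
    B, T, D = context.shape
    K = distractor_ids.shape[-1]

    # Gather K negative latents per (batch, time) position.
    negatives = torch.gather(
        targets.unsqueeze(2).expand(B, T, K, D),
        dim=1,
        index=distractor_ids.unsqueeze(-1).expand(B, T, K, D),
    )  # (B, T, K, D)

    # Candidate set = true target followed by K distractors.
    candidates = torch.cat([targets.unsqueeze(2), negatives], dim=2)  # (B, T, K+1, D)

    # Cosine similarity between each context vector and every candidate.
    logits = F.cosine_similarity(context.unsqueeze(2), candidates, dim=-1) / temperature

    # The positive is always candidate index 0.
    labels = torch.zeros(B, T, dtype=torch.long, device=context.device)
    return F.cross_entropy(logits.view(B * T, K + 1), labels.view(B * T))
```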

Cited by 129 publications (130 citation statements)
References 33 publications
“…LRSpeech aims for industrial deployment under two constraints: 1) extremely low data collection cost, and 2) high accuracy to satisfy the deployment requirement. For the first constraint, as in the extremely low-resource setting shown in Table 1, LRSpeech explores the limits of data requirements. Although we can crawl multi-speaker low-quality unpaired speech data from the web, it is hard to crawl single-speaker high-quality unpaired speech data. Therefore, it has the same collection cost (recorded by humans) as the single-speaker high-quality paired data.…”
Section: Our Methods (mentioning)
confidence: 99%
“…Optionally, unpaired speech and text data can be leveraged. • In the low-resource setting, the single-speaker high-quality paired data are reduced to dozens of minutes in TTS [2,12,23,31], while the multi-speaker low-quality paired data are reduced to dozens of hours in ASR [16,32,33,39], compared to that in the rich-resource setting. Additionally, they leverage unpaired speech and text data to ensure performance.…”
Section: Related Work (mentioning)
confidence: 99%
“…They compare the distance between segments of speech that belong to the same phoneme to the distance between segments of speech belonging to different phonemes. In the low-resource setting, features are viewed as pre-training and are evaluated on their ability to transfer to some downstream task like phone or word recognition [4,7,31,32,33]. Typically, studies belong either to one class or the other, making it difficult to know whether the two kinds of metric correlate.…”
Section: Related Work (mentioning)
confidence: 99%
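The excerpt above describes evaluating representations by comparing distances between speech segments of the same phoneme against segments of different phonemes. Below is a minimal NumPy sketch of that same-versus-different comparison; the function name and the pooled-embedding input format are assumptions for illustration, and this is a simplification of a full ABX-style evaluation.

```python
import numpy as np

def same_vs_different_phone_distance(embeddings, phone_labels):
    """Compare distances between segment embeddings that share a phoneme label
    against those that do not (simplified stand-in for ABX-style metrics).

    embeddings:   (N, D) array, one pooled embedding per speech segment.
    phone_labels: length-N sequence of phoneme labels.
    """
    # Normalise so that dot products equal cosine similarity.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cos_dist = 1.0 - x @ x.T                      # (N, N) cosine distances

    labels = np.asarray(phone_labels)
    same = labels[:, None] == labels[None, :]      # same-phoneme mask
    off_diag = ~np.eye(len(labels), dtype=bool)    # ignore self-comparisons

    same_mean = cos_dist[same & off_diag].mean()
    diff_mean = cos_dist[~same].mean()
    # Good features should give same_mean well below diff_mean.
    return same_mean, diff_mean
```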
“…Recently, self-supervised representation learning approaches have been proposed for ASR pre-training with competitive performance, especially with a very small amount of labeled data [16,17,19,20,21,22,18,23]. Given unlabeled audio data, self-supervised methods rely primarily on pretext pre-training tasks that acquire their labels either from the input signal itself (e.g., predicting future frames [20]), or via unsupervised means (e.g.…”
Section: Related Work (mentioning)
confidence: 99%
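This excerpt mentions pretext tasks whose labels come from the input signal itself, such as predicting future frames. The following is a hedged PyTorch sketch of a future-frame regression pretext task in that spirit; the module name, GRU encoder, frame shift, and L1 loss are illustrative assumptions rather than the cited papers' exact setup.

```python
import torch
import torch.nn as nn

class FutureFramePredictor(nn.Module):
    """Pretext-task sketch: predict the acoustic frame `shift` steps ahead from
    an autoregressive encoding of the past, so labels come from the signal itself."""

    def __init__(self, n_mels=80, hidden=512, shift=3):
        super().__init__()
        self.shift = shift
        self.encoder = nn.GRU(n_mels, hidden, num_layers=3, batch_first=True)
        self.head = nn.Linear(hidden, n_mels)

    def forward(self, feats):                        # feats: (B, T, n_mels)
        context, _ = self.encoder(feats)             # (B, T, hidden)
        pred = self.head(context[:, :-self.shift])   # predictions for frame t + shift
        target = feats[:, self.shift:]               # future frames serve as labels
        return nn.functional.l1_loss(pred, target)
```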