ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053569
Multi-Task Self-Supervised Learning for Robust Speech Recognition

Abstract: Despite the growing interest in unsupervised learning, extracting meaningful knowledge from unlabelled audio remains an open challenge. To take a step in this direction, we recently proposed a problem-agnostic speech encoder (PASE), which combines a convolutional encoder followed by multiple neural networks, called workers, tasked to solve self-supervised problems (i.e., ones that do not require manual annotations as ground truth). PASE was shown to capture relevant speech information, including speaker voice-p…
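The encoder-plus-workers setup described in the abstract can be sketched in a few lines of NumPy. This is a toy illustration under loose assumptions, not the actual PASE architecture: the dimensions, the single-layer "encoder", and the two self-supervised targets (frame reconstruction and log-energy prediction) are all hypothetical stand-ins for PASE's convolutional encoder and its real workers.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(wave, W):
    # Toy stand-in for PASE's convolutional encoder: a single
    # linear projection of fixed-length frames to an embedding.
    return np.tanh(wave @ W)

def worker_loss(z, V, target):
    # Each "worker" is a small regressor on top of the shared
    # embedding; its loss is a mean-squared error against a
    # self-supervised target derived from the audio itself.
    pred = z @ V
    return float(np.mean((pred - target) ** 2))

# Hypothetical dimensions, not those of the actual PASE model.
frame_len, emb_dim = 160, 32
wave = rng.standard_normal((8, frame_len))          # 8 audio frames
W = rng.standard_normal((frame_len, emb_dim)) * 0.05

z = encoder(wave, W)

# Two illustrative self-supervised targets: reconstruct the raw
# frame, and predict its log-energy (no manual labels needed).
targets = {
    "waveform": wave,
    "log_energy": np.log(np.sum(wave ** 2, axis=1, keepdims=True)),
}
workers = {name: rng.standard_normal((emb_dim, t.shape[1])) * 0.05
           for name, t in targets.items()}

# The multi-task objective is simply the sum of all worker losses;
# gradients from every worker would flow into the shared encoder.
total_loss = sum(worker_loss(z, workers[n], targets[n]) for n in targets)
print(round(total_loss, 3))
```

Because every worker back-propagates through the same encoder, the shared representation is pushed to retain whatever information each self-supervised task needs, which is the core idea behind the multi-task design.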

Citations: Cited by 223 publications (155 citation statements).
References: 24 publications (33 reference statements).
“…Further work will determine whether the superiority of time-domain noise augmentation over spectral ones is specific to the CPC loss or to the fact that our architecture starts directly from the waveform, as opposed to using spectral features like Mel filterbanks or MFCCs. Note that [18] also combines several data augmentation techniques for unsupervised learning in an autoencoder architecture. Among the data augmentation techniques they use, the most prominent are two time-domain ones (reverberation and additive noise) and one spectral one (band reject).…”
Section: Discussion (mentioning)
confidence: 99%
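The three augmentations named in the statement above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the cited papers: the synthetic impulse response, the SNR mixing rule, and the FFT-domain band reject are all simplified stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(wave, snr_db, rng):
    # Time-domain augmentation: mix in white noise at a target SNR.
    noise = rng.standard_normal(wave.shape)
    sig_p, noise_p = np.mean(wave ** 2), np.mean(noise ** 2)
    scale = np.sqrt(sig_p / (noise_p * 10 ** (snr_db / 10)))
    return wave + scale * noise

def reverberate(wave, ir):
    # Time-domain augmentation: convolve with a room impulse
    # response (here synthetic), truncated to the original length.
    return np.convolve(wave, ir)[: len(wave)]

def band_reject(wave, lo, hi, sr):
    # Spectral augmentation: zero out a frequency band in the
    # FFT domain, then transform back to the waveform.
    spec = np.fft.rfft(wave)
    freqs = np.fft.rfftfreq(len(wave), d=1.0 / sr)
    spec[(freqs >= lo) & (freqs <= hi)] = 0.0
    return np.fft.irfft(spec, n=len(wave))

sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t)          # 1 s of a 440 Hz tone

ir = np.exp(-np.linspace(0, 8, 800)) * rng.standard_normal(800)
noisy = add_noise(wave, snr_db=10, rng=rng)
reverbed = reverberate(wave, ir)
rejected = band_reject(wave, lo=300, hi=600, sr=sr)

# Band-rejecting 300-600 Hz removes almost all of the 440 Hz tone.
print(np.mean(rejected ** 2) < 0.01 * np.mean(wave ** 2))  # True
```

The contrast the quote draws is visible here: the first two transforms act directly on the waveform, while band reject requires a round trip through the frequency domain.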
“…Our work is close to [18], which applies data augmentation techniques to representation learning (autoencoders). However, they evaluated these techniques in terms of pretraining for a downstream task, not in terms of the learned representation itself.…”
Section: Related Work (mentioning)
confidence: 99%
“…Such deep generative models offer different ways of addressing the problem of adaptation including powerful approaches to data augmentation, and the development of rich adaptation algorithms building on a base model with a joint distribution over acoustics and symbols. This offers the possibility of finetuning general encoders to specific acoustic domains, and adapting the decoder to specific tasks (such as speech recognition, speaker identification, language recognition, or emotion recognition), noting that classic adaptation to speakers can bring further gains [327], [328].…”
Section: Summary and Discussion (mentioning)
confidence: 99%
“…Self-supervised learning was first adopted within the computer vision community to learn representations by solving various auxiliary tasks, such as colorizing grayscale images or solving puzzles from image patches. Self-supervised learning has also been applied successfully in language modeling, leading to models like BERT [25]. In this setting, a self-supervised model scales well with data, since it can be trained on unlabeled data.…”
Section: The Self-Supervised Sleep Recognition Model (mentioning)
confidence: 99%