2021
DOI: 10.48550/arxiv.2105.01051
Preprint

SUPERB: Speech processing Universal PERformance Benchmark

Abstract: Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV). The paradigm pretrains a shared model on large volumes of unlabeled data and achieves state-of-the-art (SOTA) for various tasks with minimal adaptation. However, the speech processing community lacks a similar setup to systematically explore the paradigm. To bridge this gap, we introduce Speech processing Universal PERformance Benchmark (SUPERB). SUPERB is a leaderboard to bench…
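The paradigm the abstract describes, reusing one frozen SSL upstream model and training only a small per-task prediction head, can be illustrated with a minimal sketch. This is an assumption-laden illustration rather than the benchmark's reference implementation: it uses torchaudio's HUBERT_BASE bundle as a stand-in upstream (the SUPERB toolkit, s3prl, defines its own upstream/downstream interfaces), and the linear probe with mean-pooling below is a hypothetical simplification.

```python
# Minimal sketch of the SUPERB-style "frozen upstream + small downstream head" recipe.
# Assumptions: torchaudio's HUBERT_BASE bundle stands in for the SSL upstream; the
# linear head below is a hypothetical probe, not SUPERB's actual downstream models.
import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE           # pretrained SSL upstream
upstream = bundle.get_model().eval()                 # kept frozen: no gradient updates
for p in upstream.parameters():
    p.requires_grad = False

num_classes = 12                                     # e.g. keyword-spotting labels (illustrative)
head = torch.nn.Linear(768, num_classes)             # only this small head is trained

waveform = torch.randn(1, 16000)                     # 1 s of 16 kHz audio (dummy input)
with torch.no_grad():
    features, _ = upstream.extract_features(waveform)  # list of per-layer frame features
pooled = features[-1].mean(dim=1)                    # mean-pool the last layer over time
logits = head(pooled)                                # per-task prediction from the probe
```

SUPERB's actual downstream models additionally learn a weighted sum over the upstream's hidden layers, but the frozen-upstream principle is the same.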

Cited by 30 publications (38 citation statements)
References 38 publications
“…For the three benchmarks Speech Commands V1, VoxCeleb1, and IEMOCAP that the original AST has not been tested on, we use the standard SUPERB (Yang et al. 2021) training and testing framework. Specifically, we search the learning rate from 1e-5 to 1e-3 for our SSAST model and all baseline models and train the model for up to 20k, 40k, and 10k iterations for Speech Commands V2, VoxCeleb1, and IEMOCAP, respectively.…”
Section: Downstream Fine-tuning Details
confidence: 99%
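The learning-rate search quoted above is a plain log-scale grid. A hedged sketch of such a sweep is below; the five grid points, the 40k-iteration cap, and the train_and_evaluate stub are illustrative assumptions, not the cited paper's exact procedure.

```python
# Illustrative log-spaced learning-rate sweep over the 1e-5 to 1e-3 range quoted above.
# The grid size and the evaluation routine are assumptions, not the cited paper's setup.
import numpy as np

def train_and_evaluate(lr: float, max_iterations: int) -> float:
    """Hypothetical stand-in for fine-tuning a model and returning a dev-set score."""
    return -abs(np.log10(lr) + 4.0)  # dummy score; replace with a real training run

learning_rates = np.logspace(-5, -3, num=5)   # 1e-5, ~3.2e-5, 1e-4, ~3.2e-4, 1e-3
best_lr = max(learning_rates, key=lambda lr: train_and_evaluate(lr, max_iterations=40_000))
print(f"selected learning rate: {best_lr:.0e}")
```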
“…HuBERT is a self-supervised learning-based model trained with masked continuous audio signals. It shows superior performance across multiple tasks such as ASR, spoken language modeling, and speech synthesis [20]. For the HuBERT model, we use the k-means algorithm to cluster the extracted latent representations into clustering centers, while the VQ-VAE directly uses the nearest centroid in the VQ-codebook.…”
Section: Discretization Of Speech
confidence: 99%
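The k-means discretization described in that passage can be sketched as follows, assuming the HuBERT frame-level features have already been extracted into an array; the 100-unit codebook size and the use of scikit-learn are illustrative choices rather than the cited work's exact configuration.

```python
# Sketch of discretizing speech by clustering SSL frame features with k-means.
# Assumptions: `features` holds pre-extracted HuBERT frame vectors (here random
# placeholders); 100 clusters is an illustrative codebook size.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.standard_normal((5000, 768)).astype(np.float32)   # (frames, feature_dim) placeholder

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)

# Each frame is mapped to the index of its nearest cluster centre,
# yielding a sequence of discrete pseudo-units for the utterance.
utterance = rng.standard_normal((200, 768)).astype(np.float32)
units = kmeans.predict(utterance)                                 # shape (200,), integer unit IDs
```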
“…In preliminary experiments, VQ-VAE was used with the same architecture proposed by [25] for the VCTK corpus. Following that, inspired by the impressive performance of discrete HuBERT units [20], we also use HiFi-GAN to synthesize the speech signal given the HuBERT discrete symbols from the pre-trained model [29]. For all experiments, we use the HuBERT Large model trained on the Libri-Light [30] 60k-hour set, without any downstream fine-tuning, to extract the features.…”
Section: Discrete Vocoder
confidence: 99%
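For the frozen HuBERT Large feature extraction described above, a minimal sketch using torchaudio's HUBERT_LARGE bundle (pretrained on Libri-Light) could look like the following; reading features from the final layer and the placeholder audio path are assumptions, and the unit-based HiFi-GAN vocoder stage that would consume the discretized units is not shown.

```python
# Sketch of extracting frozen HuBERT Large features with no downstream fine-tuning.
# Assumptions: torchaudio's HUBERT_LARGE bundle as the pretrained model, the final
# transformer layer as the feature source, and a placeholder audio path.
import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_LARGE
model = bundle.get_model().eval()

waveform, sample_rate = torchaudio.load("utterance.wav")     # placeholder path
if sample_rate != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.no_grad():                                        # frozen model, inference only
    layer_features, _ = model.extract_features(waveform)
frame_features = layer_features[-1].squeeze(0)               # (num_frames, 1024) for the Large model
```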
“…With the popularization of self-supervised learning (SSL), the speech community has made significant progress towards reducing the amount of labelled data needed to reach strong levels of performance in a variety of tasks. To better measure the progress within this research area, two benchmarks, the ZeroSpeech 2021 challenge (Nguyen et al. 2020; Alishahi et al. 2021) and the SUPERB benchmark (Yang et al. 2021), were recently proposed.…”
Section: Introduction
confidence: 99%