Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022
DOI: 10.18653/v1/2022.acl-long.580

SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities

Abstract: Transfer learning has proven to be crucial in advancing the state of speech and natural language processing research in recent years. In speech, a model pre-trained by self-supervised learning transfers remarkably well on multiple tasks. However, the lack of a consistent evaluation methodology is limiting towards a holistic understanding of the efficacy of such models. SUPERB was a step towards introducing a common benchmark to evaluate pretrained models across various speech tasks. In this paper, we introduce…

Cited by 38 publications (14 citation statements)
References 31 publications
“…Self-supervised models have become a nearly ubiquitous approach for learning speech representations and improving performance on downstream tasks [1][2][3][4][5], but our understanding of their properties and strategies for their use is still limited. Some recent work has begun developing an understanding of the extent and location of different acoustic and linguistic information encoded by these models [6][7][8][9][10], which in some cases has resulted in improved fine-tuning strategies [8,9].…”
Section: Introduction
confidence: 99%
“…xlsr53 is trained on spoken data from 53 languages. For the audio-visual models, avhubert, fastvgs and fastvgs+, we use the audio branch alone, as our analyses use only speech input. fastvgs's audio branch uses the 7 CNN layers and the first 8 transformer layers from w2v2-small, and the transformer layers are trained with a cross-modal contrastive loss along with the rest of the network.…”
Section: Introduction
confidence: 99%
“…The lip image sequence V_{1:T} and noisy speech A^n_{1:T} are fed into the AV-HuBERT; the representations from each layer of the transformer encoder are denoted as H_l, where 0 ≤ l ≤ N, and N is the number of layers. Inspired by [18,12], a trainable function w(·) is applied to the representations from all layers as follows:…”
Section: Audio-Visual SE Model
confidence: 99%
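The trainable function w(·) described in the quote above is typically realized as a softmax-normalized set of scalar weights, one per layer, whose weighted sum fuses the per-layer representations H_l into a single feature. The quote does not specify w(·)'s exact form, so the sketch below assumes this common layer-weighting formulation (function names and shapes are illustrative, not from the paper):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D weight vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def weighted_layer_sum(layer_reps, layer_logits):
    """Fuse per-layer representations H_l (l = 0..N) into one feature:
    sum_l softmax(a)_l * H_l, where a are trainable scalar logits.

    layer_reps:   array of shape (N+1, T, D) -- one (T, D) map per layer
    layer_logits: array of shape (N+1,)      -- learned alongside the model
    """
    w = softmax(layer_logits)  # normalized weights, sum to 1
    # contract the layer axis: result has shape (T, D)
    return np.tensordot(w, layer_reps, axes=1)

# toy example: 4 layers, 5 frames, 8-dim features
rng = np.random.default_rng(0)
H = rng.standard_normal((4, 5, 8))
a = np.zeros(4)  # uniform weights at initialization
fused = weighted_layer_sum(H, a)
print(fused.shape)  # (5, 8)
```

With zero-initialized logits the weights are uniform, so the fused feature starts as the plain mean over layers; training then learns which layers matter most for the downstream task.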
“…While not as expansive in terms of task evaluations as those available in text processing, to provide a more robust measure of speech processing performance, the Speech processing Universal PERformance Benchmark (SUPERB) was released in 2021 containing 10 tasks such as speaker identification, keyword spotting, speaker diarization (separating speakers in a single audio stream), and speech recognition (Yang et al., 2021). This benchmark was extended by SUPERB‐SG in 2022 with increased diversity and difficulty of tasks such as speech translation, voice conversion (converting speech from an arbitrary speaker into a target speaker such as a celebrity), and speech enhancement (Tsai et al., 2022). While human performance is hard to measure on some of these tasks (after all, not many people can convincingly imitate an arbitrary target speaker), doing well across such diverse tasks pushes models to excel at speech processing in general, which is the ultimate goal for AI.…”
confidence: 99%