Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2022
DOI: 10.18653/v1/2022.acl-long.580

SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities

Abstract: Transfer learning has proven to be crucial in advancing the state of speech and natural language processing research in recent years. In speech, a model pre-trained by self-supervised learning transfers remarkably well on multiple tasks. However, the lack of a consistent evaluation methodology is limiting towards a holistic understanding of the efficacy of such models. SUPERB was a step towards introducing a common benchmark to evaluate pretrained models across various speech tasks. In this paper, we introduce…

Cited by 38 publications (14 citation statements)
References 31 publications
“…Self-supervised models have become a nearly ubiquitous approach for learning speech representations and improving performance on downstream tasks [1][2][3][4][5], but our understanding of their properties and strategies for their use is still limited. Some recent work has begun developing an understanding of the extent and location of different acoustic and linguistic information encoded by these models [6][7][8][9][10], which in some cases has resulted in improved fine-tuning strategies [8,9].…”
Section: Introduction
confidence: 99%
“…xlsr53 is trained on spoken data from 53 languages. For the audio-visual models, avhubert, fastvgs and fastvgs+, we use the audio branch alone, as our analyses use only speech input. fastvgs's audio branch uses the 7 CNN layers and the first 8 transformer layers from w2v2-small, and the transformer layers are trained with a cross-modal contrastive loss along with the rest of the network.…”
Section: Introduction
confidence: 99%
“…The lip image sequence V_{1:T} and noisy speech A^n_{1:T} are fed into the AV-HuBERT; the representations from each layer of the transformer encoder are denoted as H_l, where 0 ≤ l ≤ N, and N is the number of layers. Inspired by [18,12], a trainable function w(·) is applied to the representations from all layers as follows:…”
Section: Audio-Visual SE Model
confidence: 99%
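The trainable function w(·) described in the quote above is typically realized as a softmax-normalized set of scalar weights, one per layer, whose weighted sum fuses the per-layer representations H_l into a single feature. The quote does not specify w(·)'s exact form, so the sketch below assumes this common layer-weighting formulation (function names and shapes are illustrative, not from the paper):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D weight vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def weighted_layer_sum(layer_reps, layer_logits):
    """Fuse per-layer representations H_l (l = 0..N) into one feature:
    sum_l softmax(a)_l * H_l, where a are trainable scalar logits.

    layer_reps:   array of shape (N+1, T, D) -- one (T, D) map per layer
    layer_logits: array of shape (N+1,)      -- learned alongside the model
    """
    w = softmax(layer_logits)  # normalized weights, sum to 1
    # contract the layer axis: result has shape (T, D)
    return np.tensordot(w, layer_reps, axes=1)

# toy example: 4 layers, 5 frames, 8-dim features
rng = np.random.default_rng(0)
H = rng.standard_normal((4, 5, 8))
a = np.zeros(4)  # uniform weights at initialization
fused = weighted_layer_sum(H, a)
print(fused.shape)  # (5, 8)
```

With zero-initialized logits the weights are uniform, so the fused feature starts as the plain mean over layers; training then learns which layers matter most for the downstream task.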
“…While not as expansive in terms of task evaluations as those available in text processing, to provide a more robust measure of speech processing performance, the Speech processing Universal PERformance Benchmark (SUPERB) was released in 2021 containing 10 tasks such as speaker identification, keyword spotting, speaker diarization (separating speakers in a single audio stream), and speech recognition (Yang et al., 2021). This benchmark was extended by SUPERB‐SG in 2022 with increased diversity and difficulty of tasks such as speech translation, voice conversion (converting speech from an arbitrary speaker into a target speaker such as a celebrity), and speech enhancement (Tsai et al., 2022). While human performance is hard to measure on some of these tasks (after all, not many people can convincingly imitate an arbitrary target speaker), doing well across such diverse tasks pushes models to excel at speech processing in general, which is the ultimate goal for AI.…”
confidence: 99%