2022
DOI: 10.48550/arxiv.2203.06849
Preprint
SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities

Cited by 4 publications (5 citation statements) · References 0 publications
“…Self-supervised representations: In our experiments, we employ HuBERT [27], which shows promising results over the SUPERB benchmark [10]. To fully explore the potential of HuBERT, we select the HuBERT-large model pre-trained over 60k hours of LibriLight [46,47].…”
Section: Methods (mentioning)
confidence: 99%
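The setup quoted above (HuBERT-large pre-trained on 60k hours of Libri-Light) is available as a public checkpoint; the sketch below shows one way to extract its frame-level representations, assuming the Hugging Face Transformers library and the facebook/hubert-large-ll60k checkpoint. It is a minimal illustration, not the citing paper's exact pipeline.

```python
# Minimal sketch: extracting frame-level features from HuBERT-large
# pre-trained on 60k hours of LibriLight, assuming the Hugging Face
# "facebook/hubert-large-ll60k" checkpoint (not the cited paper's code).
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-large-ll60k")
model = HubertModel.from_pretrained("facebook/hubert-large-ll60k")
model.eval()

waveform = torch.zeros(16000)  # one second of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # HuBERT-large produces 1024-dimensional features at ~50 frames/second.
    features = model(**inputs).last_hidden_state  # (batch, frames, 1024)
```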
“…The key idea is that unlabeled data contains valuable information and is far more abundant than labeled data in any domain. This paradigm leads to general-purpose speech representations suitable for speech processing tasks [10].…”
Section: Speech Representations (mentioning)
confidence: 99%
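The benchmark the quote points to evaluates such general-purpose representations by freezing the pre-trained upstream model and training only a lightweight task-specific head. Below is a minimal PyTorch sketch of that frozen-upstream pattern; the classification task, mean-pooling, linear head, and hyperparameters are illustrative placeholders, not SUPERB's actual downstream models.

```python
# Minimal sketch of the frozen-upstream evaluation pattern: only a small
# task head is trained on top of fixed self-supervised features.
import torch
import torch.nn as nn
from transformers import HubertModel

upstream = HubertModel.from_pretrained("facebook/hubert-large-ll60k")
for p in upstream.parameters():
    p.requires_grad = False  # the representation stays fixed across tasks

num_classes = 10  # hypothetical classification task (e.g., keyword-like)
head = nn.Linear(upstream.config.hidden_size, num_classes)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

waveform = torch.zeros(1, 16000)  # dummy 16 kHz input batch
labels = torch.tensor([3])        # dummy target label

with torch.no_grad():
    feats = upstream(waveform).last_hidden_state  # (1, frames, 1024)

logits = head(feats.mean(dim=1))  # mean-pool over frames, then classify
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                   # gradients reach only the head
optimizer.step()
```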
“…As mentioned earlier, there are many datasets available for English, such as Librispeech (Panayotov et al. 2015) and Common Voice (Ardila et al. 2020) for ASR, and VoxCeleb1 (Nagrani, Chung, and Zisserman 2017) and VoxCeleb2 (Chung, Nagrani, and Zisserman 2018) for speaker recognition. More recently, SUPERB (Yang et al. 2021b) and SUPERB-SG (Tsai et al. 2022) have been released and contain various speech language understanding and synthesis tasks. In contrast, there are very few datasets for Indian languages, as summarised in Table 1.…”
Section: Related Work (mentioning)
confidence: 99%
“…Some recent research has explored large-scale pre-training for speech synthesis tasks. For example, the SUPERB-SG [96] benchmark was introduced to evaluate pre-trained models on various tasks including speech enhancement and voice conversion. Prior work on pre-training generative models of speech has focused on learning representations for downstream classification tasks, rather than synthesis [97].…”
Section: Large-scale Pre-training With Speech Data (mentioning)
confidence: 99%