Ankita Pasad scite author profile

Many self-supervised speech models, varying in their pre-training objective, input modality, and pre-training data, have been proposed in the last few years. Despite impressive empirical successes on downstream tasks, we still have a limited understanding of the properties encoded by the models and the differences across models. In this work, we examine the intermediate representations for a variety of recent models. Specifically, we measure acoustic, phonetic, and word-level properties encoded in individual layers, using a lightweight analysis tool based on canonical correlation analysis (CCA). We find that these properties evolve across layers differently depending on the model, and the variations relate to the choice of pre-training objective. We further investigate the utility of our analyses for downstream tasks by comparing the property trends with performance on speech recognition and spoken language understanding tasks. We discover that CCA trends provide reliable guidance to choose layers of interest for downstream tasks and that single-layer performance often matches or improves upon using all layers, suggesting implications for more efficient use of pre-trained models.
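As a rough illustration of the kind of layer-wise CCA comparison the abstract describes, the sketch below scores several representation matrices against a target property using plain CCA computed with NumPy. It is not the authors' analysis tool: the function name `mean_cca_similarity`, the synthetic "layers", and the synthetic target are all assumptions made for illustration, and the abstract does not specify which CCA variant the authors use.

```python
import numpy as np


def mean_cca_similarity(X, Y):
    """Mean canonical correlation between two views of the same n samples.

    X: (n, d1) array, Y: (n, d2) array.
    Assumes n > d1, d2 and that both views have full column rank.
    """
    # Center each feature so correlations are not driven by the means.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered views.
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    # The singular values of Qx^T Qy are the canonical correlations.
    corrs = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(np.clip(corrs, 0.0, 1.0).mean())


rng = np.random.default_rng(0)
n_frames, dim = 2000, 8

# Hypothetical stand-in for a per-frame property (e.g. acoustic features).
target = rng.standard_normal((n_frames, dim))

# Hypothetical "layers": each mixes a linear transform of the target
# with unrelated noise, in different proportions.
mix_weights = [0.0, 0.5, 1.0]
layers = [
    w * target @ rng.standard_normal((dim, dim))
    + (1.0 - w) * rng.standard_normal((n_frames, dim))
    for w in mix_weights
]

# One similarity score per layer; the highest-scoring layer is the
# candidate "layer of interest" for this property.
scores = [mean_cca_similarity(layer, target) for layer in layers]
best_layer = int(np.argmax(scores))
```

In this toy setup the last synthetic layer is an exact linear transform of the target, so it should score highest, while the pure-noise layer should score near zero. With real model activations, the frames in each layer matrix would need to be aligned with the frames of the property being measured.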


SLUE: New Benchmark Tasks For Spoken Language Understanding Evaluation on Natural Speech

Shon, Pasad, Wu et al. 2022

Taskology: Utilizing Task Relations at Scale

Pirk, Dlabal et al. 2021

SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech

Shon, Pasad, Wu et al. 2021 (Preprint)

Layer-wise Analysis of a Self-supervised Speech Representation Model

Pasad, Chou, Livescu 2021 (Preprint)

Recently proposed self-supervised learning approaches have been successful for pre-training speech representation models. The utility of these learned representations has been observed empirically, but not much has been studied about the type or extent of information encoded in the pre-trained representations themselves. Developing such insights can help understand the capabilities and limits of these models and enable the research community to more efficiently develop their usage for downstream applications. In this work, we begin to fill this gap by examining one recent and successful pre-trained model (wav2vec 2.0), via its intermediate representation vectors, using a suite of analysis tools. We use the metrics of canonical correlation, mutual information, and performance on simple downstream tasks with non-parametric probes, in order to (i) query for acoustic and linguistic information content, (ii) characterize the evolution of information across model layers, and (iii) understand how fine-tuning the model for automatic speech recognition (ASR) affects these observations. Our findings motivate modifying the fine-tuning protocol for ASR, which produces improved word error rates in a low-resource setting.


