2022 · Preprint
DOI: 10.48550/arxiv.2205.10643

Self-Supervised Speech Representation Learning: A Review

Cited by 13 publications (14 citation statements) · References 191 publications
“…Recently, large deep artificial neural network models pre-trained on a massive amount of unlabelled waveform features (e.g. [2, 10, 25]), have demonstrated strong generalisation abilities to ASR and many para-linguistic speech tasks [41]. It would be useful to apply our methods used in this paper to study similar types of models and tasks.…”
Section: Discussion (mentioning)
confidence: 99%
“…SSL makes use of the data's underlying structure. In SSL classification systems, the model is first pre-trained on some pre-auxiliary task to capture rich embeddings from the innate structure of the data [4,8,16,23]. These embeddings are then used for other downstream classification tasks.…”
Section: Self-supervised Framework (mentioning)
confidence: 99%
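The two-stage recipe this excerpt describes (pre-train an encoder on a pretext task over unlabelled data, then reuse its embeddings for a downstream classifier) can be illustrated with a minimal sketch. This assumes a PyTorch-style setup; the GRU encoder, the reconstruction pretext loss, and all dimensions are illustrative placeholders rather than the cited papers' actual models.

```python
# Minimal sketch of the generic SSL recipe: pretext pre-training on
# unlabelled features, then frozen embeddings for a downstream task.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a sequence of acoustic frames to frame-level embeddings."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)

    def forward(self, x):                  # x: (batch, time, feat_dim)
        out, _ = self.rnn(x)
        return out                         # (batch, time, hidden)

# --- Stage 1: pretext task on unlabelled data (here: input reconstruction) ---
encoder = Encoder()
decoder = nn.Linear(256, 80)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

unlabelled = torch.randn(8, 100, 80)       # stand-in for unlabelled features
recon = decoder(encoder(unlabelled))
pretext_loss = nn.functional.mse_loss(recon, unlabelled)
pretext_loss.backward()
opt.step()

# --- Stage 2: downstream classification reuses the (frozen) embeddings ---
for p in encoder.parameters():
    p.requires_grad = False
classifier = nn.Linear(256, 10)            # e.g. a 10-way downstream label set
labelled = torch.randn(4, 100, 80)
targets = torch.randint(0, 10, (4,))
emb = encoder(labelled).mean(dim=1)        # pool embeddings over time
downstream_loss = nn.functional.cross_entropy(classifier(emb), targets)
```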
“…This paradigm is contrasted with the use case of incremental updates to a pre-trained ASR model presented in this work. A comprehensive survey of such methods for speech representation learning is given in [45]. The upstream model is trained with a pretext task such as a generative approach to predict or reconstruct the input given a limited view (e.g. past data, masking), such as autoregressive predictive coding [12].…”
Section: Related Work (mentioning)
confidence: 99%
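The autoregressive predictive coding pretext task mentioned at the end of this excerpt amounts to encoding the past frames and regressing a frame a few steps ahead. A minimal sketch follows, assuming PyTorch and the usual APC-style L1 regression objective; the 3-frame shift and model sizes are illustrative assumptions, not values from the cited work.

```python
# Sketch of an autoregressive predictive coding (APC) style pretext task:
# predict the frame `shift` steps ahead from the representation of the past.
import torch
import torch.nn as nn

class APC(nn.Module):
    def __init__(self, feat_dim=80, hidden=512, shift=3):
        super().__init__()
        self.shift = shift
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, feat_dim)

    def forward(self, x):                        # x: (batch, time, feat_dim)
        h, _ = self.rnn(x)
        pred = self.proj(h)
        # align predictions at frame t with targets at frame t + shift
        return pred[:, :-self.shift], x[:, self.shift:]

model = APC()
frames = torch.randn(4, 200, 80)                 # unlabelled log-mel features
pred, target = model(frames)
loss = nn.functional.l1_loss(pred, target)       # L1 regression on future frames
loss.backward()
```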