The Zero Resource Speech Challenge 2021: Spoken Language Modelling

Dunbar, Ewan; Bernard, Mathieu; Hamilakis, Nicolas; Nguyen, Tu Anh; Seyssel, Maureen de; Rozé, Patricia; Rivière, Morgane; Kharitonov, Eugene; Dupoux, Emmanuel

doi:10.21437/interspeech.2021-1755

Cited by 23 publications

(15 citation statements)

References 0 publications


“…In machine learning, CPC has been shown to be powerful in a wide variety of modalities ranging from audio and images to natural language and reinforcement learning (25). In the ZeroSpeech 2021 international challenge on unsupervised representation learning, CPC was the best system to develop a perceptual space that accurately discriminates speech sounds (26). The key idea behind CPC is to predict the future states of a sequence given its past context.…”

Section: Attunement Specifically (mentioning)

confidence: 99%

Statistical learning models of early phonetic acquisition struggle with child-centered audio data

Lavechin, Seyssel, Métais et al. 2022

Preprint

Self Cite


Infants learn their native language(s) at an amazing speed. Before they even talk, their perception adapts to the language(s) they hear. However, the mechanisms responsible for this perceptual attunement still remain unclear and are at the heart of heated debates in psychology, linguistics, philosophy and neuroscience. The dominant explanation for this perceptual attunement posits that infants apply a domain-general learning mechanism consisting in learning statistical regularities from the speech stream they hear. Such a general learning mechanism has been proposed to account for perceptual attunement effects both in auditory and visual learning, and in both primates and non-primates. Other theories taking into account this perceptual attunement claim that infants are born with an innate specialized language learning device that would allow us to quickly and effortlessly learn from the language(s) we are exposed to. Critically, the feasibility of the purely domain-general statistical learning mechanism has only been demonstrated with computational models on unrealistic and simplified input. Here we propose to simulate early language acquisition from 2000 hours of ecological child-centered audio data in American English and Metropolitan French. We show that when applied on ecologically-valid data, generic learning mechanisms do develop a language-relevant perceptual space but fail to show evidence for perceptual attunement. It is only when supplemented with domain-specific audio filtering and augmentation mechanisms that computational models show a significant attunement to the language they have been exposed to. Hence, we conclude that, when learning from ecological audio, domain-specific mechanisms may be necessary to guide early language learning in the wild even if the learning itself is done through generic mechanisms. We anticipate our work to be a starting point for ecologically-valid computational models of perceptual attunement.

“…4) Results: The first round of submissions was documented in 2021 [14]; the best-performing systems were variants of our baseline system. A second round was opened as a NeurIPS 2021 challenge, including a visually-grounded training option.…”

Section: Task 4: Spoken LM (mentioning)

confidence: 99%

Self-Supervised Language Learning From Raw Audio: Lessons From the Zero Resource Speech Challenge

Dunbar, Hamilakis, Dupoux 2022

IEEE J. Sel. Top. Signal Process.

Self Cite


Recent progress in self-supervised or unsupervised machine learning has opened the possibility of building a full speech processing system from raw audio without using any textual representations or expert labels such as phonemes, dictionaries or parse trees. The contribution of the Zero Resource Speech Challenge series since 2015 has been to break down this long-term objective into four well-defined tasks (Acoustic Unit Discovery, Spoken Term Discovery, Discrete Resynthesis, and Spoken Language Modeling) and introduce associated metrics and benchmarks enabling model comparison and cumulative progress. We present an overview of the six editions of this challenge series since 2015, discuss the lessons learned, and outline the areas which need more work or give puzzling results.


“…Spoken Language Modeling. Following the huge success of language models on text data (Devlin et al., 2019; Radford et al., 2019; Brown et al., 2020), the Zero Resource Speech Challenge 2021 (Nguyen et al., 2020; Dunbar et al., 2021) opens up new possibilities for learning high-level language properties from raw audio without any text labels. They introduced 4 zero-shot evaluation metrics at different linguistic levels (phonetic, lexical, syntactic, semantic), along with composite baseline systems consisting of an acoustic discretization module (CPC+k-means) followed by a language model (BERT or LSTM) on the discretized units.…”

Section: Related Work (mentioning)

confidence: 99%

“…The approach in this work relies on transforming the audio into a sequence of discrete units (or pseudotext) and training a language model on the pseudotext. The trained models displayed better-than-chance performances on nearly all zero-shot evaluation metrics of the Zero Resource Challenge 2021 (Nguyen et al., 2020; Dunbar et al., 2021) on different linguistic levels. However, this paradigm creates a discrete bottleneck between a speech encoder and a language model which could be a potential source of error, and in addition requires multiple training phases (learning an acoustic representation, clustering it, and learning a language model).…”

Section: Introduction (mentioning)

confidence: 99%

Are discrete units necessary for Spoken Language Modeling?

Nguyen, Sagot, Dupoux 2022

Preprint

Self Cite


Recent work in spoken language modeling shows the possibility of learning a language unsupervisedly from raw audio without any text labels. The approach relies first on transforming the audio into a sequence of discrete units (or pseudo-text) and then training a language model directly on such pseudo-text. Is such a discrete bottleneck necessary, potentially introducing irreversible errors in the encoding of the speech signal, or could we learn a language model without discrete units at all? In this work, we show that discretization is indeed essential for good results in spoken language modeling, but that we can omit the discrete bottleneck if we use discrete target features from a higher level than the input features. We also show that an end-to-end model trained with discrete targets, like HuBERT, achieves results similar to those of the best language model trained on pseudo-text on a set of zero-shot spoken language modeling metrics from the Zero Resource Speech Challenge 2021.
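
The two-stage pipeline this abstract describes (discretize the audio into pseudo-text, then train a language model on it) can be illustrated with a toy sketch. This is not the authors' implementation: the hand-picked `centroids` stand in for k-means clusters learned over CPC or HuBERT features, the add-one smoothed bigram model stands in for the LSTM or BERT language model, and all function names are hypothetical.

```python
import math
from collections import Counter


def quantize(frames, centroids):
    """Map each feature frame to the index of its nearest centroid (squared Euclidean)."""
    units = []
    for frame in frames:
        dists = [
            sum((x - c) ** 2 for x, c in zip(frame, centroid))
            for centroid in centroids
        ]
        units.append(dists.index(min(dists)))
    return units


def train_bigram(sequences, vocab_size):
    """Count unit bigrams and return a function scoring a sequence by log-probability."""
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            pair_counts[(prev, nxt)] += 1
            context_counts[prev] += 1

    def log_prob(seq):
        # Add-one smoothing keeps unseen bigrams at a small non-zero probability.
        total = 0.0
        for prev, nxt in zip(seq, seq[1:]):
            num = pair_counts[(prev, nxt)] + 1
            den = context_counts[prev] + vocab_size
            total += math.log(num / den)
        return total

    return log_prob


# Assumed toy data: 2-dimensional "features" and three hand-picked centroids.
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 1.1), (0.1, 0.9), (0.0, 0.1)]
units = quantize(frames, centroids)  # the pseudo-text for this utterance

# Assumed toy pseudo-text corpus used to train the unit language model.
corpus = [[0, 1, 2, 0, 1, 2], [0, 1, 2, 0]]
log_prob = train_bigram(corpus, vocab_size=len(centroids))

seen = log_prob([0, 1, 2])    # unit order that occurs in the corpus
unseen = log_prob([2, 1, 0])  # unit order that never occurs
```

Comparing the scores of two candidate sequences, as with `seen` and `unseen` here, mirrors in spirit how zero-shot metrics can be computed from a language model without any text labels: the model is expected to assign the higher score to the more plausible item of a pair.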
