Self-Supervised Language Learning From Raw Audio: Lessons From the Zero Resource Speech Challenge

Dunbar, Ewan; Hamilakis, Nicolas; Dupoux, Emmanuel

doi:10.1109/jstsp.2022.3206084

Cited by 10 publications

(8 citation statements)

References 117 publications

(75 reference statements)


“…We then show that these segments can be clustered across a speech corpus to perform syllable discovery, enabling tokenization of the speech signal at the level of syllable-like units. Finally, we also show surprising results where our model trained only on English speech is able to perform zero-shot segmentation of syllables on another language (Estonian) and words in multiple non-English languages, in several cases outperforming the state-of-the-art models on the Zerospeech challenge [13].…”

Section: Introduction (mentioning)

confidence: 75%

“…Spoken term discovery - inferring the temporal boundary and identity of words and short phrases from untranscribed speech audio data - has been an important research direction in Zero-resource speech processing [13]. The earliest work that tackles spoken term discovery date back to at least the segmental dynamic programming algorithm proposed by Park and Glass [14].…”

Section: Related Work (mentioning)

confidence: 99%

“…To evaluate segmentation performance, we use precision, recall, F1 and R-value [51,23]. For the calculation of above metrics, we use a tolerance window of 50ms for SpokenCOCO and Estonian following [17], and 30ms for the Zerospeech Challenge [13]. To evaluate the quality of our syllable clustering, we first match hypothesized syllable segments with the ground truth segments for each utterance.…”

Section: Implementation Details (mentioning)

confidence: 99%


Word Discovery in Visually Grounded, Self-Supervised Speech Models

Puyuan¹,

Harwath²

2022

Interspeech 2022


In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective. We demonstrate that a nearly identical model architecture (HuBERT) trained with a masked language modeling loss does not exhibit this same ability, suggesting that the visual grounding objective is responsible for the emergence of this phenomenon. We propose the use of a minimum cut algorithm to automatically predict syllable boundaries in speech, followed by a 2-stage clustering method to group identical syllables together. We show that our model not only outperforms a state-of-the-art syllabic segmentation method on the language it was trained on (English), but also generalizes in a zero-shot fashion to Estonian. Finally, we show that the same model is capable of zero-shot generalization for a word segmentation task on 4 other languages from the Zerospeech Challenge, in some cases beating the previous state-of-the-art.


“…Although this has remained an established paradigm for the study of word segmentation, in recent years the speech research community has made great advances in the area of -  . These studies aim to develop unsupervised methods that learn from raw speech audio only, pioneered in recent years by the Zero Resource Speech Challenge (ZRC) series (Dunbar et al, 2022).…”

Section: Segmenting From Raw Speech (mentioning)

confidence: 99%

“…The latter continue to perform significantly worse than the former - the top-performing model for the Zero Speech Challenge (ZRC) series segmentation task achieves a token F1-score of only 19.2 on the English portion of the TDE-17 test corpus, compared to 64.5 for the text-based topline system provided by the task. In an effort to explain this gap in performance, Dunbar et al (2022) discuss how the higher granularity of analysis, the lack of invariant quantised acoustic representations and the variability of speech rate all contribute. DYMULTI could be run at a higher granularity of analysis, with features extracted directly from the speech stream, to help bridge this gap.…”

Section: Future Work (mentioning)

confidence: 99%

Word segmentation from transcriptions of child-directed speech using lexical and sub-lexical cues

GORIELY,

CAINES,

BUTTERY

2023

J. Child Lang.


We compare two frameworks for the segmentation of words in child-directed speech, PHOCUS and MULTICUE. PHOCUS is driven by lexical recognition, whereas MULTICUE combines sub-lexical properties to make boundary decisions, representing differing views of speech processing. We replicate these frameworks, perform novel benchmarking and confirm that both achieve competitive results. We develop a new framework for segmentation, the DYnamic Programming MULTIple-cue framework (DYMULTI), which combines the strengths of PHOCUS and MULTICUE by considering both sub-lexical and lexical cues when making boundary decisions. DYMULTI achieves state-of-the-art results and outperforms PHOCUS and MULTICUE on 15 of 26 languages in a cross-lingual experiment. As a model built on psycholinguistic principles, this validates DYMULTI as a robust model for speech segmentation and a contribution to the understanding of language acquisition.
