2022
DOI: 10.1109/jstsp.2022.3206084

Self-Supervised Language Learning From Raw Audio: Lessons From the Zero Resource Speech Challenge

Abstract: Recent progress in self-supervised or unsupervised machine learning has opened the possibility of building a full speech processing system from raw audio without using any textual representations or expert labels such as phonemes, dictionaries or parse trees. The contribution of the Zero Resource Speech Challenge series since 2015 has been to break down this long-term objective into four well-defined tasks (Acoustic Unit Discovery, Spoken Term Discovery, Discrete Resynthesis, and Spoken Language Modeling) and in…

Citations: cited by 10 publications (8 citation statements)
References: 117 publications (75 reference statements)
“…We then show that these segments can be clustered across a speech corpus to perform syllable discovery, enabling tokenization of the speech signal at the level of syllable-like units. Finally, we also show surprising results where our model trained only on English speech is able to perform zero-shot segmentation of syllables on another language (Estonian) and words in multiple non-English languages, in several cases outperforming the state-of-the-art models on the Zerospeech challenge [13].…”
Section: Introduction (mentioning)
confidence: 75%
“…Spoken term discovery, inferring the temporal boundary and identity of words and short phrases from untranscribed speech audio data, has been an important research direction in zero-resource speech processing [13]. The earliest work that tackles spoken term discovery dates back to at least the segmental dynamic programming algorithm proposed by Park and Glass [14].…”
Section: Related Work (mentioning)
confidence: 99%
“…Although this has remained an established paradigm for the study of word segmentation, in recent years the speech research community has made great advances in the area of -  . These studies aim to develop unsupervised methods that learn from raw speech audio only, pioneered in recent years by the Zero Resource Speech Challenge (ZRC) series (Dunbar et al, 2022).…”
Section: Segmenting From Raw Speech (mentioning)
confidence: 99%
“…The latter continue to perform significantly worse than the former: the top-performing model for the Zero Speech Challenge (ZRC) series segmentation task achieves a token F1-score of only 19.2 on the English portion of the TDE-17 test corpus, compared to 64.5 for the text-based topline system provided by the task. In an effort to explain this gap in performance, Dunbar et al (2022) discuss how the higher granularity of analysis, the lack of invariant quantised acoustic representations and the variability of speech rate all contribute. DYMULTI could be run at a higher granularity of analysis, with features extracted directly from the speech stream, to help bridge this gap.…”
Section: Future Work (mentioning)
confidence: 99%
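
The token F1-score quoted in the last statement can be illustrated with a minimal sketch. It assumes a simplified scoring rule in which a predicted word token counts as correct only when its (start, end) interval exactly matches a gold token; the function name `token_f1` and the toy intervals are hypothetical, and the official challenge evaluation may differ in how it aligns boundaries.

```python
def token_f1(predicted, gold):
    """Token precision, recall, and F1 over (start, end) word intervals.

    Assumption: a predicted token is a hit only if both of its
    boundaries exactly match a gold token (simplified scoring).
    """
    pred_set, gold_set = set(predicted), set(gold)
    hits = len(pred_set & gold_set)
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    # F1 is the harmonic mean of precision and recall (0 when no hits).
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Toy example: intervals are frame indices, invented for illustration.
gold = [(0, 3), (3, 7), (7, 9), (9, 12)]
predicted = [(0, 3), (3, 5), (5, 9), (9, 12)]
precision, recall, f1 = token_f1(predicted, gold)
```

In this toy case one misplaced boundary at frame 5 spoils two of the four tokens, which shows why token-level scores can fall well below boundary-level scores on raw speech.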