2022
DOI: 10.1162/tacl_a_00505
DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon

Abstract: Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a ‘space’ delimiter between words. Popular Bayesian non-parametric models for text segmentation (Goldwater et al., 2006, 2009) use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource …
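To make the Bayesian non-parametric framing concrete, below is a minimal sketch of the unigram Dirichlet-process segmentation model of Goldwater et al., applied to character strings; DP-Parse adapts these principles to speech, replacing the hard type counts used here with scores from an instance lexicon. All names and hyperparameters (ALPHA, P_STOP, max_len) are illustrative, not values from the paper.

```python
import math

# Unigram Dirichlet-process lexicon: a word's predictive probability is
# (count(w) + ALPHA * P0(w)) / (n_tokens + ALPHA), where P0 is a
# character-level base distribution with a geometric length penalty.
ALPHA = 1.0
P_CHAR = 1.0 / 26   # uniform base distribution over letters (assumed)
P_STOP = 0.5        # geometric word-length penalty in P0 (assumed)

def base_prob(word: str) -> float:
    """Base distribution P0(w): uniform characters, geometric length."""
    return (P_CHAR ** len(word)) * P_STOP * (1 - P_STOP) ** (len(word) - 1)

def word_prob(word: str, counts: dict, n_tokens: int) -> float:
    """DP predictive probability of a word token given lexicon counts."""
    return (counts.get(word, 0) + ALPHA * base_prob(word)) / (n_tokens + ALPHA)

def segment(chars: str, counts: dict, n_tokens: int, max_len: int = 8):
    """Viterbi search for the best segmentation under the unigram model."""
    best = [(-math.inf, -1)] * (len(chars) + 1)
    best[0] = (0.0, -1)
    for j in range(1, len(chars) + 1):
        for i in range(max(0, j - max_len), j):
            score = best[i][0] + math.log(word_prob(chars[i:j], counts, n_tokens))
            if score > best[j][0]:
                best[j] = (score, i)
    words, j = [], len(chars)
    while j > 0:  # backtrace the best split points
        i = best[j][1]
        words.append(chars[i:j])
        j = i
    return words[::-1]

# Words already counted in the lexicon dominate the base distribution:
print(segment("thedogsawthedog", {"the": 4, "dog": 3, "saw": 1}, 8))
# -> ['the', 'dog', 'saw', 'the', 'dog']
```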

Cited by 7 publications (17 citation statements). References 33 publications.
“…Self-supervised learning has also been considered for end-to-end phoneme and word segmentation [20,21]. Most recently, Algayres et al. [22] identified the key issues in applying text-based models to speech segmentation and proposed the DP-Parse algorithm, which uses an instance lexicon to mitigate clustering error. Herman [23] applied vector quantization for phoneme-like unit discovery, then ran a dynamic programming algorithm on the discovered units for word segmentation.…”
Section: Related Work (mentioning)
confidence: 99%
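As a rough illustration of the instance-lexicon idea this statement refers to, the sketch below estimates a soft pseudo-count for a candidate segment from the density of similar stored token embeddings, rather than from counts over clustered word types. The threshold TAU and the names (pseudo_count, instance_bank) are assumptions for illustration, not details from the paper.

```python
import numpy as np

TAU = 0.9  # cosine-similarity threshold (assumed hyperparameter)

def pseudo_count(seg_emb: np.ndarray, instance_bank: np.ndarray) -> float:
    """Soft count: how many stored token embeddings resemble this segment.

    instance_bank holds one embedding per previously segmented token
    (shape (N, d)); no hard clustering into word types is ever performed.
    """
    if instance_bank.size == 0:
        return 0.0
    sims = instance_bank @ seg_emb / (
        np.linalg.norm(instance_bank, axis=1) * np.linalg.norm(seg_emb) + 1e-9
    )
    return float((sims > TAU).sum())
```

Such a pseudo-count can stand in for count(w) in the Dirichlet-process model sketched above, which is how an instance lexicon avoids committing to error-prone word types.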
“…The SSE model from Algayres et al. (2022) is a neural network trained on top of a frozen Wav2vec2 (i.e., the Wav2vec2 parameters are kept unchanged). The speech intervals from the corpus are embedded with the Wav2vec2 representation after being distorted by manipulating their duration or pitch, creating acoustically new versions of each interval.…”
Section: Introduction (mentioning)
confidence: 99%
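A hedged sketch of the setup just described, assuming a HuggingFace Wav2vec2 checkpoint and torchaudio: the encoder is frozen, and each interval is distorted before embedding (here by crude resampling, which alters duration and pitch together; the paper manipulates them more carefully). The SSEHead module is a placeholder for the small trainable part, not the authors' architecture.

```python
import torch
import torchaudio
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
encoder.eval()
for p in encoder.parameters():  # keep Wav2vec2 frozen
    p.requires_grad = False

class SSEHead(torch.nn.Module):
    """Tiny trainable head producing a fixed-size segment embedding."""
    def __init__(self, dim: int = 768, out: int = 256):
        super().__init__()
        self.proj = torch.nn.Linear(dim, out)
    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.proj(frames.mean(dim=0))  # mean-pool frames, then project

def embed_interval(wav: torch.Tensor, head: SSEHead, speed: float = 1.0):
    """Embed one 16 kHz mono interval, optionally duration/pitch-distorted."""
    if speed != 1.0:  # crude distortion: resample, then treat as 16 kHz audio
        wav = torchaudio.functional.resample(wav, 16000, int(16000 * speed))
    with torch.no_grad():
        frames = encoder(wav.unsqueeze(0)).last_hidden_state[0]  # (T, 768)
    return head(frames)
```

Because only SSEHead receives gradients, the trainable parameter count stays small, which is consistent with the low-data regime described in the next statement.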
“…The SSE model from Algayres et al. (2022) could also be trained on simpler embeddings, such as Mel filterbanks or MFCCs, but the authors showed that the resulting SSEs have much lower word-level discriminative power (Algayres et al., 2022). Even though neural networks generally require a lot of training data, the SSE model from Algayres et al. has a small number of trainable parameters (the Wav2vec2 parameters being frozen during training) and can be trained to reasonable performance with only a few spoken utterances, here less than 1 min of audio.…”
Section: Introduction (mentioning)
confidence: 99%
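For reference, the simpler input features mentioned in this statement can be computed with torchaudio in a few lines; either could replace the frozen Wav2vec2 frames fed to the head above, at the cost of the weaker word-level discriminability the authors report. The file path is hypothetical.

```python
import torchaudio

mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=13)
fbank = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=40)

wav, sr = torchaudio.load("utterance.wav")  # hypothetical 16 kHz mono file
mfcc_frames = mfcc(wav)    # (1, 13, T) cepstral coefficients per frame
fbank_frames = fbank(wav)  # (1, 40, T) Mel filterbank energies per frame
```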