2022
DOI: 10.1162/tacl_a_00505
DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon

Abstract: Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a ‘space’ delimiter between words. Popular Bayesian non-parametric models for text segmentation (Goldwater et al., 2006, 2009) use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but only relies on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource …
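To make the Bayesian non-parametric framing concrete, below is a minimal sketch of the unigram Dirichlet-process segmentation model of Goldwater et al., applied to character strings; DP-Parse adapts these principles to speech, replacing the hard type counts used here with scores from an instance lexicon. All names and hyperparameters (ALPHA, P_STOP, max_len) are illustrative, not values from the paper.

```python
import math

# Unigram Dirichlet-process lexicon: a word's predictive probability is
# (count(w) + ALPHA * P0(w)) / (n_tokens + ALPHA), where P0 is a
# character-level base distribution with a geometric length penalty.
ALPHA = 1.0
P_CHAR = 1.0 / 26   # uniform base distribution over letters (assumed)
P_STOP = 0.5        # geometric word-length penalty in P0 (assumed)

def base_prob(word: str) -> float:
    """Base distribution P0(w): uniform characters, geometric length."""
    return (P_CHAR ** len(word)) * P_STOP * (1 - P_STOP) ** (len(word) - 1)

def word_prob(word: str, counts: dict, n_tokens: int) -> float:
    """DP predictive probability of a word token given lexicon counts."""
    return (counts.get(word, 0) + ALPHA * base_prob(word)) / (n_tokens + ALPHA)

def segment(chars: str, counts: dict, n_tokens: int, max_len: int = 8):
    """Viterbi search for the best segmentation under the unigram model."""
    best = [(-math.inf, -1)] * (len(chars) + 1)
    best[0] = (0.0, -1)
    for j in range(1, len(chars) + 1):
        for i in range(max(0, j - max_len), j):
            score = best[i][0] + math.log(word_prob(chars[i:j], counts, n_tokens))
            if score > best[j][0]:
                best[j] = (score, i)
    words, j = [], len(chars)
    while j > 0:  # backtrace the best split points
        i = best[j][1]
        words.append(chars[i:j])
        j = i
    return words[::-1]

# Words already counted in the lexicon dominate the base distribution:
print(segment("thedogsawthedog", {"the": 4, "dog": 3, "saw": 1}, 8))
# -> ['the', 'dog', 'saw', 'the', 'dog']
```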

Cited by 7 publications (17 citation statements). References 33 publications.
“…Self-supervised learning has also been considered for end-to-end phoneme and word segmentation [20,21]. Most recently, Algayres et al. [22] identified the key issues in applying text-based models to speech segmentation and proposed the DP-Parse algorithm, which uses an instance lexicon to mitigate clustering error. Herman [23] applied vector quantization for phoneme-like unit discovery, then ran a dynamic programming algorithm on the discovered units for word segmentation.…”
Section: Related Work (mentioning)
confidence: 99%
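As a rough illustration of the instance-lexicon idea this statement refers to, the sketch below estimates a soft pseudo-count for a candidate segment from the density of similar stored token embeddings, rather than from counts over clustered word types. The threshold TAU and the names (pseudo_count, instance_bank) are assumptions for illustration, not details from the paper.

```python
import numpy as np

TAU = 0.9  # cosine-similarity threshold (assumed hyperparameter)

def pseudo_count(seg_emb: np.ndarray, instance_bank: np.ndarray) -> float:
    """Soft count: how many stored token embeddings resemble this segment.

    instance_bank holds one embedding per previously segmented token
    (shape (N, d)); no hard clustering into word types is ever performed.
    """
    if instance_bank.size == 0:
        return 0.0
    sims = instance_bank @ seg_emb / (
        np.linalg.norm(instance_bank, axis=1) * np.linalg.norm(seg_emb) + 1e-9
    )
    return float((sims > TAU).sum())
```

Such a pseudo-count can stand in for count(w) in the Dirichlet-process model sketched above, which is how an instance lexicon avoids committing to error-prone word types.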
“…The SSE model from Algayres et al. (2022) is a neural network trained on top of a frozen Wav2vec2 (i.e., the Wav2vec2 parameters are kept unchanged). The speech intervals from the corpus are embedded with the Wav2vec2 representation after being distorted by manipulating their duration or pitch, creating acoustically new versions of each interval.…”
Section: Introduction (mentioning)
confidence: 99%
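A hedged sketch of the setup just described, assuming a HuggingFace Wav2vec2 checkpoint and torchaudio: the encoder is frozen, and each interval is distorted before embedding (here by crude resampling, which alters duration and pitch together; the paper manipulates them more carefully). The SSEHead module is a placeholder for the small trainable part, not the authors' architecture.

```python
import torch
import torchaudio
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
encoder.eval()
for p in encoder.parameters():  # keep Wav2vec2 frozen
    p.requires_grad = False

class SSEHead(torch.nn.Module):
    """Tiny trainable head producing a fixed-size segment embedding."""
    def __init__(self, dim: int = 768, out: int = 256):
        super().__init__()
        self.proj = torch.nn.Linear(dim, out)
    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.proj(frames.mean(dim=0))  # mean-pool frames, then project

def embed_interval(wav: torch.Tensor, head: SSEHead, speed: float = 1.0):
    """Embed one 16 kHz mono interval, optionally duration/pitch-distorted."""
    if speed != 1.0:  # crude distortion: resample, then treat as 16 kHz audio
        wav = torchaudio.functional.resample(wav, 16000, int(16000 * speed))
    with torch.no_grad():
        frames = encoder(wav.unsqueeze(0)).last_hidden_state[0]  # (T, 768)
    return head(frames)
```

Because only SSEHead receives gradients, the trainable parameter count stays small, which is consistent with the low-data regime described in the next statement.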
“…The SSE model from Algayres et al. (2022) could also be trained on simpler embeddings, such as Mel filterbanks or MFCCs, but the authors showed that the resulting SSEs have much lower word-level discriminative power (Algayres et al., 2022). Even though neural networks generally require a lot of training data, the SSE model from Algayres et al. has a small number of trainable parameters (the Wav2vec2 parameters being frozen during training) and can be trained to reasonable performance with only a few spoken utterances, here less than 1 min of audio.…”
Section: Introduction (mentioning)
confidence: 99%
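For reference, the simpler input features mentioned in this statement can be computed with torchaudio in a few lines; either could replace the frozen Wav2vec2 frames fed to the head above, at the cost of the weaker word-level discriminability the authors report. The file path is hypothetical.

```python
import torchaudio

mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=13)
fbank = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=40)

wav, sr = torchaudio.load("utterance.wav")  # hypothetical 16 kHz mono file
mfcc_frames = mfcc(wav)    # (1, 13, T) cepstral coefficients per frame
fbank_frames = fbank(wav)  # (1, 40, T) Mel filterbank energies per frame
```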