An improved speech segmentation quality measure: the R-value

Räsänen, Okko; Laine, Unto K.; Altosaar, Toomas

doi:10.21437/interspeech.2009-538

Cited by 36 publications (8 citation statements)

References 12 publications (39 reference statements)


“…Evaluation. To evaluate segmentation performance, we use precision, recall, F1 and R-value [51,23]. For the calculation of above metrics, we use a tolerance window of 50ms for SpokenCOCO and Estonian following [17], and 30ms for the Zerospeech Challenge [13].…”

Section: Implementation Details (mentioning)

confidence: 99%

Word Discovery in Visually Grounded, Self-Supervised Speech Models

Puyuan, Harwath

2022

Interspeech 2022


In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective. We demonstrate that a nearly identical model architecture (HuBERT) trained with a masked language modeling loss does not exhibit this same ability, suggesting that the visual grounding objective is responsible for the emergence of this phenomenon. We propose the use of a minimum cut algorithm to automatically predict syllable boundaries in speech, followed by a 2-stage clustering method to group identical syllables together. We show that our model not only outperforms a state-of-the-art syllabic segmentation method on the language it was trained on (English), but also generalizes in a zero-shot fashion to Estonian. Finally, we show that the same model is capable of zero-shot generalization for a word segmentation task on 4 other languages from the Zerospeech Challenge, in some cases beating the previous state-of-the-art.
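The citation statement above evaluates boundaries with precision, recall, F1 and R-value inside a tolerance window. A minimal Python sketch of that computation follows. It is not taken from any of the cited papers: the function names, the greedy one-to-one matching rule, and the use of fractions rather than percentages are assumptions made for illustration. The R-value itself follows the commonly cited definition, with over-segmentation OS = N_hyp / N_ref - 1 and hit rate HR equal to recall.

```python
import math


def count_hits(ref, hyp, tol):
    """Count reference boundaries matched by a hypothesized boundary.

    Assumption: greedy left-to-right, one-to-one matching within +/- tol
    (same time unit as the boundaries). Other matching rules exist.
    """
    ref, hyp = sorted(ref), sorted(hyp)
    hits, j = 0, 0
    for r in ref:
        # Skip hypothesized boundaries too far to the left of r.
        while j < len(hyp) and hyp[j] < r - tol:
            j += 1
        if j < len(hyp) and abs(hyp[j] - r) <= tol:
            hits += 1
            j += 1  # each hypothesized boundary is used at most once
    return hits


def segmentation_scores(ref, hyp, tol):
    """Return precision, recall, F1, over-segmentation and R-value."""
    if not ref:
        raise ValueError("reference boundary list must not be empty")
    hits = count_hits(ref, hyp, tol)
    precision = hits / len(hyp) if hyp else 0.0
    recall = hits / len(ref)  # hit rate HR, as a fraction
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    over_seg = len(hyp) / len(ref) - 1  # OS, as a fraction
    # Distance to the ideal point (HR = 1, OS = 0) ...
    r1 = math.sqrt((1 - recall) ** 2 + over_seg ** 2)
    # ... and a term that penalizes over-segmentation.
    r2 = (-over_seg + recall - 1) / math.sqrt(2)
    r_value = 1 - (abs(r1) + abs(r2)) / 2
    return {"hits": hits, "precision": precision, "recall": recall,
            "f1": f1, "over_segmentation": over_seg, "r_value": r_value}


if __name__ == "__main__":
    # Hypothetical boundaries in seconds, scored with a 20 ms tolerance.
    reference = [0.10, 0.50, 0.90]
    predicted = [0.11, 0.51, 0.70, 0.95]
    print(segmentation_scores(reference, predicted, tol=0.02))
```

Passing `tol=0.05` or `tol=0.03` would correspond to the 50 ms and 30 ms windows quoted above. Unlike F1, the R-value drops when extra boundaries are inserted, even if recall stays high.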



“…Other performance metrics can be defined for specific segmentation tasks, for example, R‐value [82] or path accuracy [73]; however, they are rarely used and are not therefore utilizable for comparison with most other studies.…”

Section: Methods (mentioning)

confidence: 99%

Using LSTM neural networks for cross‐lingual phonetic speech segmentation with an iterative correction procedure

Hanzlíček, Matoušek, Vít

2023

Computational Intelligence


This article describes experiments on speech segmentation using long short‐term memory recurrent neural networks. The main part of the paper deals with multi‐lingual and cross‐lingual segmentation, that is, it is performed on a language different from the one on which the model was trained. The experimental data involves large Czech, English, German, and Russian speech corpora designated for speech synthesis. For optimal multi‐lingual modeling, a compact phonetic alphabet was proposed by sharing and clustering phones of particular languages. Many experiments were performed exploring various experimental conditions and data combinations. We proposed a simple procedure that iteratively adapts the inaccurate default model to the new voice/language. The segmentation accuracy was evaluated by comparison with reference segmentation created by a well‐tuned hidden Markov model‐based framework with additional manual corrections. The resulting segmentation was also employed in a unit selection text‐to‐speech system. The generated speech quality was compared with the reference segmentation by a preference listening test.


“…We report the agreement between segment boundaries learned by the downsampling strategy and the boundaries of human-defined information-bearing units, such as phones. We evaluate phone segmentation performance through precision, recall, F1-score and over-segmentation robust R-value [Räsänen et al, 2009] of the predicted pseudo-unit boundaries with respect to phone boundaries obtained from the aligned frame labels. We also evaluate boundary prediction on a processed version of the TIMIT dataset [Garofolo et al, 1993] in which non-speech events have been trimmed to a maximum of 20 ms.…”

Section: Methods (mentioning)

confidence: 99%

Variable-rate hierarchical CPC leads to acoustic unit discovery in speech

Cuervo, Lancucki, Marxer et al.

2022

Preprint


The success of deep learning comes from its ability to capture the hierarchical structure of data by learning high-level representations defined in terms of low-level ones. In this paper we explore self-supervised learning of hierarchical representations of speech by applying multiple levels of Contrastive Predictive Coding (CPC). We observe that simply stacking two CPC models does not yield significant improvements over single-level architectures. Inspired by the fact that speech is often described as a sequence of discrete units unevenly distributed in time, we propose a model in which the output of a low-level CPC module is non-uniformly downsampled to directly minimize the loss of a high-level CPC module. The latter is designed to also enforce a prior of separability and discreteness in its representations by enforcing dissimilarity of successive high-level representations through focused negative sampling, and by quantization of the prediction targets. Accounting for the structure of the speech signal improves upon single-level CPC features and enhances the disentanglement of the learned representations, as measured by downstream speech recognition tasks, while resulting in a meaningful segmentation of the signal that closely resembles phone boundaries.
