2020
DOI: 10.48550/arxiv.2011.11588
The Zero Resource Speech Benchmark 2021: Metrics and baselines for unsupervised spoken language modeling

Cited by 13 publications (40 citation statements)
References 0 publications
“…In this paper, we use the CPC-big model from [17] trained on the LibriLight unlab-6k set [5]. The encoder consists of five convolutional layers, each with 512 channels, kernel sizes (10, 8, 4, 4, 4), and strides (5, 4, 2, 2, 2).…”
Section: Analysis of CPC Features, 2.1 Contrastive Predictive Coding (mentioning)
confidence: 99%
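The quoted hyperparameters describe the CPC-big convolutional encoder. As a rough illustration only, the following PyTorch sketch builds a five-layer Conv1d stack with 512 channels, kernel sizes (10, 8, 4, 4, 4) and strides (5, 4, 2, 2, 2); the normalization, activation, and padding choices are assumptions, not taken from the cited implementation.

```python
import torch
import torch.nn as nn

class CPCEncoder(nn.Module):
    """Convolutional encoder with the hyperparameters quoted above:
    five Conv1d layers, 512 channels each, kernel sizes (10, 8, 4, 4, 4)
    and strides (5, 4, 2, 2, 2). Normalization and activation choices
    here are illustrative assumptions."""

    def __init__(self, channels=512):
        super().__init__()
        kernels = (10, 8, 4, 4, 4)
        strides = (5, 4, 2, 2, 2)
        layers = []
        in_ch = 1  # raw waveform, single channel
        for k, s in zip(kernels, strides):
            layers += [
                nn.Conv1d(in_ch, channels, kernel_size=k, stride=s, padding=k // 2),
                nn.BatchNorm1d(channels),
                nn.ReLU(inplace=True),
            ]
            in_ch = channels
        self.net = nn.Sequential(*layers)

    def forward(self, wav):
        # wav: (batch, 1, samples) -> (batch, 512, frames).
        # Total stride is 5*4*2*2*2 = 160, i.e. one frame per 10 ms at 16 kHz.
        return self.net(wav)
```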
“…Finally, the linear classifier Wm is replaced with a single-layer transformer. We use the outputs of the second LSTM layer as speech features since they gave the best ABX phone discrimination results in [17]. In the remainder of the paper we refer to these as the CPC features.…”
Section: Analysis of CPC Features, 2.1 Contrastive Predictive Coding (mentioning)
confidence: 99%
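To make the feature-extraction step concrete, here is a hedged sketch of an autoregressive context network built as a stack of single-layer LSTMs so that the second layer's outputs can be read out directly, as the quote describes. The number of layers and the hidden size are illustrative assumptions, not the cited model's configuration.

```python
import torch
import torch.nn as nn

class CPCContext(nn.Module):
    """Autoregressive context network sketched as a stack of single-layer
    LSTMs so that intermediate layer outputs are accessible."""

    def __init__(self, dim=512, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.LSTM(dim, dim, batch_first=True) for _ in range(num_layers)
        )

    def forward(self, z):
        # z: (batch, frames, dim) encoder outputs.
        outputs = []
        x = z
        for lstm in self.layers:
            x, _ = lstm(x)
            outputs.append(x)
        # outputs[1] is the second LSTM layer, used as the "CPC features" above.
        return outputs

# Usage sketch: encoder frames (batch, 512, T) -> (batch, T, 512) -> layer 2 outputs.
# features = CPCContext()(encoder_frames.transpose(1, 2))[1]
```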
“…• CPC: We use the embeddings extracted from the pretrained CPC model from [51] as input. The model is trained with a context layer and predicts 12 steps into the future.…”
Section: B. Two-Stage Process for Extracting Segmental Representations (mentioning)
confidence: 99%
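The quoted setup, a context network trained to predict 12 future steps, corresponds to a CPC-style InfoNCE objective. The sketch below is a minimal, assumption-laden version: the per-step linear predictors, the in-batch negative sampling, and the tensor shapes are illustrative choices rather than the cited recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cpc_infonce_loss(context, targets, predictors, n_negatives=128):
    """Minimal InfoNCE sketch for CPC-style training over K future steps
    (K = 12 in the quote above).
    context:    (B, T, D) context-network outputs
    targets:    (B, T, D) encoder outputs to be predicted
    predictors: K linear layers, one per prediction offset"""
    B, T, D = targets.shape
    total = 0.0
    for k, W in enumerate(predictors, start=1):
        pred = W(context[:, : T - k])          # predictions k frames ahead, (B, T-k, D)
        pos = targets[:, k:]                   # true future frames,         (B, T-k, D)
        # Draw negatives uniformly from all frames in the batch (illustrative choice).
        neg_idx = torch.randint(0, B * T, (n_negatives,))
        neg = targets.reshape(B * T, D)[neg_idx]               # (N, D)
        pos_logit = (pred * pos).sum(-1, keepdim=True)         # (B, T-k, 1)
        neg_logit = pred @ neg.t()                             # (B, T-k, N)
        logits = torch.cat([pos_logit, neg_logit], dim=-1)
        # The positive example sits at index 0 of each row of logits.
        labels = torch.zeros(logits.shape[:-1], dtype=torch.long, device=logits.device)
        total = total + F.cross_entropy(
            logits.reshape(-1, 1 + n_negatives), labels.reshape(-1)
        )
    return total / len(predictors)

# Example: 12 prediction heads over 512-dimensional features.
predictors = nn.ModuleList(nn.Linear(512, 512, bias=False) for _ in range(12))
```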