Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation

Bhati, Saurabhchand; Villalba, Jesús; Żelasko, Piotr; Moro-Velázquez, Laureano; Dehak, Najim

doi:10.21437/interspeech.2021-1874

Cited by 29 publications

(33 citation statements)

References 24 publications

(53 reference statements)

“…Table I shows that the DPDP approach outperforms [25], [26], but performs slightly worse on most metrics compared to the recent state-of-the-art unsupervised phone segmentation approaches [13], [14], [38]. Of the three DPDP systems, the CPC+K-means model (here used as a DPDP scoring network for the first time) achieves the best precision and R-value, … V. DPDP for Unsupervised Word Segmentation from Symbolic Input: The goal in unsupervised word segmentation from symbolic input is to break up an input sequence (normally of phonemes or phones) into subsequences representing words.…”

Section: Intermediate Evaluation: Unsupervised Phone Segmentation (mentioning)

confidence: 99%

“…The idea is that a model would need to learn meaningful phonetic contrasts while being invariant to nuisance factors such as speaker. Bhati et al. [13] extend this by using a second segment-level CPC layer. The segmental CPC (SCPC) consists of a frame-level CPC module, a differentiable boundary detector operating on the learned features, and a segment-level CPC module operating on aggregated features from the lower layer.…”

Section: Related Work (mentioning)

confidence: 99%
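
As a rough illustration of the middle of the pipeline described in the statement above, the sketch below places a boundary wherever adjacent frame features are dissimilar and then mean-pools the frames of each segment. It is a simplification under stated assumptions: the hard threshold is not differentiable (the statement describes a differentiable detector), cosine dissimilarity and mean pooling are illustrative choices, and the names `detect_boundaries`, `pool_segments`, and `threshold` are hypothetical rather than taken from the SCPC paper.

```python
import math


def cosine_dissimilarity(a, b):
    """Return 1 - cosine similarity of two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def detect_boundaries(frames, threshold=0.5):
    """Hard (non-differentiable) stand-in for a boundary detector.

    A boundary is placed before frame i when frame i differs strongly
    from frame i - 1. Returns segment edges, including 0 and len(frames).
    """
    bounds = [0]
    for i in range(1, len(frames)):
        if cosine_dissimilarity(frames[i - 1], frames[i]) > threshold:
            bounds.append(i)
    bounds.append(len(frames))
    return bounds


def pool_segments(frames, bounds):
    """Mean-pool the frames in each segment into one segment-level vector."""
    segments = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        chunk = frames[start:end]
        dim = len(chunk[0])
        segments.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return segments


# Toy frame-level features: two frames of one "unit", then three of another.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
bounds = detect_boundaries(frames)
pooled = pool_segments(frames, bounds)
```

In a model of the kind described, the pooled vectors would be the input to the segment-level CPC module; here they are only computed, not trained on.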

“…E.g., the Bayesian embedded segmental GMM (BES-GMM) [9] performs probabilistic clustering and segmentation on fixed-dimensional representations of whole word-like segments, as does the related embedded segmental K-means (ES-KMeans) model [10]. Much more recently, self-supervised neural networks using contrastive predictive coding (CPC) [12] have been modified to include separate acoustic unit and word segmentation modules that are trained jointly [13], [14]. These joint models give state-of-the-art word segmentation results (more details in §II).…”

Section: Introduction (mentioning)

confidence: 99%

“…Given that direct word modelling [9], [10] and joint phone and word modelling [13], [14] have proven to be such fruitful directions for speech segmentation, maybe the original premise of [1] was flawed? This paper argues that their idea, where bottom-up acoustic unit discovery and word segmentation are performed separately, should be revisited.…”

Section: Introduction (mentioning)

confidence: 99%

“…For symbolic word segmentation, DPDP is applied with an autoencoding recurrent neural network (AE-RNN), an approach inspired by [19]. Chaining these two DPDP models gives similar performance to the joint self-supervised segmental [13] and aligned [14] CPC models on English data. The DPDP system also achieves the best reported word segmentation scores on the French and Mandarin ZeroSpeech 2017 and 2020 benchmarks [16], [20].…”

Section: Introduction (mentioning)

confidence: 99%

Word Segmentation on Discovered Phone Units with Dynamic Programming and Self-Supervised Scoring

Kamper

2022

Preprint

Recent work on unsupervised speech segmentation has used self-supervised models with a phone segmentation module and a word segmentation module that are trained jointly. This paper compares this joint methodology with an older idea: bottom-up phone-like unit discovery is performed first, and symbolic word segmentation is then performed on top of the discovered units (without influencing the lower level). I specifically describe a duration-penalized dynamic programming (DPDP) procedure that can be used for either phone or word segmentation by changing the self-supervised scoring network that gives segment costs. For phone discovery, DPDP is applied with a contrastive predictive coding clustering model, while for word segmentation it is used with an autoencoding recurrent neural network. The two models are chained in order to segment speech. This approach gives comparable word segmentation results to state-of-the-art joint self-supervised models on an English benchmark. On French and Mandarin data, it outperforms previous systems on the ZeroSpeech benchmarks. Analysis shows that the chained DPDP system segments shorter filler words well, but longer words might require an external top-down signal.
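
The duration-penalized dynamic programming idea in the abstract above can be sketched as a shortest-path search over segment boundaries: each candidate segment gets a cost from a scoring function plus a weighted penalty that depends on its duration, and the lowest-cost segmentation is recovered by backtracing. This is a minimal sketch, not the paper's implementation: `dpdp_segment`, `segment_cost`, `duration_penalty`, `weight`, and `max_len` are hypothetical names, and the toy squared-deviation cost and constant per-segment penalty are assumptions standing in for the self-supervised scoring network and the paper's actual penalty.

```python
def dpdp_segment(n, segment_cost, duration_penalty, weight=1.0, max_len=None):
    """Find the segmentation of n items that minimises the summed cost.

    segment_cost(start, end) scores the segment covering items [start, end).
    duration_penalty(duration) is added, scaled by `weight`, per segment.
    Returns (list of (start, end) segments, total cost).
    """
    if max_len is None:
        max_len = n
    best = [0.0] + [float("inf")] * n  # best[t]: lowest cost to segment items [0, t)
    back = [0] * (n + 1)               # back[t]: start of the last segment ending at t
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            cost = (best[start]
                    + segment_cost(start, end)
                    + weight * duration_penalty(end - start))
            if cost < best[end]:
                best[end] = cost
                back[end] = start
    # Backtrace from the end of the sequence to recover the segments.
    segments = []
    end = n
    while end > 0:
        segments.append((back[end], end))
        end = back[end]
    segments.reverse()
    return segments, best[n]


# Toy input: a scalar "feature" per frame with an obvious change point.
values = [0.0, 0.0, 0.0, 5.0, 5.0, 5.0]


def toy_cost(start, end):
    """Assumed stand-in scorer: squared deviation from the segment mean."""
    seg = values[start:end]
    mean = sum(seg) / len(seg)
    return sum((v - mean) ** 2 for v in seg)


def per_segment_penalty(duration):
    """Assumed penalty: a constant per segment, discouraging over-segmentation."""
    return 1.0


segments, total = dpdp_segment(len(values), toy_cost, per_segment_penalty)
```

Swapping `toy_cost` for a different scorer is what would let the same routine serve either phone-level or word-level segmentation, which is the reuse the abstract describes.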

A Compact Phoneme-To-Audio Aligner for Singing Voice

Zheng, Bai, Shi

2023

Advanced Data Mining and Applications

Open set classification of sound event

You, Wu, Lee

2024

Sci Rep

Sound is one of the primary forms of sensory information that we use to perceive our surroundings. Usually, a sound event is an audio clip obtained from an action. The action can be rhythm patterns, music genre, people speaking for a few seconds, etc. Sound event classification determines what kind of audio clip a given audio sequence is. It is now commonly solved with the following pipeline: audio pre-processing → perceptual feature extraction → classification algorithm. In this paper, we improve the traditional sound event classification algorithm to identify unknown sound events by using deep learning. A compact cluster structure in the feature space for known classes helps recognize unknown classes by leaving ample room to locate unknown samples in the embedded feature space. Based on this concept, we applied center loss and supervised contrastive loss to optimize the model. The center loss tries to minimize the intra-class distance by pulling the embedded features toward the cluster center, while the contrastive loss disperses the features of different classes from one another. In addition, we explored the performance of self-supervised learning in detecting unknown sound events. The experimental results demonstrate that our proposed open-set sound event classification algorithm and self-supervised learning approach achieve sustained performance improvements on various datasets.
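
The two losses named in the abstract above can be illustrated with a small pure-Python sketch. It uses common textbook forms as an assumption, since the abstract does not give formulas: center loss as half the mean squared distance between each embedding and its class center, and supervised contrastive loss as the average negative log-probability of same-class pairs under a temperature-scaled softmax over dot products. The function names, the `temperature` parameter, and the assumption that embeddings are already L2-normalized are illustrative, not taken from the paper.

```python
import math


def center_loss(embeddings, labels, centers):
    """Half the mean squared distance from each embedding to its class center."""
    total = 0.0
    for x, y in zip(embeddings, labels):
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, centers[y]))
    return 0.5 * total / len(embeddings)


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over L2-normalized embeddings (assumed).

    For each anchor, same-class samples are positives; all other samples
    form the softmax denominator. Anchors without positives are skipped.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue
        sims = {
            a: sum(u * v for u, v in zip(embeddings[i], embeddings[a])) / temperature
            for a in range(n) if a != i
        }
        log_denom = math.log(sum(math.exp(s) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Center loss: two class-0 embeddings, each at squared distance 1 from the center.
toy_center = center_loss([[0.0, 0.0], [2.0, 0.0]], [0, 0], {0: [1.0, 0.0]})

# Contrastive loss: the same embeddings scored with consistent and mixed labels.
points = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
loss_clustered = supervised_contrastive_loss(points, [0, 0, 1, 1])
loss_mixed = supervised_contrastive_loss(points, [0, 1, 0, 1])
```

When labels agree with the geometric clusters the contrastive loss is close to zero, and it grows when same-class samples sit far apart, which matches the compact-cluster intuition in the abstract.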
