Aligned Contrastive Predictive Coding

Marxer, Ricard; Chorowski, Jan; Ciesielski, Grzegorz; Dzikowski, Jarosław; Lancucki, Adrian; Opala, Mateusz; Pusz, Piotr; Rychlikowski, Paweł; Stypułkowski, Michał

doi:10.21437/interspeech.2021-1544

Cited by 15 publications

(12 citation statements)

References 10 publications

“…Instead of treating classification independently for each future time-step as in standard CPC, the aligned CPC (ACPC) model of Chorowski et al. [21] outputs a sequence of predictions that are then aligned to future time-steps. Since the model encourages piece-wise constant latent features, the idea is that changes in these features would correspond to phone boundaries.…”

Section: Related Work (mentioning)

confidence: 99%

Word Segmentation on Discovered Phone Units with Dynamic Programming and Self-Supervised Scoring

Kamper

2022

Preprint

Recent work on unsupervised speech segmentation has used self-supervised models with a phone segmentation module and a word segmentation module that are trained jointly. This paper compares this joint methodology with an older idea: bottom-up phone-like unit discovery is performed first, and symbolic word segmentation is then performed on top of the discovered units (without influencing the lower level). I specifically describe a duration-penalized dynamic programming (DPDP) procedure that can be used for either phone or word segmentation by changing the self-supervised scoring network that gives segment costs. For phone discovery, DPDP is applied with a contrastive predictive coding clustering model, while for word segmentation it is used with an autoencoding recurrent neural network. The two models are chained in order to segment speech. This approach gives comparable word segmentation results to state-of-the-art joint self-supervised models on an English benchmark. On French and Mandarin data, it outperforms previous systems on the ZeroSpeech benchmarks. Analysis shows that the chained DPDP system segments shorter filler words well, but longer words might require an external top-down signal.
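As a rough illustration of the duration-penalized dynamic programming idea in this abstract, the sketch below segments a sequence of one-dimensional frames. It is not the paper's scoring network: the function name `dpdp_segment`, the toy segment cost (squared distance to the best single codebook entry) and the constant per-segment `penalty` standing in for the duration penalty are all assumptions made for this example.

```python
def dpdp_segment(frames, codebook, penalty):
    """Toy duration-penalized DP segmentation (illustrative assumptions only).

    frames:   list of floats, one feature value per frame
    codebook: list of floats, candidate unit centroids
    penalty:  constant cost added per segment; larger values give fewer,
              longer segments
    Returns the sorted list of segment end indices (exclusive).
    """
    n = len(frames)

    def seg_cost(start, end):
        # Cost of explaining frames[start:end] with the single best code.
        return min(sum((x - c) ** 2 for x in frames[start:end]) for c in codebook)

    best = [0.0] + [float("inf")] * n  # best[e]: minimum cost of frames[:e]
    prev = [0] * (n + 1)               # prev[e]: start of the last segment

    for end in range(1, n + 1):
        for start in range(end):
            cost = best[start] + seg_cost(start, end) + penalty
            if cost < best[end]:
                best[end] = cost
                prev[end] = start

    # Backtrack from the final frame to recover the boundaries.
    bounds = []
    end = n
    while end > 0:
        bounds.append(end)
        end = prev[end]
    return sorted(bounds)


boundaries = dpdp_segment([0.1, 0.0, -0.1, 1.0, 0.9, 1.1], [0.0, 1.0], 0.5)
print(boundaries)
```

Swapping `seg_cost` for a different scoring function is what would let the same procedure be reused at another level, which is the property the abstract emphasizes.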

“…Segmentation performance increases by adding to the model in [1] a second CPC at the segment level (as in [2]). Interestingly ACPC [5] and mACPC do not attain the same segmentation performance level despite their similarities and the offset correction. On the other hand they do achieve much better phoneme prediction rates, both frame synced (frame-wise accuracy) and through alignment (CTC PER).…”

Section: Comparative Study In Segmentation and Classification Of Phon... (mentioning)

confidence: 93%

“…Finally, K and K_s predictions are made at frame and segment levels conditioned on the corresponding context vectors, which are then aligned to M and M_s upcoming encoded frames and segments respectively. The ACPC prediction loss, as described in [5], is applied at both levels. The two prediction losses from frames and segments are summed into the total loss to be optimized.…”

Section: Multi-level ACPC (mentioning)

confidence: 99%

“…Speech self-supervised learning (SSL) without linguistic labels targets a representation that is useful for downstream problems, such as transcription, classification or understanding. Some of the work in the field has focused on automatically detecting boundaries of phonemes or words [1,2,3], while other work has concentrated on obtaining a representation that embeds readily available instantaneous phonemic information [4,5]. In other words, an encoding from which we may predict through linear transformations the phoneme of an audio frame.…”

Section: Introduction (mentioning)

confidence: 99%

“…Different variations of CPC have been proposed in the literature and have shown improvements on various downstream tasks. Chorowski et al. [5] presented Aligned CPC (ACPC), in which rather than producing individual predictions for each future representation, the model emits a sequence of K < M predictions which are aligned to the M upcoming representations. In this way, g_ar solves a simpler task of predicting the next symbols, but not their exact timing, while g_enc is incentivized to produce piece-wise constant latent codes.…”

Section: Introduction (mentioning)

confidence: 99%
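The alignment of K predictions to M upcoming representations described in the citation statement above can be sketched as a small dynamic program. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `align_predictions` is hypothetical, the input is assumed to be a precomputed K x M score matrix, and the alignment is assumed to be monotonic (it starts at the first prediction, and at each frame either stays on the current prediction or advances to the next one).

```python
def align_predictions(scores):
    """Monotonic alignment of K predictions to M frames (illustrative sketch).

    scores[k][m] is an assumed similarity between prediction k and
    upcoming frame m. Returns (total_score, assignment), where
    assignment[m] is the index of the prediction matched to frame m.
    """
    K = len(scores)
    M = len(scores[0])
    neg_inf = float("-inf")

    best = [[neg_inf] * M for _ in range(K)]  # best[k][m]: best score ending at (k, m)
    back = [[0] * M for _ in range(K)]        # back[k][m]: prediction used at frame m - 1
    best[0][0] = scores[0][0]                 # assumption: alignment starts at prediction 0

    for m in range(1, M):
        for k in range(min(K, m + 1)):
            stay = best[k][m - 1]
            advance = best[k - 1][m - 1] if k > 0 else neg_inf
            if advance > stay:
                best[k][m] = advance + scores[k][m]
                back[k][m] = k - 1
            else:
                best[k][m] = stay + scores[k][m]
                back[k][m] = k

    # Pick the best final prediction, then backtrack to recover the path.
    k = max(range(K), key=lambda i: best[i][M - 1])
    total = best[k][M - 1]
    assignment = [0] * M
    for m in range(M - 1, -1, -1):
        assignment[m] = k
        k = back[k][m]
    return total, assignment


total, assignment = align_predictions([[5, 4, 0, 0], [0, 0, 3, 6]])
print(total, assignment)
```

Because several consecutive frames may share one prediction, the predictor only has to get the order of upcoming symbols right, not their timing, which matches the motivation given in the quoted passage.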

Contrastive prediction strategies for unsupervised segmentation and categorization of phonemes and words

Cuervo, Maciej, Chorowski et al. 2021

Preprint

Self Cite

We investigate the performance on phoneme categorization and phoneme and word segmentation of several self-supervised learning (SSL) methods based on Contrastive Predictive Coding (CPC). Our experiments show that with the existing algorithms there is a trade-off between categorization and segmentation performance. We investigate the source of this conflict and conclude that the use of context building networks, albeit necessary for superior performance on categorization tasks, harms segmentation performance by causing a temporal shift on the learned representations. Aiming to bridge this gap, we take inspiration from the leading approach on segmentation, which simultaneously models the speech signal at the frame and phoneme level, and incorporate multi-level modelling into Aligned CPC (ACPC), a variation of CPC which exhibits the best performance on categorization tasks. Our multi-level ACPC (mACPC) improves in all categorization metrics and achieves state-of-the-art performance in word segmentation.
