Interspeech 2022
DOI: 10.21437/interspeech.2022-226

Speech Sequence Embeddings using Nearest Neighbors Contrastive Learning

Cited by 3 publications (12 citation statements). References 0 publications.
“…In this work, we explore the setting where we have untranscribed speech of a target language to continue pretraining the self-supervised model f. In addition, we also explore pooling functions with trainable parameters, such as in [6,7,10]. We follow [9,10,19] and train the pooling function g with a contrastive loss.…”
Section: Task Overview
Citation type: mentioning (confidence: 99%)
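The setup described in this quote (frame-level features from a self-supervised model f, pooled into a single sequence embedding by a trainable function g) can be illustrated with a minimal PyTorch sketch. The attention-based pooling below is one common choice and purely illustrative; the class name, feature dimension, and the random tensor standing in for real wav2vec 2.0 / HuBERT frame features are assumptions, not the cited papers' exact implementation.

import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Trainable pooling g: collapses a (batch, T, D) sequence of frame
    features into a (batch, D) sequence embedding via learned attention."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one scalar relevance score per frame

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.score(frames), dim=1)  # (batch, T, 1)
        return (weights * frames).sum(dim=1)                # (batch, D)

# Stand-in for frame features produced by the self-supervised model f
# (e.g. 768-dimensional wav2vec 2.0 frames); real code would run f here.
frame_features = torch.randn(8, 120, 768)
g = AttentivePooling(768)
embeddings = g(frame_features)  # (8, 768), fed to the contrastive loss

During training, embeddings of matching speech segments would serve as positive pairs for a contrastive objective such as the NT-Xent loss mentioned in the next quote.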
“…In addition, we also explore pooling functions with trainable parameters, such as in [6,7,10]. We follow [9,10,19] and train the pooling function g with a contrastive loss. Specifically, we use NTXent [20] which is defined as…”
Section: Task Overview
Citation type: mentioning (confidence: 99%)
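The quote breaks off before the definition. For reference, the standard NT-Xent (normalized temperature-scaled cross-entropy) loss, as introduced in SimCLR, is, for a positive pair (i, j) drawn from a batch of 2N views:

\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}

where the z are the pooled embeddings, sim(·,·) is cosine similarity, and τ is a temperature hyperparameter. This is the textbook form; the exact instantiation in the citing paper (choice of positives, similarity, and temperature) may differ.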
“…One way to do this is to employ the method of Algayres, Nabli, Sagot, and Dupoux (2023), who used the contrastive approach for acoustic word embeddings initiated by Livescu and colleagues to train a classifier without labeled training data (Kamper, Jansen, & Goldwater, 2016; Settle & Livescu, 2016). Here, we describe briefly how it works.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)