2018 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt.2018.8639515

A K-Nearest Neighbours Approach To Unsupervised Spoken Term Discovery

Abstract: Unsupervised spoken term discovery is the task of finding recurrent acoustic patterns in speech without any annotations. Current approaches consist of two steps: (1) discovering similar patterns in speech, and (2) partitioning those pairs of acoustic tokens using graph clustering methods. We propose a new approach for the first step. Previous systems used various approximation algorithms to make the search tractable on large amounts of data. Our approach is based on an optimized k-nearest neighbours (KNN) sea…

Cited by 9 publications (12 citation statements)
References 20 publications
“…Finding positive pairs of speech sequences is an area of research called unsupervised term discovery (UTD) [16][17][18][28]. Such UTD systems can be DTW alignment based [16] or involve a k-Nearest-Neighbours search [28]. We opted for the latter, as it is both scalable and among the state-of-the-art methods.…”
Section: Finding and Choosing Pairs of Speech Embeddings
confidence: 99%
“…We opted for the latter, as it is both scalable and among the state-of-the-art methods. It exhaustively encodes all possible speech sequences with an embedding model and uses an optimised k-NN search [29] to retrieve acoustically similar pairs of speech sequences (see the details in [28]). In our experiments, we used the pairs retrieved by k-NN on GD-PLP encoded sequences to train our self-supervised models (CAE, Siamese, CAE-Siamese).…”
Section: Finding and Choosing Pairs of Speech Embeddings
confidence: 99%
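The retrieval step this citation describes can be sketched as a brute-force cosine k-NN over segment embeddings. This is a minimal illustration only: the paper uses an optimised k-NN search rather than an exhaustive similarity matrix, and the `knn_pairs` helper, array shapes, and toy embeddings below are assumptions, not the authors' code.

```python
import numpy as np

def knn_pairs(embeddings, k=5):
    """Retrieve the k nearest neighbours of each speech-segment embedding
    by cosine similarity, returning (query, neighbour) index pairs."""
    # L2-normalise rows so that a dot product equals cosine similarity.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # a segment must not match itself
    # Sort each row by descending similarity; keep the k best neighbours.
    neighbours = np.argsort(-sims, axis=1)[:, :k]
    return [(i, j) for i in range(len(embeddings)) for j in neighbours[i]]

# Toy example: four 2-D "embeddings" forming two acoustically similar pairs.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
pairs = knn_pairs(emb, k=1)  # → [(0, 1), (1, 0), (2, 3), (3, 2)]
```

In practice the all-pairs similarity matrix does not fit in memory for large corpora, which is why optimised (often approximate) k-NN indexes are used instead of this exhaustive version.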
“…The model is a convolution- and transformer-based embedder trained with the NT-Xent contrastive loss [22]. Building on similar ideas in vision and speech, we select our positive examples through a mix of time-stretching data augmentation [23] and k-Nearest Neighbors search [24,25]. Figure 1 gives an overview of our method.…”
Section: Introduction
confidence: 99%
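The NT-Xent (normalised temperature-scaled cross-entropy) loss mentioned in this citation can be sketched as follows. This is a minimal NumPy version for a batch of positive embedding pairs; the function name and batch layout are illustrative assumptions, and actual systems compute this inside a deep-learning framework so it can be backpropagated.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent loss over a batch of N positive pairs (z1[i], z2[i]).
    Every other embedding in the 2N-row batch serves as a negative."""
    z = np.concatenate([z1, z2], axis=0)              # (2N, d) stacked views
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine via dot product
    sims = z @ z.T / temperature                      # scaled similarity matrix
    np.fill_diagonal(sims, -np.inf)                   # a view is not its own negative
    n = len(z1)
    # Row i (from z1) pairs with row i + n (from z2), and vice versa.
    pos_idx = np.concatenate([np.arange(n) + n, np.arange(n)])
    log_softmax = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -log_softmax[np.arange(2 * n), pos_idx].mean()
```

The loss is small when each embedding is closer to its positive (e.g. a time-stretched or k-NN-retrieved version of the same term) than to every other segment in the batch, and grows as negatives become more similar than positives.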