Towards Learning a Universal Non-Semantic Representation of Speech

Shor, Joel; Jansen, Aren; Maor, Ronnie; Lang, Oran; Tuval, Omry; Quitry, Félix de Chaumont; Tagliasacchi, Marco; Shavitt, Ira; Emanuel, Dotan; Haviv, Yinnon

doi:10.21437/interspeech.2020-1242

Cited by 92 publications

(101 citation statements)

References 24 publications

“…CBoW [16,25], SG [16,25], TemporalGap [16,25], Triplet Loss [16,25], TRILL [13] … Table 2 shows that COLA embeddings consistently outperform all these methods. In particular, on acoustic scene classification, we obtain a competitive accuracy of 94% compared to 73% achieved with a triplet loss in [16].…”

Section: Results (mentioning)

confidence: 99%

“…It contains 2 millions excerpts of 10 seconds audio from YouTube videos that are annotated in a multi-label fashion with over 500 classes. This dataset has been used by [16,25,13] for self-supervised pre-training. Since our method is self-supervised, we never use Audioset labels.…”

Section: Datasets and Tasks (mentioning)

confidence: 99%

“…To allow for comparison with previous methods, we rely on datasets that have been previously used by [16,25,13]. For speaker identification, we use a 100-hours subset of LibriSpeech (LBS) [26] that contains audio of books read by 251 speakers, as well as the Voxceleb [27] subset used in [13], with 1,251 speakers. For keyword spotting, we use Speech Commands (SPC) [28] V1 and V2 to recognize 11 and 35 spoken commands (classes) from one second of audio, respectively.…”

Section: Datasets and Tasks (mentioning)

confidence: 99%

“…The instance generation is achieved through noise injection, shifting along time-frequency dimensions, and extracting samples in temporally close neighborhoods. Along similar lines, [13] proposed a benchmark for comparing speech representations on non-semantic tasks. Through utilizing a triplet loss as an unsupervised objective with a subset of AudioSet [14] for model training, they showed improved performance on several downstream speech classification tasks.…”

Section: Introduction (mentioning)

confidence: 99%

“…Our dissimilar pairs simply associate segments from different clips in the same batch, which does not require maintaining a memory bank of distractors as in MOCO. Our approach allows us to consider a large number of negatives for each positive pair in the loss function and bypass the need for a careful choice of negative examples, unlike triplet-based approaches [6,13]. COLA is also different from CPC [2] as it does not predict future latent representations from past ones.…”

Section: Introduction (mentioning)

confidence: 99%

Contrastive Learning of General-Purpose Audio Representations

Saeed, Grangier, Zeghidour

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

We introduce COLA, a self-supervised pre-training approach for learning a general-purpose representation of audio. Our approach is based on contrastive learning: it learns a representation which assigns high similarity to audio segments extracted from the same recording while assigning lower similarity to segments from different recordings. We build on top of recent advances in contrastive learning for computer vision and reinforcement learning to design a lightweight, easy-to-implement self-supervised model of audio. We pre-train embeddings on the large-scale Audioset database and transfer these representations to 9 diverse classification tasks, including speech, music, animal sounds, and acoustic scenes. We show that despite its simplicity, our method significantly outperforms previous self-supervised systems. We furthermore conduct ablation studies to identify key design choices and release a library to pre-train and fine-tune COLA models.
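The contrastive objective described in this abstract can be illustrated with a small sketch. This is not the released COLA library: the cosine similarity, the `temperature` value, the function names, and the toy embeddings are all assumptions chosen for illustration. It only shows the general idea of treating segments from the same recording as positives and the other segments in the batch as negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_contrastive_loss(anchors, positives, temperature=0.1):
    """Mean cross-entropy over the batch (hypothetical helper).

    anchors[i] and positives[i] are embeddings of two segments from the
    same recording; positives[j] with j != i serve as negatives for
    anchors[i], so no separate negative mining is needed.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_norm - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (made up): two recordings, two segments each.
anchors = [[1.0, 0.0], [0.0, 1.0]]
matched = [[0.9, 0.1], [0.1, 0.9]]
swapped = [matched[1], matched[0]]  # deliberately mispaired segments

loss_matched = in_batch_contrastive_loss(anchors, matched)
loss_swapped = in_batch_contrastive_loss(anchors, swapped)
```

With these toy values the correctly paired batch yields a much smaller loss than the mispaired one, which is the behavior the objective is meant to reward.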

Learning Efficient Representations for Keyword Spotting with Triplet Loss

Vygon, Mikhaylovskiy

2021

Lecture Notes in Computer Science

In the past few years, triplet loss-based metric embeddings have become a de-facto standard for several important computer vision problems, most notably, person reidentification. On the other hand, in the area of speech recognition the metric embeddings generated by the triplet loss are rarely used even for classification problems. We fill this gap, showing that a combination of two representation learning techniques (a triplet loss-based embedding and a variant of kNN for classification instead of cross-entropy loss) significantly (by 26% to 38%) improves the classification accuracy for convolutional networks on LibriSpeech-derived LibriWords datasets. To do so, we propose a novel phonetic similarity based triplet mining approach. We also match the current best published SOTA for Google Speech Commands dataset V2 10+2-class classification with an architecture that is about 6 times more compact and improve the current best published SOTA for 35-class classification on Google Speech Commands dataset V2 by over 40%.
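The two ingredients named in this abstract, a triplet loss and kNN classification over the learned embeddings, can be sketched as follows. This is a generic illustration under stated assumptions, not the authors' method: it uses a standard margin-based triplet loss with Euclidean distance and a plain majority-vote kNN, whereas the abstract refers to an unspecified kNN variant and a phonetic-similarity triplet mining scheme that are not reproduced here. The embeddings and the "yes"/"no" labels are made up.

```python
import math
from collections import Counter


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


def knn_predict(query, embeddings, labels, k=3):
    """Majority vote among the k nearest labeled embeddings."""
    ranked = sorted(
        zip(embeddings, labels),
        key=lambda pair: euclidean(query, pair[0]),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D embeddings for two hypothetical keyword classes.
train = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
         [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
labels = ["yes", "yes", "yes", "no", "no", "no"]

easy = triplet_loss(train[0], train[1], train[3])  # negative already far away
hard = triplet_loss(train[0], train[3], train[1])  # roles swapped on purpose
prediction = knn_predict([0.05, 0.05], train, labels)
```

In this toy setup the well-separated triplet contributes zero loss, the swapped triplet is penalized, and the query near the first cluster is assigned that cluster's label.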
