2019
DOI: 10.48550/arxiv.1905.11796
Preprint

Self-supervised audio representation learning for mobile devices

Abstract: We explore self-supervised models that can potentially be deployed on mobile devices to learn general-purpose audio representations. Specifically, we propose methods that exploit the temporal context in the spectrogram domain. One method estimates the temporal gap between two short audio segments extracted at random from the same audio clip. The other methods are inspired by Word2Vec, a popular technique used to learn word embeddings, and aim at reconstructing a temporal spectrogram slice from past and future …
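The gap-estimation pretext task from the abstract can be made concrete with a short sketch: embed two spectrogram slices from the same clip and regress the time gap between them. Everything below (the encoder architecture, embedding size, and the regression loss) is an illustrative assumption, not the paper's actual model.

```python
import torch
import torch.nn as nn

class GapEncoder(nn.Module):
    """Maps one log-mel spectrogram slice to a fixed-size embedding."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global average pool -> (batch, 64, 1, 1)
        )
        self.fc = nn.Linear(64, emb_dim)

    def forward(self, x):              # x: (batch, 1, n_mels, n_frames)
        return self.fc(self.conv(x).flatten(1))

encoder = GapEncoder()
head = nn.Linear(2 * 128, 1)           # regresses a scalar gap from both embeddings

# Two slices cut at random offsets from the same clip; the target is the
# (normalized) time gap between them. Random tensors stand in for real audio.
a, b = torch.randn(8, 1, 64, 96), torch.randn(8, 1, 64, 96)
gap = torch.rand(8, 1)

pred = head(torch.cat([encoder(a), encoder(b)], dim=1))
loss = nn.functional.mse_loss(pred, gap)  # loss choice is an assumption
loss.backward()
```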

Cited by 17 publications (31 citation statements) | References 26 publications
“…For assessing the quality of the self-supervised embeddings, we conduct experiments with a linear classifier on the end-tasks. Linear separability is a standard way of measuring the power of self-supervised-learned features in the literature [12,38,52], i.e. if the representations disentangle factors of variations in the input, then it becomes easier to solve subsequent tasks.…”
Section: Results
confidence: 99%
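As a concrete illustration of this linear-probe protocol, the sketch below trains a linear classifier on frozen embeddings and reports test accuracy. The embedding dimension, number of classes, and classifier settings are illustrative placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 128))    # frozen self-supervised embeddings (stand-in)
y = rng.integers(0, 10, size=1000)  # downstream labels (e.g., keyword classes)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # linear classifier only
print(f"linear-probe accuracy: {clf.score(X_te, y_te):.3f}")
```

If the frozen representations already disentangle the relevant factors, even this simple linear model separates the classes well; that is the sense in which linear separability measures representation quality.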
“…A similar technique is proposed in [54] to learn from multiple views of the data. [52] defined self-supervised tasks for audio, inspired by word2vec [34]. [25] showed that video representations could be learned by exploiting audio-visual temporal synchronization.…”
Section: Related Work
confidence: 99%
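The word2vec-inspired tasks referenced here follow a CBoW-like recipe: reconstruct a held-out spectrogram slice from its past and future neighbors. A minimal sketch, with an assumed toy encoder/decoder and an MSE reconstruction loss (the paper's exact architecture may differ):

```python
import torch
import torch.nn as nn

n_mels, frames, emb = 64, 24, 128
enc = nn.Sequential(nn.Flatten(), nn.Linear(n_mels * frames, emb), nn.ReLU())
dec = nn.Linear(emb, n_mels * frames)

# Context: two past and two future slices around a held-out middle slice.
context = torch.randn(8, 4, n_mels, frames)  # (batch, 4 context slices, mels, frames)
target = torch.randn(8, n_mels, frames)      # the slice to reconstruct

# Encode each context slice, average the embeddings (the CBoW trick),
# then decode the averaged embedding back into the missing slice.
z = torch.stack([enc(context[:, i]) for i in range(4)], dim=1).mean(dim=1)
recon = dec(z).view(8, n_mels, frames)
loss = nn.functional.mse_loss(recon, target)
loss.backward()
```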
“…for downstream tasks. In speech representation learning (Latif et al, 2020), unsupervised techniques such as autoregressive modeling (Chung, Hsu, Tang and Glass, 2019;Chung and Glass, 2020a,b) and self-supervised modeling (Milde and Biemann, 2018;Tagliasacchi, Gfeller, Quitry and Roblek, 2019;Pascual, Ravanelli, Serrà, Bonafonte and Bengio, 2019) employ temporal context information for extracting speech representation. In our prior behavior modeling work, an unsupervised representative learning framework was proposed (Li, Baucom and Georgiou, 2017), which showed the promise of learning behavior representations based on the behavior stationarity hypothesis that nearby segments of speech share the same behavioral context.…”
Section: Related Work and Motivation
confidence: 99%
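The autoregressive modeling mentioned above predicts upcoming frames from past context (in the spirit of APC). A minimal sketch, where the recurrent model size and the prediction shift are assumptions:

```python
import torch
import torch.nn as nn

n_mels, shift = 64, 3
rnn = nn.GRU(input_size=n_mels, hidden_size=256, batch_first=True)
proj = nn.Linear(256, n_mels)

x = torch.randn(8, 100, n_mels)     # (batch, frames, mel bins), stand-in audio
h, _ = rnn(x[:, :-shift])           # encode the past frames
pred = proj(h)                      # predict `shift` frames ahead
loss = nn.functional.l1_loss(pred, x[:, shift:])  # L1, as in APC-style objectives
loss.backward()
```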
“…Through utilizing a triplet loss as an unsupervised objective with a subset of AudioSet [14] for model training, they showed improved performance on several downstream speech classification tasks. Inspired from seminal work in NLP [15], the work in [16] adopted a similar approach to learn audio representations (i.e. AUDIO2VEC) along with another "pretext" task of estimating temporal distance between audio segments.…”
Section: Introduction
confidence: 99%
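A minimal sketch of such a triplet objective, assuming a toy encoder and that positives come from the same clip (or an augmentation of the anchor) while negatives come from a different clip:

```python
import torch
import torch.nn as nn

emb = nn.Sequential(nn.Flatten(), nn.Linear(64 * 96, 128))  # toy encoder

anchor = emb(torch.randn(8, 64, 96))
positive = emb(torch.randn(8, 64, 96))   # same-clip / augmented segment
negative = emb(torch.randn(8, 64, 96))   # segment from a different clip

# Pull anchor toward positive, push it away from negative by at least `margin`.
triplet = nn.TripletMarginLoss(margin=1.0)
loss = triplet(anchor, positive, negative)
loss.backward()
```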