2022
DOI: 10.48550/arxiv.2203.01205
Preprint

Audio Self-supervised Learning: A Survey

Abstract: Inspired by humans' cognitive ability to generalise knowledge and skills, Self-Supervised Learning (SSL) aims to discover general representations from large-scale data without requiring human annotations, which are expensive and time-consuming to obtain. Its success in the fields of computer vision and natural language processing has prompted its recent adoption in the field of audio and speech processing. Comprehensive reviews summarising the knowledge in audio SSL are currently missing. To fill thi…

Cited by 3 publications (4 citation statements)
References 93 publications (168 reference statements)
“…As acoustic features we use: 1. total signal duration, 2. zero crossing rate, 3. mean pitch, 4. local jitter, 5. local shimmer, 6. energy entropy, 7. spectral centroid, and 8. voiced to unvoiced ratio. The acoustic features are extracted using the ComParE2016 [30] feature set of our openSMILE toolkit [31], with the exception of duration, which is obtained with audiofile. We evaluate predictive performance using root mean squared error (RMSE).…”
Section: Probing #3: Feature Probing (mentioning)
Confidence: 99%
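The pipeline quoted above can be reproduced at a high level with the Python interfaces of the openSMILE and audiofile toolkits it mentions. The sketch below is only illustrative: the file paths, probing targets, and the Ridge regressor are hypothetical placeholders not specified in the excerpt; only the ComParE_2016 feature set, the audiofile-based duration, and the RMSE metric come from the quoted text.

# Illustrative sketch (not the cited authors' code): ComParE_2016 functionals
# via openSMILE, total duration via audiofile, RMSE as the probing metric.
import audiofile
import numpy as np
import opensmile
from sklearn.linear_model import Ridge          # hypothetical probe model
from sklearn.metrics import mean_squared_error

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def acoustic_features(path):
    # ComParE_2016 functionals for one file, plus total signal duration,
    # which is not part of the feature set and is read with audiofile.
    functionals = smile.process_file(path).to_numpy().ravel()
    return np.concatenate([functionals, [audiofile.duration(path)]])

# Placeholder file lists and regression targets for the probing task.
train_files, y_train = ["train_001.wav"], [0.7]
test_files, y_test = ["test_001.wav"], [0.4]

X_train = np.stack([acoustic_features(f) for f in train_files])
X_test = np.stack([acoustic_features(f) for f in test_files])

probe = Ridge().fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, probe.predict(X_test)))
print(f"RMSE: {rmse:.3f}")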
“…Recently, deep neural networks (DNNs) consisting of self-attention layers (i.e., transformers) have provided state-of-the-art results for speech emotion recognition (SER) and have substantially improved valence prediction [1,2,3,4,5]. These models are typically pre-trained on large corpora in a self-supervised fashion, with the main goal of improving automatic speech recognition performance; thus, they capture a large amount of linguistic information that is beneficial for that task.…”
Section: Introduction (mentioning)
Confidence: 99%
“…There are methods such as TERA (Liu et al., 2021a) and Conformer (Gulati et al., 2020), a convolution-augmented transformer used in speech recognition, and then ViT-like approaches such as the Keyword Transformer (KWT) (Berg et al., 2021) and the Audio Spectrogram Transformer (AST) (Gong et al., 2021a). In recent self-supervised audio representation learning methods, transformer-based encoders have seen much use alongside convolutional or convolutional-recurrent encoders (Liu et al., 2022).…”
Section: Transformers (mentioning)
Confidence: 99%
“…Attention mechanisms significantly impact deep learning models in many fields, enriching the information a model can learn from its inputs [15]. Attention mechanisms can select, modulate, and focus on the information most important to the target of our problem, like human attention [10]. Therefore, this paper will investigate the effectiveness of data augmentation, SSL models, and attention modules on emotion recognition via vocal bursts.…”
Section: Introduction (mentioning)
Confidence: 99%