2021
DOI: 10.3390/jsan10040072

Comparison of Pre-Trained CNNs for Audio Classification Using Transfer Learning

Abstract: The paper investigates retraining options and the performance of pre-trained Convolutional Neural Networks (CNNs) for sound classification. CNNs were initially designed for image classification and recognition and were later extended to sound classification. Transfer learning is a promising paradigm in which already trained networks are retrained on different datasets. We selected three ‘Image’- and two ‘Sound’-trained CNNs, namely, GoogLeNet, SqueezeNet, ShuffleNet, VGGish, and YAMNet, and applied …

Cited by 57 publications (30 citation statements)
References 40 publications
“…VGGish includes a deep audio embedding mode and is the proposed method for classifying audio from YouTube videos. Pre-trained VGGish is often used for audio classification [40]. A characteristic of the model structure is that several feature extractions are performed using the four block structures combined by convolution and max pooling.…”
Section: Methods (mentioning)
confidence: 99%
“…Similarly, for the low-complexity acoustic scene classification dataset, the leading system uses resnet with a receptive field. For Urbansound8k, different systems are proposed [16, 54]. These systems use feature pre-processing and post processing, transfer learning, and other methods to enhance the accuracy of the system.…”
Section: Methods (mentioning)
confidence: 99%
“…In classification systems, the Mel filter bank energies are extracted using a fast Fourier transform-based algorithm to generate Mel spectrograms. Whether these systems are trained from the scratch using time–frequency representation of sounds [6, 12, 13] or if transfer learning is used to retrain systems trained on images to perform sound classification [5, 14, 15, 16], they employ Fourier transform for feature extraction. However, there are some crucial restrictions to performing Fourier spectral analysis, which makes Fourier transform valid under extremely general conditions [17, 18].…”
Section: Introduction (mentioning)
confidence: 99%
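
The statement above refers to Mel filter bank energies computed from a Fourier transform. As a rough illustration only, the following sketch computes such energies for a single frame using the Python standard library. It is not taken from the cited papers: the function names, the Hann window, the HTK-style mel formula, the naive DFT (standing in for an FFT), and the parameter values (8 filters, 256-sample frame, 8 kHz sample rate) are all assumptions chosen for the example.

```python
import cmath
import math


def hz_to_mel(hz):
    """Convert Hz to mel (HTK-style formula; one common convention, assumed here)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Power spectrum of one Hann-windowed frame via a naive O(n^2) DFT.

    A real system would use an FFT; the naive DFT keeps the sketch dependency-free.
    Returns bins 0 .. n/2.
    """
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2.0 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    top_mel = hz_to_mel(sample_rate / 2.0)
    # n_filters triangles need n_filters + 2 edge frequencies.
    edges_hz = [mel_to_hz(top_mel * i / (n_filters + 1))
                for i in range(n_filters + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, centre, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (centre - lo)
            falling = (hi - f) / (hi - centre)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def mel_energies(frame, sample_rate, n_filters=8):
    """Mel filter bank energies for one frame: filter weights times power spectrum."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    return [sum(w * p for w, p in zip(row, spec)) for row in bank]


# Usage: a 1 kHz sine tone sampled at 8 kHz (hypothetical test signal).
sample_rate = 8000
tone = [math.sin(2.0 * math.pi * 1000.0 * i / sample_rate) for i in range(256)]
energies = mel_energies(tone, sample_rate)
print(energies.index(max(energies)))  # index of the strongest mel band
```

Stacking such energy vectors over successive frames, usually after taking a logarithm, gives the Mel spectrogram that image-style CNNs take as input. The strongest band for the test tone should be the filter whose passband covers 1 kHz.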
“…Leveraging transfer learning with CNNs for audio classification problems is studied in some papers. In [19], the usage of both image and sound CNNs is studied. In the former case, data samples are image-based sound representations such as spectrograms.…”
Section: Related Work (mentioning)
confidence: 99%
“…In order to evaluate the effectiveness of the transfer learning approaches in our specific application scenario, we take into account 3 state-of-the-art deep learning models for audio classification: YAMNET, VGGish, and L3-Net [34]. These models mainly differ in the network’s architecture and training approach.…”
Section: Acoustic Features and Deep Audio Embeddings (mentioning)
confidence: 99%