2021
DOI: 10.48550/arxiv.2106.06840
Preprint

Deep Learning Frameworks Applied For Audio-Visual Scene Classification

Abstract: In this paper, we present deep learning frameworks for audio-visual scene classification (SC) and indicate how individual visual and audio features, as well as their combination, affect SC performance. Our extensive experiments, conducted on the DCASE (IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events) Task 1B development dataset, achieve best classification accuracies of 82.2%, 91.1%, and 93.9% with audio input only, visual input only, and combined audio-visual input, respect…

citations
Cited by 2 publications
(2 citation statements)
references
References 5 publications
(10 reference statements)
“…To deal with the ASC challenge of mismatched recording devices, the state-of-the-art systems mainly leverage ensemble techniques: Ensemble of spectrogram inputs [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12] or ensemble of different classification models [13], [14], [15]. Although these approaches prove effective to deal with the issue of mismatched recording devices and achieve potential results, they present large model complexity.…”
Section: Introduction (mentioning)
confidence: 99%
“…To deal with one of the main ASC challenges, mismatched recording devices, a variety of methods have been proposed, which mainly make use of ensemble techniques: Ensemble of spectrogram inputs [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11] (i.e., This approach uses multiple spectrogram inputs but only one model architecture) or ensemble of different classification models [12], [13], [14] (i.e., This approach uses only one spectrogram input, but explores the spectrogram by different model architectures). Although these approaches help to achieve good results, they show very large footprint models, which causes challenging to apply on edge-devices.…”
Section: Introduction (mentioning)
confidence: 99%