Data Science – Analytics and Applications 2022
DOI: 10.1007/978-3-658-36295-9_6
Deep Learning Frameworks Applied For Audio-Visual Scene Classification

Cited by 5 publications (3 citation statements) · References 21 publications
“…As using an ensemble is a rule of thumb to improve ASC performance, and has proven effective in dealing with the issue of mismatched recording devices [50], [16], [17], [51], [52], [53], [54], we therefore apply an ensemble of multiple spectrogram inputs in this paper. In particular, we use three spectrograms: log-Mel [36], Gammatone (Gam) [55], and Constant-Q Transform (CQT) [36].…”
Section: B. Further Improve ASC Performance by an Ensemble of Multiple...
confidence: 99%
“…Since the predicted probabilities obtained from individual deep neural network architectures can complement each other, and a fusion of these predicted probabilities can help to improve the performance [20]-[22], we propose an ensemble of predicted probabilities in this paper, referred to as PROD late fusion. Let us consider the predicted probability of each deep neural network as p_n = (p_n1, p_n2, ..., p_nC), where C is the number of categories and n indexes the N networks evaluated; the predicted probability after PROD fusion, p_prod = (p_1, p_2, ..., p_C), is obtained by:…”
Section: Apply an Ensemble to Enhance the Performance
confidence: 99%
“…Inspired by this simple observation, an increasing number of studies seek to jointly model audio-visual information within scenes. Recent works [13], [14] show that joint learning of acoustic and visual features can bring additional benefits to AVSC. To exploit the audio-visual information simultaneously, a multi-modal system based on convolutional recurrent neural networks (CRNN) is presented in [15].…”
Section: Introduction
confidence: 99%