ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414406

DCASENET: An Integrated Pretrained Deep Neural Network for Detecting and Classifying Acoustic Scenes and Events

Abstract: Although acoustic scenes and events include many related tasks, their combined detection and classification have been scarcely investigated. We propose three architectures of deep neural networks that are integrated to simultaneously perform acoustic scene classification, audio tagging, and sound event detection. The first two architectures are inspired by human cognitive processes. The first architecture resembles the short-term perception for scene classification of adults, who can detect various sound event…
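
The abstract describes a single network shared across acoustic scene classification, audio tagging, and sound event detection. As a rough illustration only (not the authors' DCASENET; all layer sizes, names, and the pooling choice are assumptions), the sketch below shows one common way to realize such an integrated model: a shared convolutional trunk feeding two clip-level heads and one frame-level head.

```python
import torch
import torch.nn as nn

class IntegratedDCASEModel(nn.Module):
    """Hypothetical sketch, not the authors' DCASENET: one shared
    convolutional trunk with three task-specific heads."""

    def __init__(self, n_mels=64, n_scenes=10, n_events=14):
        super().__init__()
        # Shared low-level trunk over log-mel input (B, 1, T, n_mels)
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        feat_dim = 128 * (n_mels // 4)
        self.scene_head = nn.Linear(feat_dim, n_scenes)  # clip-level scene
        self.tag_head = nn.Linear(feat_dim, n_events)    # clip-level tags
        self.sed_head = nn.Linear(feat_dim, n_events)    # frame-level events

    def forward(self, x):
        h = self.trunk(x)                                    # (B, C, T', F')
        b, c, t, f = h.shape
        frames = h.permute(0, 2, 1, 3).reshape(b, t, c * f)  # frame features
        clip = frames.mean(dim=1)                            # temporal pooling
        return (self.scene_head(clip),   # scene logits (softmax downstream)
                self.tag_head(clip),     # tag logits (sigmoid downstream)
                self.sed_head(frames))   # per-frame event logits
```

Sharing one trunk lets the detection and classification tasks reuse low-level time-frequency features while keeping their decisions in separate heads.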

Cited by 16 publications (4 citation statements) | References 19 publications (28 reference statements)

Citation statements, ordered by relevance:
“…This is easy to understand because real-life coarse-grained scenes and fine-grained events contain their own different characteristics and attributes. Then, the second-worst model [11] based on MTL [10] attempts to exploit both shared joint and separate individual representations of scenes and events. The third method [14] jointly analyses scenes and events based on the one-way scene-to-event conditional loss.…”
Section: B. Results and Analysis
confidence: 99%
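
The "one-way scene-to-event conditional loss" of [14] is only named in this statement, not defined. As a hedged sketch of one plausible form, assuming a scene-to-event co-occurrence prior and a detached dependency so that event gradients do not flow back into the scene branch:

```python
import torch.nn.functional as F

def scene_conditioned_event_loss(scene_logits, event_logits,
                                 scene_labels, event_labels,
                                 scene_event_prior):
    # scene_event_prior: assumed (n_scenes, n_events) matrix of
    # P(event | scene), e.g. estimated from training co-occurrences.
    loss_scene = F.cross_entropy(scene_logits, scene_labels)

    # detach() makes the coupling one-way: the event loss conditions on
    # the scene posterior but sends no gradient back to the scene branch.
    scene_post = scene_logits.softmax(dim=-1).detach()   # (B, n_scenes)
    event_prior = scene_post @ scene_event_prior         # (B, n_events)
    event_prob = event_logits.sigmoid() * event_prior    # conditioned

    loss_event = F.binary_cross_entropy(
        event_prob.clamp(1e-6, 1 - 1e-6), event_labels.float())
    return loss_scene + loss_event
```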
“…Additionally, other systems were explored, such as predictors working in the time domain or classifiers based on the features obtained by estimating the fundamental frequencies of the audio segments. Our investigation shares common practices with recent works that explore detecting and classifying audio events in polyphonic environments using deep learning, namely [16], [17] and [18].…”
Section: Introduction
confidence: 89%
“…Then in [9], robust representations for environmental audio scenes and events are learned by generative model-driven representations and have proved to be effective in audio-related tasks. Another class of studies for joint analysis of scene and event refers to multi-task learning (MTL) [10]. Several convolutional layers are shared in a multi-task model as they [11] expect to learn and utilize shared low-level representations and separated high-level representations of scenes and events.…”
Section: Introduction
confidence: 99%
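
For the multi-task learning setup of [10], [11] that shares low-level convolutional layers across tasks, training typically minimizes a weighted sum of per-task losses. A minimal training-step sketch over a model like the one above (the loss weights and three-task decomposition are illustrative assumptions, not values from the cited work):

```python
import torch.nn.functional as F

def mtl_step(model, optimizer, batch,
             lam_scene=1.0, lam_tag=1.0, lam_sed=1.0):
    # Loss weights are illustrative assumptions, not values from [11].
    x, scene_y, tag_y, sed_y = batch
    scene_logits, tag_logits, sed_logits = model(x)

    # The shared trunk receives gradients from all three tasks (joint
    # low-level representations); each head stays task-specific.
    loss = (lam_scene * F.cross_entropy(scene_logits, scene_y)
            + lam_tag * F.binary_cross_entropy_with_logits(tag_logits, tag_y)
            + lam_sed * F.binary_cross_entropy_with_logits(sed_logits, sed_y))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```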