ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414406

DCASENET: An Integrated Pretrained Deep Neural Network for Detecting and Classifying Acoustic Scenes and Events

Abstract: Although acoustic scenes and events include many related tasks, their combined detection and classification have been scarcely investigated. We propose three architectures of deep neural networks that are integrated to simultaneously perform acoustic scene classification, audio tagging, and sound event detection. The first two architectures are inspired by human cognitive processes. The first architecture resembles the short-term perception for scene classification of adults, who can detect various sound event…
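
The abstract describes a single network shared across acoustic scene classification, audio tagging, and sound event detection. As a rough illustration only (not the authors' DCASENET; all layer sizes, names, and the pooling choice are assumptions), the sketch below shows one common way to realize such an integrated model: a shared convolutional trunk feeding two clip-level heads and one frame-level head.

```python
import torch
import torch.nn as nn

class IntegratedDCASEModel(nn.Module):
    """Hypothetical sketch, not the authors' DCASENET: one shared
    convolutional trunk with three task-specific heads."""

    def __init__(self, n_mels=64, n_scenes=10, n_events=14):
        super().__init__()
        # Shared low-level trunk over log-mel input (B, 1, T, n_mels)
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        feat_dim = 128 * (n_mels // 4)
        self.scene_head = nn.Linear(feat_dim, n_scenes)  # clip-level scene
        self.tag_head = nn.Linear(feat_dim, n_events)    # clip-level tags
        self.sed_head = nn.Linear(feat_dim, n_events)    # frame-level events

    def forward(self, x):
        h = self.trunk(x)                                    # (B, C, T', F')
        b, c, t, f = h.shape
        frames = h.permute(0, 2, 1, 3).reshape(b, t, c * f)  # frame features
        clip = frames.mean(dim=1)                            # temporal pooling
        return (self.scene_head(clip),   # scene logits (softmax downstream)
                self.tag_head(clip),     # tag logits (sigmoid downstream)
                self.sed_head(frames))   # per-frame event logits
```

Sharing one trunk lets the detection and classification tasks reuse low-level time-frequency features while keeping their decisions in separate heads.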

Cited by 16 publications (4 citation statements) | References 19 publications (28 reference statements)

Citation statements, ordered by relevance:
“…This is easy to understand because real-life coarse-grained scenes and fine-grained events contain their own different characteristics and attributes. Then, the second-worst model [11] based on MTL [10] attempts to exploit both shared joint and separate individual representations of scenes and events. The third method [14] jointly analyses scenes and events based on the one-way scene-to-event conditional loss.…”
Section: B. Results and Analysis
confidence: 99%
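
The "one-way scene-to-event conditional loss" of [14] is only named in this statement, not defined. As a hedged sketch of one plausible form, assuming a scene-to-event co-occurrence prior and a detached dependency so that event gradients do not flow back into the scene branch:

```python
import torch.nn.functional as F

def scene_conditioned_event_loss(scene_logits, event_logits,
                                 scene_labels, event_labels,
                                 scene_event_prior):
    # scene_event_prior: assumed (n_scenes, n_events) matrix of
    # P(event | scene), e.g. estimated from training co-occurrences.
    loss_scene = F.cross_entropy(scene_logits, scene_labels)

    # detach() makes the coupling one-way: the event loss conditions on
    # the scene posterior but sends no gradient back to the scene branch.
    scene_post = scene_logits.softmax(dim=-1).detach()   # (B, n_scenes)
    event_prior = scene_post @ scene_event_prior         # (B, n_events)
    event_prob = event_logits.sigmoid() * event_prior    # conditioned

    loss_event = F.binary_cross_entropy(
        event_prob.clamp(1e-6, 1 - 1e-6), event_labels.float())
    return loss_scene + loss_event
```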
“…Additionally, other systems were explored, such as predictors working in the time domain or classifiers based on the features obtained by estimating the fundamental frequencies of the audio segments. Our investigation shares common practices with recent works that explore detecting and classifying audio events in polyphonic environments using deep learning, namely [16], [17] and [18].…”
Section: Introduction
confidence: 89%
“…Then in [9], robust representations for environmental audio scenes and events are learned by generative model-driven representations and have proved to be effective in audio-related tasks. Another class of studies for joint analysis of scene and event refers to multi-task learning (MTL) [10]. Several convolutional layers are shared in a multi-task model as they [11] expect to learn and utilize shared low-level representations and separated high-level representations of scenes and events.…”
Section: Introduction
confidence: 99%
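
For the multi-task learning setup of [10], [11] that shares low-level convolutional layers across tasks, training typically minimizes a weighted sum of per-task losses. A minimal training-step sketch over a model like the one above (the loss weights and three-task decomposition are illustrative assumptions, not values from the cited work):

```python
import torch.nn.functional as F

def mtl_step(model, optimizer, batch,
             lam_scene=1.0, lam_tag=1.0, lam_sed=1.0):
    # Loss weights are illustrative assumptions, not values from [11].
    x, scene_y, tag_y, sed_y = batch
    scene_logits, tag_logits, sed_logits = model(x)

    # The shared trunk receives gradients from all three tasks (joint
    # low-level representations); each head stays task-specific.
    loss = (lam_scene * F.cross_entropy(scene_logits, scene_y)
            + lam_tag * F.binary_cross_entropy_with_logits(tag_logits, tag_y)
            + lam_sed * F.binary_cross_entropy_with_logits(sed_logits, sed_y))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```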