Sound Event Detection by Multitask Learning of Sound Events and Scenes with Soft Scene Labels

Imoto, Keisuke; Tonami, Noriyuki; Yasuda, Masahiro; Yamanishi, Ryosuke; Yamashita, Yoichi

doi:10.1109/icassp40776.2020.9053912

Cited by 39 publications

(33 citation statements)

References 18 publications

(32 reference statements)

“…Scarce research on the combination of related sound classification tasks has been conducted [10,11,13,14]. Imoto et al [13] assumed that ASC and SED are related and performed them simultaneously using a multitask learning framework.…”

Section: Related Work (mentioning)

confidence: 99%

“…For instance, perceiving car horns and traffic sounds can be helpful for knowing that he/she is standing in a street. Imoto et al [13,14] explored the relation between ASC and SED, proposing DNNs to perform the two tasks simultaneously through a multi-task learning framework [15]. However, the integration of DNNs for related tasks remains in a preliminary stage because only pairs of two related tasks are investigated and motivations on the relationship of these tasks has not been explored.…”

Section: Introduction (mentioning)

confidence: 99%

DCASENET: An Integrated Pretrained Deep Neural Network for Detecting and Classifying Acoustic Scenes and Events

Jung, Shim, Kim, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Although acoustic scenes and events include many related tasks, their combined detection and classification have been scarcely investigated. We propose three architectures of deep neural networks that are integrated to simultaneously perform acoustic scene classification, audio tagging, and sound event detection. The first two architectures are inspired by human cognitive processes. The first architecture resembles the short-term perception for scene classification of adults, who can detect various sound events that are then used to identify the acoustic scene. The second architecture resembles the long-term learning of babies, being also the concept underlying self-supervised learning. Babies first observe the effects of abstract notions such as gravity and then learn specific tasks using such perceptions. The third architecture adds a few layers to the second one that solely perform a single task before its corresponding output layer. The aim is to build an integrated system that can serve as a pretrained model to perform the three abovementioned tasks. Experiments on three datasets demonstrate that the proposed architecture, called DcaseNet, can be either directly used for any of the tasks while providing suitable results or fine-tuned to improve the performance of one task. The code and pretrained DcaseNet weights are available at https://github.com/Jungjee/DcaseNet.

“…Cross-task Transfer. Transferring the learned knowledge from one task to another related task has been approved as an effective way for better data modeling and messages correlating [6,2,14]. Aytar et al [2] proposed a teacher-student framework that transfers the discriminative knowledge of visual recognition to the representation learning task of sound modality via minimizing the differences in the distribution of categories.…”

Section: Related Work (mentioning)

confidence: 99%

“…Aytar et al [2] proposed a teacher-student framework that transfers the discriminative knowledge of visual recognition to the representation learning task of sound modality via minimizing the differences in the distribution of categories. Imoto et al [14] proposed a method for sound event detection by transferring the knowledge of scenes with soft labels. Gan et al [8] transferred the visual object location knowledge for auditory localization learning.…”

Section: Related Work (mentioning)

confidence: 99%

“…This multi-task technique is very common within one single modality, such as solving depth estimation, surface normal estimation and semantic segmentation from one single image [7], or recognizing acoustic scenes and sound events from audio [14]. We apply this idea to multi-modality, and implement with Equation (4), encouraging the multimodal model to learn the underlying relationship between the sound events and the scenes for solving the two tasks simultaneously.…”

Section: Audiovisual Representation For Multi-task (mentioning)

confidence: 99%
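
The multitask idea described in the statements above (recognizing sound events and acoustic scenes simultaneously, with soft scene labels) can be illustrated as a weighted sum of two losses. The following is a minimal sketch under stated assumptions, not the implementation of any cited paper: the function names, the weights `alpha` and `beta`, and the choice of binary cross-entropy for events and cross-entropy against a soft distribution for scenes are all hypothetical choices made for illustration.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over multi-label event predictions."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)


def soft_cross_entropy(probs, soft_targets, eps=1e-12):
    """Cross-entropy against a soft target distribution (not one-hot)."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(probs, soft_targets))


def multitask_loss(event_probs, event_targets,
                   scene_probs, scene_soft_targets,
                   alpha=1.0, beta=1.0):
    """Hypothetical joint objective: alpha * event loss + beta * scene loss."""
    event_loss = binary_cross_entropy(event_probs, event_targets)
    scene_loss = soft_cross_entropy(scene_probs, scene_soft_targets)
    return alpha * event_loss + beta * scene_loss


if __name__ == "__main__":
    # Two event classes (multi-label) and two scene classes (soft label).
    loss = multitask_loss(
        event_probs=[0.5, 0.5], event_targets=[1.0, 0.0],
        scene_probs=[0.5, 0.5], scene_soft_targets=[0.7, 0.3],
    )
    print(loss)
```

Setting `beta=0.0` in this sketch reduces it to a single-task event-detection loss, which makes it easy to compare the joint objective against an events-only baseline.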

Cross-Task Transfer for Geotagged Audiovisual Aerial Scene Recognition

Mou et al. 2020

Lecture Notes in Computer Science

Aerial scene recognition is a fundamental task in remote sensing and has recently received increased interest. While the visual information from overhead images with powerful models and efficient algorithms yields considerable performance on scene recognition, it still suffers from the variation of ground objects, lighting conditions etc. Inspired by the multi-channel perception theory in cognition science, in this paper, for improving the performance on the aerial scene recognition, we explore a novel audiovisual aerial scene recognition task using both images and sounds as input. Based on an observation that some specific sound events are more likely to be heard at a given geographic location, we propose to exploit the knowledge from the sound events to improve the performance on the aerial scene recognition. For this purpose, we have constructed a new dataset named AuDio Visual Aerial sceNe reCognition datasEt (ADVANCE). With the help of this dataset, we evaluate three proposed approaches for transferring the sound event knowledge to the aerial scene recognition task in a multimodal learning framework, and show the benefit of exploiting the audio information for the aerial scene recognition. The source code is publicly available for reproducibility purposes.

Polyphonic Sound Event Detection Using Modified Recurrent Temporal Pyramid Neural Network

Venkatesh, Koolagudi 2024

Communications in Computer and Information Science
