Sound Event Detection by Multitask Learning of Sound Events and Scenes with Soft Scene Labels

Imoto, Keisuke; Tonami, Noriyuki; Yasuda, Masahiro; Yamanishi, Ryosuke; Yamashita, Yoichi

doi:10.48550/arxiv.2002.05848

Cited by 3 publications

(4 citation statements)

References 16 publications

“…Cross-task Transfer. Transferring the learned knowledge from one task to another related task has been approved as an effective way for better data modeling and messages correlating [7,2,14]. Aytar et al [2] proposed a teacher-student framework that transfers the discriminative knowledge of visual recognition to the representation learning task of sound modality via minimizing the differences in the distribution of categories.…”

Section: Related Work (mentioning)

confidence: 99%

“…Chaplot et al [5] utilized a dual-attention unit to align textual and visual representations with the transferred knowledge of words and objects. Due to the partial correlation of scenes and sound events, Imoto et al [14] proposed a method for sound event detection by transferring the knowledge of scenes with soft labels. Salem et al [25] proposed to transfer the sound clustering knowledge to the image recognition task by predicting the distribution of sound clusters from an overhead image, similarly work can be found in [22].…”

Section: Related Work (mentioning)

confidence: 99%

“…This simultaneous multi-task technique is very common within one single modality, such as solving depth estimation, surface normal estimation and semantic segmentation from one single image [8], or recognizing acoustic scenes and sound events from audio [14], hoping that the model can learn the underlying relationships among the tasks. We implement this idea by either Equation (5) or (6), to encourage the model to solve two tasks simultaneously and find the underlying relation between sound events and scenes.…”

Section: Audiovisual Representation For Multi-task (mentioning)

confidence: 99%
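
The joint training of sound events and scenes mentioned in these statements can be illustrated with a small loss computation. This is only a sketch under stated assumptions, not the formulation of Imoto et al. or of Equations (5) and (6) in the citing paper: it assumes a binary cross-entropy term for multi-label event detection, a cross-entropy term against a soft (non one-hot) scene distribution, and a hypothetical weighting parameter `scene_weight`. All function names and example values are illustrative.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over per-event probabilities (multi-label)."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)


def soft_cross_entropy(probs, soft_targets, eps=1e-12):
    """Cross-entropy between predicted scene probabilities and a soft label.

    soft_targets is a distribution over scenes rather than a one-hot vector.
    """
    return sum(-q * math.log(max(p, eps)) for p, q in zip(probs, soft_targets))


def multitask_loss(event_probs, event_targets, scene_probs, scene_soft_targets,
                   scene_weight=1.0):
    """Event loss plus weighted scene loss (scene_weight is an assumed knob)."""
    event_loss = binary_cross_entropy(event_probs, event_targets)
    scene_loss = soft_cross_entropy(scene_probs, scene_soft_targets)
    return event_loss + scene_weight * scene_loss


# Hypothetical values: three event classes, three scene classes.
loss = multitask_loss(
    event_probs=[0.9, 0.2, 0.7],
    event_targets=[1.0, 0.0, 1.0],
    scene_probs=[0.6, 0.3, 0.1],
    scene_soft_targets=[0.7, 0.2, 0.1],
    scene_weight=0.5,
)
```

With a one-hot scene label the scene term reduces to ordinary cross-entropy; a soft label lets a clip contribute to several related scenes at once.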

Cross-Task Transfer for Geotagged Audiovisual Aerial Scene Recognition

Mou

et al. 2020

Lecture Notes in Computer Science

Aerial scene recognition is a fundamental task in remote sensing and has recently received increased interest. While the visual information from overhead images with powerful models and efficient algorithms yields considerable performance on scene recognition, it still suffers from the variation of ground objects, lighting conditions etc. Inspired by the multi-channel perception theory in cognition science, in this paper, for improving the performance on the aerial scene recognition, we explore a novel audiovisual aerial scene recognition task using both images and sounds as input. Based on an observation that some specific sound events are more likely to be heard at a given geographic location, we propose to exploit the knowledge from the sound events to improve the performance on the aerial scene recognition. For this purpose, we have constructed a new dataset named AuDio Visual Aerial sceNe reCognition datasEt (ADVANCE). With the help of this dataset, we evaluate three proposed approaches for transferring the sound event knowledge to the aerial scene recognition task in a multimodal learning framework, and show the benefit of exploiting the audio information for the aerial scene recognition. The source code is publicly available for reproducibility purposes.

“…Recently there have been few studies to jointly conduct the ASC and the audio tagging task [10][11][12]. Bear et.…”

Section: Introduction (mentioning)

confidence: 99%

Acoustic Scene Classification using Audio Tagging

Jung, Shim, Kim

et al. 2020

Preprint

Acoustic scene classification systems using deep neural networks classify given recordings into pre-defined classes. In this study, we propose a novel scheme for acoustic scene classification which adopts an audio tagging system inspired by the human perception mechanism. When humans identify an acoustic scene, the existence of different sound events provides discriminative information which affects the judgement. The proposed framework mimics this mechanism using various approaches. Firstly, we employ three methods to concatenate tag vectors extracted using an audio tagging system with an intermediate hidden layer of an acoustic scene classification system. We also explore the multi-head attention on the feature map of an acoustic scene classification system using tag vectors. Experiments conducted on the detection and classification of acoustic scenes and events 2019 task 1-a dataset demonstrate the effectiveness of the proposed scheme. Concatenation and multi-head attention show a classification accuracy of 75.66 % and 75.58 %, respectively, compared to 73.63 % accuracy of the baseline. The system with the proposed two approaches combined demonstrates an accuracy of 76.75 %.
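
The concatenation idea described in this abstract can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the abstract does not specify the three concatenation methods, so the sketch simply appends a tag vector to a hidden-layer activation and passes the result through one dense layer with a softmax. The function names, dimensions, and weights are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def dense(x, weights, bias):
    """Fully connected layer: one row of weights per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def classify_scene(hidden, tag_vector, weights, bias):
    """Concatenate hidden activations with tag scores, then classify."""
    fused = list(hidden) + list(tag_vector)  # simple feature concatenation
    return softmax(dense(fused, weights, bias))


# Hypothetical sizes: 3 hidden units, 2 audio tags, 2 scene classes.
hidden = [0.2, -0.1, 0.4]
tag_vector = [0.9, 0.1]
weights = [
    [0.5, -0.2, 0.1, 1.0, -1.0],
    [-0.3, 0.4, 0.2, -1.0, 1.0],
]
bias = [0.0, 0.0]
scene_probs = classify_scene(hidden, tag_vector, weights, bias)
```

The output layer's input width must equal the hidden size plus the number of tags, which is the main practical consequence of this kind of fusion.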
