Joint Analysis of Sound Events and Acoustic Scenes Using Multitask Learning

Tonami, Noriyuki; Imoto, Keisuke; Yamanishi, Ryosuke; Yamashita, Yoichi

doi:10.1587/transinf.2020edp7036

Cited by 16 publications (33 citation statements)

References 31 publications (35 reference statements)

“…Most conventional works address ASC and SED separately; however, many acoustic scenes and sound events are related mutually. Considering that knowledge of acoustic scenes and sound events can mutually aid in their estimation, the joint analysis of acoustic scenes and sound events based on multitask learning has been proposed [18,19,23].…”

Section: Joint Analysis Of Acoustic Scenes and Sound Events Based On ... (mentioning)

confidence: 99%

“…Imoto and co-workers proposed ASC methods based on Bayesian generative models, in which information on sound events is considered [16,17]. Bear et al. [18], Tonami et al. [19], and Jung et al. [20] presented methods of jointly analyzing acoustic scenes and sound events based on the multitask learning (MTL) of ASC and SED. These works have revealed that utilizing the relationship between acoustic scenes and sound events improves the performance of each downstream task.…”

Section: Introduction (mentioning)

confidence: 99%

How Information on Acoustic Scenes and Sound Events Mutually Benefits Event Detection and Scene Classification Tasks

Imoto, Komatsu, Tsubaki et al. 2022

Preprint

Self Cite

Acoustic scene classification (ASC) and sound event detection (SED) are fundamental tasks in environmental sound analysis, and many methods based on deep learning have been proposed. Considering that information on acoustic scenes and sound events helps SED and ASC mutually, some researchers have proposed a joint analysis of acoustic scenes and sound events by multitask learning (MTL). However, conventional works have not investigated in detail how acoustic scenes and sound events mutually benefit SED and ASC. We, therefore, investigate the impact of information on acoustic scenes and sound events on the performance of SED and ASC by using domain adversarial training based on a gradient reversal layer (GRL) or model training with fake labels. Experimental results obtained using the TUT Acoustic Scenes 2016/2017 and TUT Sound Events 2016/2017 show that pieces of information on acoustic scenes and sound events are effectively used to detect sound events and classify acoustic scenes, respectively. Moreover, upon comparing GRL- and fake-label-based methods with single-task-based ASC and SED methods, single-task-based methods are found to achieve better performance. This result implies that even when using single-task-based ASC and SED methods, information on acoustic scenes may be implicitly utilized for SED and vice versa.
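
The abstract above names a gradient reversal layer without describing it. As a rough illustration only (not the authors' implementation), a GRL is commonly understood to pass its input through unchanged in the forward pass and to multiply the incoming gradient by a negative factor in the backward pass. The sketch below writes both passes by hand, without an autodiff framework; the function names, the scaling factor `lam`, and all numeric values are assumptions made for this example.

```python
# Minimal sketch of a gradient reversal layer (GRL), written by hand.
# All names and values are hypothetical; this is not the cited paper's code.

def grl_forward(features):
    """Forward pass: the GRL returns the features unchanged."""
    return list(features)


def grl_backward(grad_output, lam=1.0):
    """Backward pass: flip the sign of the gradient and scale it by lam."""
    return [-lam * g for g in grad_output]


features = [0.2, -1.5, 3.0]
grad_from_aux_head = [1.0, -2.0, 0.5]  # made-up gradient from an auxiliary head

# The forward pass is the identity.
assert grl_forward(features) == features

# The backward pass hands the shared layers the reversed, scaled gradient.
reversed_grad = grl_backward(grad_from_aux_head, lam=0.5)
print(reversed_grad)  # [-0.5, 1.0, -0.25]
```

In an adversarial setup of this kind, the reversed gradient would push the shared layers away from features that help the auxiliary task, which is one way to probe how much that information matters for the main task.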

“…We thus manually annotated the audio clips with sound event labels by the procedure described in [24,25]. The resulting audio clips contained 25 types of sound event label [17]. The event label annotations for our experiments are available in [26].…”

Section: Experiments, 4.1 Experimental Conditions (mentioning)

confidence: 99%

“…Furthermore, some studies have revealed that the contexts of scenes (e.g., "home," "office," and "cooking"), which are defined by locations, activities, and time, help increase the accuracy of SED [11][12][13][14][15][16][17][18][19]. For example, Heittola et al. [12] have proposed a cascade method for SED using results of acoustic scene classification (ASC).…”

Section: Introduction (mentioning)

confidence: 99%

“…For example, Heittola et al. [12] have proposed a cascade method for SED using results of acoustic scene classification (ASC). Bear et al. [13], Tonami et al. [14,17], and Komatsu et al. [16] have proposed joint models of SED and ASC to take advantage of the relationships between sound events and scenes; e.g., the sound event "mouse wheeling" tends to occur in the scene "office," whereas the event "car" is likely to occur in the scene "city center." Cartwright…”

Section: Introduction (mentioning)

confidence: 99%

Sound Event Detection Guided by Semantic Contexts of Scenes

Tonami, Imoto, Nagase et al. 2021

Preprint

Self Cite

Some studies have revealed that contexts of scenes (e.g., "home," "office," and "cooking") are advantageous for sound event detection (SED). Mobile devices and sensing technologies give useful information on scenes for SED without the use of acoustic signals. However, conventional methods can employ pre-defined contexts in inference stages but not undefined contexts. This is because one-hot representations of pre-defined scenes are exploited as prior contexts for such conventional methods. To alleviate this problem, we propose scene-informed SED where pre-defined scene-agnostic contexts are available for more accurate SED. In the proposed method, pre-trained large-scale language models are utilized, which enables SED models to employ unseen semantic contexts of scenes in inference stages. Moreover, we investigated the extent to which the semantic representation of scene contexts is useful for SED. Experimental results performed with TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016/2017 datasets show that the proposed method improves micro and macro F-scores by 4.34 and 3.13 percentage points compared with conventional Conformer- and CNN-BiGRU-based SED, respectively.
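
The abstract above reports both micro and macro F-scores. For readers unfamiliar with the distinction, the sketch below shows one common way the two averages are computed from per-class counts of true positives, false positives, and false negatives. The helper names and the counts are invented for illustration, and the cited paper's exact evaluation protocol (for example, segment-based versus event-based counting) is not stated in the abstract.

```python
# Sketch: micro- vs macro-averaged F-score from per-class counts.
# The counts below are invented for illustration, not taken from any paper.

def f_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f(counts):
    """counts maps each event class to a (tp, fp, fn) tuple."""
    # Micro: pool the counts over all classes, then compute one F-score.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f_score(tp, fp, fn)
    # Macro: compute an F-score per class, then take the unweighted mean.
    macro = sum(f_score(*c) for c in counts.values()) / len(counts)
    return micro, macro


counts = {
    "car": (90, 10, 10),           # frequent class, detected well
    "mouse wheeling": (1, 3, 5),   # rare class, detected poorly
}
micro, macro = micro_macro_f(counts)
print(f"micro F = {micro:.3f}, macro F = {macro:.3f}")
```

With these made-up counts the micro average is dominated by the frequent class, while the macro average weights both classes equally, which is why the two scores can move by different amounts for the same model change.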

Sound event detection in traffic scenes based on graph convolutional network to obtain multi-modal information

Jiang, Guo, Wang et al. 2024

Complex Intell. Syst.

Sound event detection involves identifying sound categories in audio and determining when they start and end. However, in real-life situations, sound events are usually not isolated. When one sound event occurs, there are often other related sound events that take place as co-occurrences or successive occurrences. The timing relationship of sound events can reflect their characteristics. Therefore, this paper proposes a sound event detection method for traffic scenes based on a graph convolutional network, which considers this timing relationship as a form of multimodal information. The proposed method involves using the acoustic event window method to obtain co-occurrences or successive occurrences of relationship information in the sound signal while filtering out possible noise relationship information. This information is then represented as a graphical structure. Next, the graph convolutional neural network is improved to balance relationship weights between neighbors and itself and to avoid excessive smoothing. It is used to learn the relationship information in the graph structure. Finally, the convolutional recurrent neural network is used to learn the acoustic feature information of sound events, and the relationship information of sound events is obtained by multi-modal fusion to enhance the performance of sound event detection. The experimental results show that using multi-modal information with the proposed method can effectively improve the performance of the model and enhance the perception ability of smart cars in their surrounding environment while driving.
