2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8461975

Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network

Abstract: In this paper, we present a gated convolutional neural network and a temporal attention-based localization method for audio classification, which won the 1st place in the large-scale weakly supervised sound event detection task of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 challenge. The audio clips in this task, which are extracted from YouTube videos, are manually labelled with one or a few audio tags but without time stamps of the audio events, which is known as weakly labelled…
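The abstract's two key ideas, gating the convolutional feature maps and attention-based temporal pooling of frame-level predictions into clip-level tags, can be sketched in a few lines. This is a minimal illustration rather than the authors' released model: the single gated block, the layer sizes, and the softmax-over-time attention normalisation below are assumptions made for clarity.

```python
# Minimal sketch (assumed architecture details, not the authors' code) of a
# gated convolutional block and attention-based temporal pooling for
# weakly labelled audio tagging.
import torch
import torch.nn as nn


class GatedConvBlock(nn.Module):
    """Convolution whose output is modulated by a learned sigmoid gate (GLU-style)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        # Element-wise product of the linear path and its sigmoid gate.
        return self.bn(self.conv(x) * torch.sigmoid(self.gate(x)))


class AttentionPooling(nn.Module):
    """Aggregates frame-level class probabilities into clip-level tags."""

    def __init__(self, feat_dim, n_classes):
        super().__init__()
        self.cla = nn.Linear(feat_dim, n_classes)  # frame-level classifier
        self.att = nn.Linear(feat_dim, n_classes)  # frame-level attention scores

    def forward(self, x):
        # x: (batch, time, feat_dim)
        frame_prob = torch.sigmoid(self.cla(x))      # (B, T, C) frame predictions
        att = torch.softmax(self.att(x), dim=1)      # attention normalised over time
        clip_prob = (frame_prob * att).sum(dim=1)    # (B, C) clip-level tag probabilities
        return clip_prob, frame_prob                 # frame_prob gives a rough localization


if __name__ == "__main__":
    # Toy log-mel spectrogram batch: (batch, channel, time, mel bins).
    x = torch.randn(2, 1, 240, 64)
    feats = GatedConvBlock(1, 32)(x)                 # (2, 32, 240, 64)
    feats = feats.mean(dim=3).transpose(1, 2)        # average over mel bins -> (2, 240, 32)
    clip_prob, frame_prob = AttentionPooling(32, 17)(feats)
    print(clip_prob.shape, frame_prob.shape)         # (2, 17) and (2, 240, 17)
```

The element-wise gate lets the network emphasise time-frequency regions that carry the tagged events, and weighting frame-level predictions by learned attention over time is one reading of what the truncated abstract calls temporal attention-based localization.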

Cited by 181 publications (191 citation statements).
References 13 publications (13 reference statements).
“…Weakly supervised methods are also a common kind of algorithm for AED tasks [27], [28], [29]. Usually, it is time-consuming and laborious to accurately annotate the onset and offset of one acoustic event.…”
Section: B. Weakly Supervised Event Detection (mentioning)
confidence: 99%
“…Most recent advances in polyphonic SED are largely attributed to the use of Machine Learning and Deep Neural Networks [8,9,10,11,12,13]. In particular, the use of Convolutional Recurrent Neural Networks (CRNNs) has significantly improved SED performance in the past few years [14,15,16,17]. However, there are three main disadvantages with current CRNN-based polyphonic SED approaches.…”
Section: Related Work (mentioning)
confidence: 99%
“…These micro-averaged F1 scores can most directly be compared to the outcomes reported between parentheses in Table 1, as their computation is based on the same data [12]. Table 2 (F1 scores of prior audio classification models): fusion of gated convolutional recurrent networks [21], 55.6%; capsule-based gated convolutional network [9], 58.6%.…”
Section: 13 (mentioning)
confidence: 99%
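For context on the metric used in that comparison: micro-averaged F1 pools true positives, false positives and false negatives across all tags before computing precision and recall. A minimal sketch with made-up multi-label predictions (not data from the cited papers):

```python
# Toy illustration of micro-averaged F1 for multi-label audio tagging;
# the labels below are invented and not taken from the cited papers.
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1],   # rows: clips, columns: tags
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

# TP, FP and FN are pooled over all tags before precision and recall are formed:
# here TP = 3, FP = 0, FN = 2, so precision = 1.0, recall = 0.6, F1 = 0.75.
print(f1_score(y_true, y_pred, average="micro"))
```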