Multiple Instance Deep Learning for Weakly Supervised Small-Footprint Audio Event Detection

Tseng, Shao-Yen; Li, Juncheng; Wang, Yun; Metze, Florian; Szurley, Joseph; Das, Samarjit

doi:10.21437/interspeech.2018-1120

Cited by 16 publications

(11 citation statements)

References 12 publications

(14 reference statements)


“…The official DCASE2017 baseline is given in [4] by using a multilayer perceptron (MLP) classifier, denoted as "DCASE2017 Baseline". The MIL-NN is a multiple instance learning based neural network system proposed in [53]. The CNN-ensemble system is proposed by [16] and ranked the 1st in the SED subtask in Task 4 of the DCASE 2017 challenge.…”

Section: G Automatic Thresholds Optimization (mentioning)

confidence: 99%

Sound Event Detection of Weakly Labelled Data With CNN-Transformer and Automatic Threshold Optimization

Kong, Wang, et al. 2020

IEEE/ACM Trans. Audio Speech Lang. Process.


Sound event detection (SED) is the task of detecting sound events in an audio recording. One challenge of the SED task is that many datasets, such as the Detection and Classification of Acoustic Scenes and Events (DCASE) datasets, are weakly labelled. That is, there are only audio tags for each audio clip, without the onset and offset times of sound events. We compare segment-wise and clip-wise training for SED, a comparison that is lacking in previous work. We propose a convolutional neural network transformer (CNN-Transformer) for audio tagging and SED, and show that the CNN-Transformer performs similarly to a convolutional recurrent neural network (CRNN). Another challenge of SED is that thresholds are required for detecting sound events. Previous works set thresholds empirically, which is not an optimal approach. To solve this problem, we propose an automatic threshold optimization method. The first stage is to optimize the system with respect to metrics that do not depend on thresholds, such as mean average precision (mAP). The second stage is to optimize the thresholds with respect to metrics that depend on those thresholds. Our proposed automatic threshold optimization system achieves a state-of-the-art audio tagging F1 of 0.646, outperforming the 0.629 obtained without threshold optimization, and a sound event detection F1 of 0.584, outperforming the 0.564 obtained without threshold optimization.
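The second stage described in this abstract, choosing thresholds to maximize a threshold-dependent metric, can be illustrated with a minimal sketch. It assumes a plain grid search over candidate thresholds for a single class; this is not necessarily the optimizer the cited paper uses. The function names (`f1_score`, `best_threshold`) and the toy labels and scores are hypothetical and chosen only for illustration.

```python
def f1_score(y_true, y_pred):
    """Binary F1 computed from true/false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, scores, candidates):
    """Return the first candidate threshold with the highest F1, and that F1."""
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(y_true, preds)
        if f1 > best_f1:  # strict comparison keeps the lowest best threshold
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical clip-level labels and model scores for one sound class.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.42, 0.33, 0.1, 0.47, 0.2]

# Baseline: a fixed, empirically chosen threshold of 0.5.
fixed_f1 = f1_score(y_true, [1 if s >= 0.5 else 0 for s in scores])

# Searched threshold on a grid from 0.05 to 0.95 in steps of 0.05.
candidates = [i / 20 for i in range(1, 20)]
threshold, tuned_f1 = best_threshold(y_true, scores, candidates)
print(f"fixed F1={fixed_f1:.3f}, tuned F1={tuned_f1:.3f} at threshold={threshold:.2f}")
```

In a multi-class setting the same search would be repeated per class, and the thresholds would be tuned on held-out validation data rather than on the evaluation set.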



“…Multiple instance learning (MIL) for the purposes of classifying coarsely labeled audio has been primarily studied for tasks such as audio event detection [19][20][21]. These approaches have been formulated as multi-class event detection using audio data labeled at coarse segments (≥ 10 s).…”

Section: Multiple Instance Learning (mentioning)

confidence: 99%

Deep multiple instance learning for foreground speech localization in ambient audio from wearable devices

Hebbar, Papadopoulos, Reyes, et al. 2021

J AUDIO SPEECH MUSIC PROC.


In recent years, machine learning techniques have been employed to produce state-of-the-art results in several audio-related tasks. The success of these approaches has been largely due to access to large open-source datasets and increased computational resources. However, a shortcoming of these methods is that they often fail to generalize well to tasks from real-life scenarios, due to domain mismatch. One such task is foreground speech detection from wearable audio devices. Several interfering factors, such as dynamically varying environmental conditions, including background speakers, TV, or radio audio, make foreground speech detection a challenging task. Moreover, obtaining precise moment-to-moment annotations of audio streams for analysis and model training is time-consuming and costly. In this work, we use multiple instance learning (MIL) to facilitate development of such models using annotations available at a lower time resolution (coarsely labeled). We show how MIL can be applied to localize foreground speech in coarsely labeled audio and show both bag-level and instance-level results. We also study different pooling methods and how they can be adapted to densely distributed events as observed in our application. Finally, we show improvements using speech activity detection embeddings as features for foreground detection.
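The relation between instance-level and bag-level predictions mentioned in this abstract can be sketched as follows. The example assumes each coarsely labeled clip (bag) is split into fixed-length segments (instances) with a per-segment probability, and shows two common pooling choices, max and mean. These are generic MIL pooling functions, not necessarily the ones the cited paper evaluates; the function names and the toy probabilities are hypothetical.

```python
def max_pool(instance_probs):
    """Bag is positive if its most confident instance is positive."""
    return max(instance_probs)


def mean_pool(instance_probs):
    """Bag score is the average instance score; suits densely distributed events."""
    return sum(instance_probs) / len(instance_probs)


def localize(instance_probs, threshold=0.5):
    """Instance-level output: indices of segments predicted as positive."""
    return [i for i, p in enumerate(instance_probs) if p >= threshold]


# Hypothetical per-segment probabilities for one coarsely labeled clip.
bag = [0.1, 0.2, 0.9, 0.8, 0.1]

bag_score_max = max_pool(bag)    # driven by the single strongest segment
bag_score_mean = mean_pool(bag)  # diluted by the inactive segments
active_segments = localize(bag)  # segments at or above the 0.5 threshold
print(bag_score_max, bag_score_mean, active_segments)
```

Max pooling propagates the training signal through a single instance, whereas mean pooling spreads it across all instances, which is one reason the choice of pooling matters when events occupy a large fraction of the clip.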


“…While most approaches have implicitly associated the clip-level labels with every segment in it, some like Yu et al [11], Feng et al [3] and Tseng et al [9] have viewed a clip as a set of instances, where each instance is a fixed image/audio segment and approached the problem as a multi-instance, multi-labeled (MIML) problem. However, this treatment did not yield the best reported results.…”

Section: Related Work (mentioning)

confidence: 99%

Time Aggregation Operators for Multi-label Audio Event Detection

et al. 2018


In this paper, we present a state-of-the-art system for audio event detection. The labels on the training (and evaluation) data specify the set of events occurring in each audio clip, but neither the time spans nor the order in which they occur. Specifically, our task of weakly supervised learning is the "Detection and Classification of Acoustic Scenes and Events (DCASE) 2017" challenge [5]. We use the winning entry in this challenge given by Xu et al. [10] as our starting point and identify several important modifications that allow us to improve on their results significantly. Our techniques pertain to aggregation and consolidation over time and frequency signals over a (temporal) sequence before decoding the labels. In general, our work is also relevant to other tasks involving learning from weak labeling of sequential data.
