Interspeech 2018
DOI: 10.21437/interspeech.2018-1120

Multiple Instance Deep Learning for Weakly Supervised Small-Footprint Audio Event Detection

Abstract: State-of-the-art audio event detection (AED) systems rely on supervised learning using strongly labeled data. However, this dependence severely limits scalability to large-scale datasets where fine-resolution annotations are too expensive to obtain. In this paper, we propose a small-footprint multiple instance learning (MIL) framework for multi-class AED using weakly annotated labels. The proposed MIL framework uses audio embeddings extracted from a pre-trained convolutional neural network as input features. W…
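To make the weak-label setting concrete, the following is a minimal sketch of multiple instance learning over pre-computed audio embeddings. It is illustrative only. The abstract is truncated here and does not state which instance scorer or pooling rule the paper uses, so the linear scorer, the max pooling over instances, and all names (`mil_forward`, `weak_label_loss`, `bag`, `weights`, `bias`) are assumptions. Each clip is treated as a bag of segment embeddings, and only the clip-level label is used in the loss.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def mil_forward(bag, weights, bias):
    """Score every instance, then pool to a clip-level prediction.

    bag:     (T, D) array, one embedding per audio segment (instance).
    weights: (D, C) array, hypothetical linear instance scorer.
    bias:    (C,) array.
    Returns (instance_probs of shape (T, C), bag_probs of shape (C,)).
    """
    instance_probs = sigmoid(bag @ weights + bias)
    # Assumption: max pooling, i.e. a clip contains an event if at
    # least one of its segments does. Other pooling rules are possible.
    bag_probs = instance_probs.max(axis=0)
    return instance_probs, bag_probs


def weak_label_loss(bag_probs, clip_labels, eps=1e-7):
    """Binary cross-entropy between pooled scores and clip-level labels."""
    p = np.clip(bag_probs, eps, 1.0 - eps)
    losses = clip_labels * np.log(p) + (1.0 - clip_labels) * np.log(1.0 - p)
    return float(-np.mean(losses))


# Toy clip: 3 segments, 2-dimensional embeddings, 2 event classes.
bag = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, -2.0]])
weights = np.eye(2)
bias = np.zeros(2)

instance_probs, bag_probs = mil_forward(bag, weights, bias)
loss = weak_label_loss(bag_probs, np.array([1.0, 0.0]))
```

Because the loss only sees `bag_probs`, no segment-level annotation is needed for training, while `instance_probs` still gives a per-segment score that can be thresholded to localize events in time.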

Citations: cited by 16 publications (11 citation statements)
References: 12 publications (14 reference statements)
“…The official DCASE2017 baseline is give in [4] by using a multilayer perceptron (MLP) classifier, denoted as "DCASE2017 Baseline". The MIL-NN is a multiple instance learning based neural network system proposed in [53]. The CNN-ensemble system is proposed by [16] and ranked the 1st in the SED subtask in Task 4 of the DCASE 2017 challenge.…”
Section: G Automatic Thresholds Optimization
Citation type: mentioning
confidence: 99%
“…Multiple instance learning (MIL) for the purposes of classifying coarsely labeled audio has been primarily studied for tasks such as audio event detection [19][20][21]. These approaches have been formulated as multi-class event detection using audio data labeled at coarse segments (≥ 10 s).…”
Section: Multiple Instance Learning
Citation type: mentioning
confidence: 99%
“…While most approaches have implicitly associated the clip-level labels with every segment in it, some like Yu et al. [11], Feng et al. [3] and Tseng et al. [9] have viewed a clip as a set of instances, where each instance is a fixed image/audio segment and approached the problem as a multi-instance, multi-labeled (MIML) problem. However, this treatment did not yield the best reported results.…”
Section: Related Work
Citation type: mentioning
confidence: 99%