Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence 2018
DOI: 10.24963/ijcai.2018/463

Learning to Recognize Transient Sound Events using Attentional Supervision

Abstract: Making sense of the surrounding context and ongoing events through not only visual inputs but also acoustic cues is critical for various AI applications. This paper presents an attempt to learn a neural network model that recognizes more than 500 different sound events from the audio part of user-generated videos (UGV). Aside from the large number of categories and the diverse recording conditions found in UGV, the task is challenging because a sound event may occur only for a short period of time…

Cited by 31 publications (32 citation statements), published between 2019 and 2023; references 10 publications.
“…Our system closely matches the system of Yu et al. [23], and outperforms all other systems by a large margin.

    System       Train size   mAP     AUC     d'
    [27, 9]      1M           0.314   0.959   2.452
    Kumar [28]   22k          0.213   0.927   -
    Shah [18]    22k          0.229   0.927   -
    Wu [29]      22k          -       0.927   -
    Kong [22]    2M           0.327   0.965   2.558
    Yu [23]      2M           0.360   0.970   2.660
    Chen [30]    600k         0.316   -       -
    Chou [31]    1M           0.327   0.951   -

[23] uses multi-level attention: attention layers are built upon multiple hidden layers, whose outputs are concatenated and further processed by a fully connected layer to yield a recording-level prediction. No frame-level predictions at all are made in this process.…”
Section: TALNet: Joint Tagging and Localization on AudioSet
confidence: 99%
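For readers unfamiliar with the multi-level attention design described in the quote above, here is a minimal sketch: attention pooling modules are attached to several hidden layers, the pooled vectors are concatenated, and a final fully connected layer yields the recording-level prediction. The two-block embedding network and the layer sizes are illustrative assumptions, not the cited architecture.

```python
import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    """Pools a (batch, time, dim) sequence into (batch, dim) with learned attention."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, h):
        w = torch.softmax(self.score(h), dim=1)   # (batch, time, 1) weights over time
        return (w * h).sum(dim=1)                 # (batch, dim) pooled embedding


class MultiLevelAttention(nn.Module):
    def __init__(self, in_dim=64, hidden=128, n_classes=527):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.pool1 = AttentionPool(hidden)
        self.pool2 = AttentionPool(hidden)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                         # x: (batch, time, in_dim) features
        h1 = self.block1(x)
        h2 = self.block2(h1)
        # Attention pooling on each hidden level; the pooled outputs are
        # concatenated and mapped to a recording-level prediction. No
        # frame-level prediction is ever emitted, matching the quote above.
        z = torch.cat([self.pool1(h1), self.pool2(h2)], dim=-1)
        return torch.sigmoid(self.out(z))         # (batch, n_classes)
```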
“…Attention neural networks have been proposed for AudioSet tagging in [15, 16]. Later, a clip-level and segment-level model with attention supervision was proposed in [36].…”
Section: Audio Tagging with Weakly Labelled Data
confidence: 99%
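The quote names the clip-level and segment-level attention-supervised model of [36] without detail. The following is a hedged sketch of what such a joint objective could look like; the fixed-window segmentation, mean pooling of frame probabilities within a segment, and equal loss weighting are illustrative assumptions, not the published recipe.

```python
import torch
import torch.nn.functional as F


def joint_loss(frame_prob, att, clip_y, seg_y, seg_len=40):
    """frame_prob, att: (batch, time, C) frame probabilities and attention scores;
    clip_y: (batch, C) clip-level labels; seg_y: (batch, time // seg_len, C)
    segment-level labels. Assumes time is divisible by seg_len."""
    a = torch.softmax(att, dim=1)                          # attention over time
    clip_p = (a * frame_prob).sum(dim=1)                   # clip-level prediction
    b, t, c = frame_prob.shape
    seg_p = frame_prob.reshape(b, t // seg_len, seg_len, c).mean(dim=2)
    # Supervise both granularities: clips as usual, plus segments.
    return (F.binary_cross_entropy(clip_p, clip_y)
            + F.binary_cross_entropy(seg_p, seg_y))
```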
“…Strongly supervised attention loss: Existing attention models for weakly labeled AEC are usually trained by minimizing the loss between clip-level labels, y ∈ {0, 1}^C, and the clip-level predictions [12], [13]. The attention matrix learned in this process will focus on the most relevant and discriminative parts of the audio clip for prediction.…”
Section: Strongly Supervised Attention Model
confidence: 99%
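As a concrete illustration of the clip-level-only training the quote describes, here is a minimal sketch: frame-level probabilities are pooled into a clip-level prediction by per-class learned attention, and only the clip-level binary cross-entropy against y ∈ {0, 1}^C is minimized. The single linear classifier, linear attention head, and feature dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn


class AttentiveTagger(nn.Module):
    def __init__(self, dim=128, n_classes=527):
        super().__init__()
        self.cls = nn.Linear(dim, n_classes)   # frame-level class scores
        self.att = nn.Linear(dim, n_classes)   # per-class attention scores

    def forward(self, h):                      # h: (batch, time, dim) embeddings
        p = torch.sigmoid(self.cls(h))         # frame-level probabilities
        a = torch.softmax(self.att(h), dim=1)  # attention over time, per class
        return (a * p).sum(dim=1)              # clip-level prediction (batch, C)


model = AttentiveTagger()
h = torch.randn(8, 240, 128)                   # stand-in frame embeddings
y = torch.randint(0, 2, (8, 527)).float()      # clip-level multi-hot labels
loss = nn.functional.binary_cross_entropy(model(h), y)
loss.backward()                                # only clip-level supervision flows back
```

Because the only gradient signal comes from the clip-level loss, the attention weights are free to concentrate on whichever frames are most discriminative, which is exactly the behavior the quoted passage attributes to this family of models.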
“…Recently, the attention scheme has been applied to weakly labeled AEC. The attention mechanism helps a model focus on the subsections of audio which contribute to the classification while ignoring irrelevant instances such as background noise [11], [12].…”
Section: Introduction
confidence: 99%