Proceedings of the 24th ACM International Conference on Multimedia 2016
DOI: 10.1145/2964284.2964292
Event Localization in Music Auto-tagging

Cited by 33 publications (35 citation statements)
References 21 publications
“…We use the output of the Conv5 layer as the output, S, of a sound model. [37]. There are three scales of input feature maps, and each of them has their own stack of early convolutions (Conv1 and Conv2).…”
Section: Methods
confidence: 99%
“…Schlüter utilized saliency maps to iteratively train a model that can recognize singing voices at the frame level [36]. Liu et al. applied FCNs to a general music auto-tagging problem so that the model detects various music-related properties at the frame level, including genres, instruments, vocals, etc. [37]. In this work, we also utilize an FCN model to derive the frame-level instrument sound predictions.…”
Section: B Audio Classification
confidence: 99%
“…• Since the network is supposed to handle variable length speech signals, we opt for a fully-convolutional architecture [17]. Following [4,18], we use "1D convolutional" [19] layers rather than 2D convolutional layers, to add flexibility of using recurrent layers in conjunction with the convolutional layers.…”
Section: Network Architecture
confidence: 99%
“…In our model, the convolution layers use 1D convolutions, namely doing convolutions along the temporal axis [6], [7]. The output tensor of a 1D convolution layer takes the shape (channels, temporal points).…”
Section: A Separation Model
confidence: 99%
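The excerpts above all rely on the same mechanism: a 1D convolution applied along the temporal axis produces an output of shape (channels, temporal points) and, because no fixed-size dense layer is involved, accepts inputs of any length. The following is a minimal NumPy sketch of that idea, not the implementation from any of the cited papers; the function name `conv1d` and all shapes are illustrative assumptions.

```python
import numpy as np

def conv1d(x, w):
    """Valid 1D convolution along the temporal axis (illustrative sketch).

    x: input features of shape (in_channels, T)
    w: kernels of shape (out_channels, in_channels, k)
    returns: output of shape (out_channels, T - k + 1)
    """
    out_ch, in_ch, k = w.shape
    T = x.shape[1]
    y = np.zeros((out_ch, T - k + 1))
    for t in range(T - k + 1):
        # Each output frame is a dot product over all input channels
        # and a window of k temporal points.
        y[:, t] = np.tensordot(w, x[:, t:t + k], axes=([1, 2], [0, 1]))
    return y

w = np.random.randn(32, 16, 5)          # 32 filters over 16 input channels, width 5
y = conv1d(np.random.randn(16, 100), w) # -> shape (32, 96)

# The same kernels apply unchanged to a different-length input,
# which is why fully-convolutional models handle variable-length signals.
y2 = conv1d(np.random.randn(16, 57), w) # -> shape (32, 53)
```

Stacking such layers (with nonlinearities) and reading out a sigmoid per output frame yields the frame-level tag or instrument predictions described in the citation statements.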