2017
DOI: 10.1109/taslp.2016.2632307
Abstract: Identifying musical instruments in polyphonic music recordings is a challenging but important problem in the field of music information retrieval. It enables music search by instrument, helps recognize musical genres, or can make music transcription easier and more accurate. In this paper, we present a convolutional neural network framework for predominant instrument recognition in real-world polyphonic music. We train our network from fixed-length music excerpts with a single-labeled predominant instrument…



Cited by 145 publications (99 citation statements)
References 22 publications (24 reference statements)
“…BN is applied to accelerate training and stabilize the internal covariate shift for every convolution layer and the fc-feature layer [61]. Also, global spatial pooling is adopted as the last pooling layer of the cascading convolution blocks, which is known to effectively summarize the spatial dimensions in both the image [22] and music [20] domains. We also applied this approach to keep the fc-feature layer from having a huge number of parameters.…”
Section: Base Architecture (mentioning)
confidence: 99%
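The architecture described in this statement (batch normalization after every convolution layer and after the fc-feature layer, with global spatial pooling closing the cascading convolution blocks) can be sketched as below. This is an illustrative PyTorch reconstruction under stated assumptions, not the cited authors' exact network: the channel counts, kernel sizes, and class count are assumptions.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolution -> BatchNorm -> ReLU, one block per convolution layer."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)   # BN to accelerate/stabilize training
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class BaseArchitectureSketch(nn.Module):
    def __init__(self, n_classes=11):      # class count is an assumption
        super().__init__()
        # Cascading convolution blocks (channel sizes are illustrative).
        self.features = nn.Sequential(
            ConvBlock(1, 32), nn.MaxPool2d(2),
            ConvBlock(32, 64), nn.MaxPool2d(2),
            ConvBlock(64, 128),
        )
        # Global spatial pooling summarizes the remaining time/frequency axes,
        # so the fc-feature layer stays small in parameter count.
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        self.fc_feature = nn.Sequential(
            nn.Linear(128, 256),
            nn.BatchNorm1d(256),            # BN on the fc-feature layer as well
            nn.ReLU(inplace=True),
        )
        self.classifier = nn.Linear(256, n_classes)

    def forward(self, x):                   # x: (batch, 1, n_mels, n_frames)
        h = self.global_pool(self.features(x)).flatten(1)
        return self.classifier(self.fc_feature(h))
```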
“…For this purpose, we use the dB-scale mel-scale magnitude spectrum of an input audio fragment, extracted by applying 128-band mel-filter banks to the Short-Time Fourier Transform (STFT). Mel spectrograms have generally been a popular input representation for CNNs applied to music-related tasks [16,17,20,26,41,64]; moreover, it was recently reported that their psycho-acoustically motivated frequency-domain summarization is efficient and not easily learnable through data-driven approaches [65,66]. We choose a 1024-sample window size and a 256-sample hop size, translating to about 46 ms and 11.6 ms, respectively, at a sampling rate of 22 kHz.…”
Section: Audio Preprocessing (mentioning)
confidence: 99%
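A minimal sketch of this preprocessing step, using librosa with the parameters stated above (128 mel bands, 1024-sample window, 256-sample hop, ~22 kHz sampling rate). The function name, file path, and the dB reference are illustrative assumptions, not details from the cited work.

```python
import librosa
import numpy as np

def melspectrogram_db(path, sr=22050, n_fft=1024, hop_length=256, n_mels=128):
    """dB-scale 128-band mel spectrogram with a 1024-sample window and a
    256-sample hop (~46 ms and ~11.6 ms at 22 kHz)."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return librosa.power_to_db(mel, ref=np.max)  # log-magnitude (dB) scaling

# Example (hypothetical file): spec = melspectrogram_db("example.wav")
```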
“…We may also predict the bounding boxes by using bounding-box annotations, which are easier to collect than pixel-level annotations. The PASCAL VOC datasets are commonly used for bounding-box-based object detection [54]; they contain images of 20 classes with bounding-box information, but these classes do not contain instruments either. Nevertheless, we can train an object detection model using the ImageNet-Instrument data we collected from the ImageNet website, which has been used to evaluate the object model as described in Section IV-C3.…”
Section: Appendix A: Possible Alternative Models for the Object Model (mentioning)
confidence: 99%
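One way such a bounding-box detector could be set up is by fine-tuning an off-the-shelf model on the collected instrument boxes. The sketch below uses torchvision's Faster R-CNN fine-tuning pattern; it is an assumed alternative setup for illustration, not the pipeline used in the cited work, and the class count is hypothetical.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_instrument_detector(num_instrument_classes):
    # Start from a detector pretrained on COCO (which does not cover instruments).
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    # Replace the box-classification head so it predicts the instrument classes
    # plus one background class; training then needs only bounding-box
    # annotations, not pixel-level masks.
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(
        in_features, num_instrument_classes + 1
    )
    return model
```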
“…Because of their ability to extract robust spectral-temporal structures from different audio signals [10], convolutional neural networks (CNNs) have been used successfully to learn useful features in many audio processing applications, such as speech recognition [11], speech enhancement [12], audio tagging [13], and many music-related applications [14,15,16]. Convolutional denoising autoencoders (CDAEs) are a special type of CNN that can be used to discover robust, localized, low-dimensional patterns that repeat themselves over the input [17,18].…”
Section: Introduction (mentioning)
confidence: 99%
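A minimal sketch of a convolutional denoising autoencoder of the kind referenced above: it is trained to reconstruct a clean spectrogram patch from a corrupted input. The layer sizes, patch shape, and additive-noise corruption are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class ConvDenoisingAutoencoder(nn.Module):
    """Encode a (noisy) spectrogram patch into localized low-dimensional
    feature maps, then decode back toward the clean input."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),
        )

    def forward(self, x_noisy):
        return self.decoder(self.encoder(x_noisy))

# Denoising training step (sketch): corrupt the input, reconstruct the clean target.
model = ConvDenoisingAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
clean = torch.rand(8, 1, 128, 128)             # dummy spectrogram patches
noisy = clean + 0.1 * torch.randn_like(clean)  # assumed additive-noise corruption
loss = nn.functional.mse_loss(model(noisy), clean)
loss.backward()
opt.step()
```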