Multi-Level and Multi-Scale Feature Aggregation Using Pretrained Convolutional Neural Networks for Music Auto-Tagging

Lee, Jongpil; Nam, Juhan

doi:10.1109/lsp.2017.2713830

Cited by 112 publications

(81 citation statements)

References 11 publications

“…Systems using raw audio as input to DNN have been proposed in various domains, such as speech recognition [6], music classification, and audio tagging [7]. For example, raw audio has been exploited as features in speech recognition [6] and also for music auto-tagging [7].…”

Section: Related Work (mentioning)

confidence: 99%

“…For example, raw audio has been exploited as features in speech recognition [6] and also for music auto-tagging [7]. More specifically, the concept of a 1D strided convolution layer that specially designs the first convolution layer for raw audio signals has been proposed [7].…”

Section: Related Work (mentioning)

confidence: 99%

“…Pipeline process employed by the proposed d-vector based speaker verification using raw-audio CNN system. The speaker identifier DNN architecture exploited in this paper primarily comprises convolution layers. Strided convolution, proposed in [7], is used as the first hidden layer to process raw audio signals. Strided convolution has a short filter size of three, and the stride is also three.…”

Section: Raw Audio Convolutional Neural Network In D-vector Based Spe... (mentioning)

confidence: 99%
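
The strided convolution described in the statement above (filter size three, stride three) can be illustrated with a minimal sketch. This is not the cited authors' implementation: the function name `strided_conv1d`, the toy waveform, and the filter weights are all assumptions chosen for illustration. It only shows that when the stride equals the filter size, the filter sees non-overlapping frames of the raw signal.

```python
def strided_conv1d(signal, kernel, stride):
    """Valid 1D cross-correlation of `signal` with `kernel`, hopping by `stride`.

    Hypothetical helper for illustration only; no padding is applied.
    """
    k = len(kernel)
    out = []
    for start in range(0, len(signal) - k + 1, stride):
        window = signal[start:start + k]
        out.append(sum(w * x for w, x in zip(kernel, window)))
    return out


# Toy 9-sample "waveform" (assumed data, not from the cited papers).
waveform = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
# Filter size 3; the weights here are illustrative, a trained layer learns them.
kernel = [1.0, 1.0, 1.0]

# Stride 3 with filter size 3: each output summarizes one non-overlapping frame.
frames = strided_conv1d(waveform, kernel, stride=3)
print(frames)
```

With these settings the output is three times shorter than the input, which is why such a layer is a cheap first step for long raw-audio sequences.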

“…As regards the latter type of extension, conventionally handcrafted acoustic features such as mel-frequency cepstral coefficient (MFCC) and mel-filterbank are being used to train speaker identifier DNN and to extract d-vectors in d-vector based speaker verification systems. Alternatively, in some systems, spectrograms and even raw audio are being used as input to the neural network for speech recognition [6] and music tagging [7]. This paper expands on the above systems and proposes a d-vector based speaker verification system in which raw waveform signals are used as input to the speaker identifier DNN.…”

Section: Introduction (mentioning)

confidence: 99%

D-vector based speaker verification system using Raw Waveform CNN

Jung, Heo, Yang et al. 2018

Proceedings of the 2017 International Seminar on Artificial Intelligence, Networking and Information Technology (ANIT 2017)

“…Deep learning improves these results further, resulting in state-of-the-art performance. Convolutional recurrent neural networks, working on a low-level representation of sounds, have been used for learning features that would be useful in classification task [17,18]. While deep learning in itself performs very well, it creates new opportunities for the use of older machine learning methods.…”

Section: Introduction (mentioning)

confidence: 99%

Similarity-Based Summarization of Music Files for Support Vector Machines

Jakubik, Kwaśnicka 2018

Complexity

Automatic retrieval of music information is an active area of research in which problems such as automatically assigning genres or descriptors of emotional content to music emerge. Recent advancements in the area rely on the use of deep learning, which allows researchers to operate on a low-level description of the music. Deep neural network architectures can learn to build feature representations that summarize music files from the data itself, rather than from expert knowledge. In this paper, a novel approach to applying feature learning in combination with support vector machines (SVMs) to musical data is presented. A spectrogram of the music file, which is too complex to be processed by an SVM, is first reduced to a compact representation by a recurrent neural network. An adjustment to the loss function of the network is proposed so that the network learns to build a representation space that replicates a certain notion of similarity between annotations, rather than to explicitly make predictions. We evaluate the approach on five datasets, focusing on emotion recognition and complementing it with genre classification. In experiments, the proposed loss function adjustment is shown to improve results in classification and regression tasks, but only when the learned similarity notion corresponds to a kernel function employed within the SVM. These results suggest that adjusting deep learning methods to build data representations that target a specific classifier or regressor can open up new perspectives for the use of standard machine learning methods in the music domain.

Environmental Sound Classification Based on Multi-temporal Resolution Convolutional Neural Network Combining with Multi-level Features

Zhu, Wang et al. 2018

Advances in Multimedia Information Processing – PCM 2018

Motivated by the fact that characteristics of different sound classes are highly diverse in different temporal scales and hierarchical levels, a novel deep convolutional neural network (CNN) architecture is proposed for the environmental sound classification task. This network architecture takes raw waveforms as input, and a set of separate parallel CNNs with different convolutional filter sizes and strides is used to learn feature representations with multi-temporal resolutions. In addition, the proposed architecture aggregates hierarchical features from multi-level CNN layers for classification using direct connections between convolutional layers, which goes beyond the typical single-level CNN features employed by the majority of previous studies. This network architecture also improves the flow of information and avoids the vanishing gradient problem. The combination of multi-level features boosts the classification performance significantly. Comparative experiments are conducted on two datasets: the environmental sound classification dataset (ESC-50) and the DCASE 2017 audio scene classification dataset. Results demonstrate that the proposed method is highly effective in the classification tasks by employing multi-temporal resolution and multi-level features, and it outperforms the previous methods which only account for single-level features.
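
The idea of parallel branches with different filter sizes and strides, as summarized in the abstract above, can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the architecture from the paper: the helper names, the single-filter branches, the global max pooling step, and the example signal are all hypothetical, and a real network would stack many learned filters and layers per branch.

```python
def conv1d(signal, kernel, stride):
    """Valid 1D cross-correlation with the given stride (no padding)."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, signal[i:i + k]))
        for i in range(0, len(signal) - k + 1, stride)
    ]


def multi_resolution_features(signal, branches):
    """Run one filter per branch and pool each output to a single value.

    `branches` is a list of (kernel, stride) pairs. Global max pooling is an
    assumption made here so that branches with different output lengths can
    be concatenated into one fixed-size feature vector.
    """
    return [max(conv1d(signal, kernel, stride)) for kernel, stride in branches]


# Assumed toy waveform: a short oscillation.
signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]

branches = [
    ([1.0, -1.0], 1),               # fine temporal resolution: short filter, small hop
    ([0.25, 0.25, 0.25, 0.25], 2),  # coarse temporal resolution: longer filter, larger hop
]

features = multi_resolution_features(signal, branches)
print(features)
```

The short difference filter responds to the sample-to-sample changes, while the longer averaging filter smooths them away, so the two branches expose different temporal scales of the same input.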
