One deep music representation to rule them all? A comparative analysis of different representation learning strategies

Kim, Jaehun; Urbano, Julián; Liem, Cynthia C. S.; Hanjalic, Alan

doi:10.1007/s00521-019-04076-1

Cited by 41 publications

(29 citation statements)

References 55 publications

(76 reference statements)

“…Few related works exist in the audio field - and every randomly weighted neural network we found in the audio literature was a mere baseline [2,7,24]. Inspired by previous computer vision works, we study which audio architectures work the best via evaluating how nontrained CNNs perform as feature extractors.…”

Section: Motivation - From Previous Work (mentioning)

confidence: 99%

Randomly Weighted CNNs for (Music) Audio Classification

Pons, Serra, 2019

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

The computer vision literature shows that randomly weighted neural networks perform reasonably as feature extractors. Following this idea, we study how non-trained (randomly weighted) convolutional neural networks perform as feature extractors for (music) audio classification tasks. We use features extracted from the embeddings of deep architectures as input to a classifier - with the goal to compare classification accuracies when using different randomly weighted architectures. By following this methodology, we run a comprehensive evaluation of the current deep architectures for audio classification, and provide evidence that the architectures alone are an important piece for resolving (music) audio problems using deep neural networks.

“…A similar approach is followed in Section 3. Recent attempts to learn audio embeddings and similarity functions with neural networks has shown promising results [7]. Video Analysis Video analytics software makes surveillance systems more efficient, by reducing the workload on security and management authorities.…”

Section: Related Work (mentioning)

confidence: 99%

Large Scale Audio-Visual Video Analytics Platform for Forensic Investigations of Terroristic Attacks

Strobel, Boyer, Lindley, et al., 2018

MultiMedia Modeling

The forensic investigation of a terrorist attack poses a huge challenge to the investigative authorities, as several thousand hours of video footage need to be spotted. To assist law enforcement agencies (LEA) in identifying suspects and securing evidence, we present a platform which fuses information of surveillance cameras and video uploads from eyewitnesses. The platform integrates analytical modules for different input-modalities on a scalable architecture. Videos are analyzed according to their acoustic and visual content. Specifically, Audio Event Detection is applied to index the content according to attack-specific acoustic concepts. Audio similarity search is utilized to identify similar video sequences recorded from different perspectives. Visual object detection and tracking are used to index the content according to relevant concepts. The heterogeneous results of the analytical modules are fused into a distributed index of visual and acoustic concepts to facilitate rapid start of investigations, following traits and investigating witness reports.

“…Considering that lower layers of DCNNs usually capture lower-level features such as edges from images or spectrograms, we hypothesized that sharing lower layers among the various DCNNs can be effective under the scenario where multiple learning sources are available. With this approach, one can expect that it not only ensures sufficient specialization on task-specific upper layers, but also benefits from regularization effects on lower layers [14]. Joint learning of multiple tasks with shared layers can prevent the shared layer to overfit for a specific task, instead learning underlying factors that have commonalities required across tasks [6,19].…”

Section: Shared Architecture (mentioning)

confidence: 99%

“…To overcome these potential problems, we therefore apply a label pre-processing step, obtaining Artist Group Factors (AGF) as learning targets, rather than individual artist identities. Finally, we train Deep Convolutional Neural Networks (DCNNs) employing different learning setups, ranging from targeting genre and various types of AGFs with individual networks, to employing a shared architecture as introduced in multiple previous Multi-Task Learning (MTL) works [2,3,6,14,16,18,24,25].…”

Section: Introduction (mentioning)

confidence: 99%

Transfer Learning of Artist Group Factors to Musical Genre Classification

Kim, Won, Serra, et al., 2018

Companion of the the Web Conference 2018 on the Web Conference 2018 - WWW '18

Self Cite

The automated recognition of music genres from audio information is a challenging problem, as genre labels are subjective and noisy. Artist labels are less subjective and less noisy, while certain artists may relate more strongly to certain genres. At the same time, at prediction time, it is not guaranteed that artist labels are available for a given audio segment. Therefore, in this work, we propose to apply the transfer learning framework, learning artist-related information which will be used at inference time for genre classification. We consider different types of artist-related information, expressed through artist group factors, which will allow for more efficient learning and stronger robustness to potential label noise. Furthermore, we investigate how to achieve the highest validation accuracy on the given FMA dataset, by experimenting with various kinds of transfer methods, including single-task transfer, multi-task transfer and finally multi-task learning.
