Towards time-varying music auto-tagging based on CAL500 expansion

Wang, Shuo-Yang; Wang, Ju-Chiang; Yang, Yi-Hsuan; Wang, Hsin‐Min

doi:10.1109/icme.2014.6890290

Cited by 19 publications

(11 citation statements)

References 26 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The CAL500 dataset includes 500 popular songs from Western countries with semantic labels derived from human listeners. One CNN-based music emotion classification [ 32 ] method for the CAL500 dataset, as well as its enriched version (CAL500exp) [ 33 ], was used for the classification of 18 emotion tags in the dataset. Recently, after music source separation [ 34 ] and attention [ 35 ], individual music sources were also applied to improve prediction of emotions in music by using a spectral representation of audio as the input.…”

Section: Related Workmentioning

confidence: 99%

Deep-Learning-Based Multimodal Emotion Classification for Music Videos

Pandeya

Bhattarai

Lee

2021

Sensors

View full text Add to dashboard Cite

Music videos contain a great deal of visual and acoustic information. Each information source within a music video influences the emotions conveyed through the audio and video, suggesting that only a multimodal approach is capable of achieving efficient affective computing. This paper presents an affective computing system that relies on music, video, and facial expression cues, making it useful for emotional analysis. We applied the audio–video information exchange and boosting methods to regularize the training process and reduced the computational costs by using a separable convolution strategy. In sum, our empirical findings are as follows: (1) Multimodal representations efficiently capture all acoustic and visual emotional clues included in each music video, (2) the computational cost of each neural network is significantly reduced by factorizing the standard 2D/3D convolution into separate channels and spatiotemporal interactions, and (3) information-sharing methods incorporated into multimodal representations are helpful in guiding individual information flow and boosting overall performance. We tested our findings across several unimodal and multimodal networks against various evaluation metrics and visual analyzers. Our best classifier attained 74% accuracy, an f1-score of 0.73, and an area under the curve score of 0.926.

show abstract

Section: Related Workmentioning

confidence: 99%

Deep-Learning-Based Multimodal Emotion Classification for Music Videos

Pandeya

Bhattarai

Lee

2021

Sensors

View full text Add to dashboard Cite

show abstract

“…CAL500exp is an enriched version of the well-known CAL500. Wang et al [16] published the dataset. Labels of CAL500exp are annotated in the segment level instead of track level in CAL500.…”

Section: Datasetmentioning

confidence: 99%

“…Results on CAL500exp published. The first line of numerical parts shows the result of [16]. We list both of our results at the following two lines on CAL500exp.…”

Section: Model Structurementioning

confidence: 99%

“…We list both of our results at the following two lines on CAL500exp. Description P R F [16] Macro average 0.455 0.759 0.561 CNN Macro average 0.603 0.614 0.596 CNN Micro average 0.686 0.735 0.709…”

Section: Model Structurementioning

confidence: 99%

“…In Section 2 we detail describe the spectrogram for specific data and the structures of our deep neural network. In Section 3 we discuss the experiments on the dataset CAL500 [15] and CAL500exp [16] and compare the results with state-of-the-arts algorithms. Finally, some conclusions and future work are drawn in Section 4.…”

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

CNN based music emotion classification

Liu,

Chen,

et al. 2017

Preprint

View full text Add to dashboard Cite

Music emotion recognition (MER) is usually regarded as a multi-label tagging task, and each segment of music can inspire specific emotion tags. Most researchers extract acoustic features from music and explore the relations between these features and their corresponding emotion tags. Considering the inconsistency of emotions inspired by the same music segment for human beings, seeking for the key acoustic features that really affect on emotions is really a challenging task. In this paper, we propose a novel MER method by using deep convolutional neural network (CNN) on the music spectrograms that contains both the original time and frequency domain information. By the proposed method, no additional effort on extracting specific features required, which is left to the training procedure of the CNN model. Experiments are conducted on the standard CAL500 and CAL500exp dataset. Results show that, for both datasets, the proposed method outperforms state-of-the-art methods.

show abstract