Proceedings of the 22nd ACM International Conference on Multimedia 2014
DOI: 10.1145/2647868.2654904
Music Emotion Recognition by Multi-label Multi-layer Multi-instance Multi-view Learning

Cited by 62 publications (25 citation statements)
References 15 publications
“…Most related to our work are the papers [38]–[45] that proposed the application of MIL for capturing the time ambiguity of pain [38]–[42], affective music response [43], behavioural expressions [44] and vocal interaction [45]. As discussed below, the main differences with our work lie in the (i) multiple instance algorithms we propose, (ii) the nature of the employed predictors (e.g.…”
Section: Related Work (mentioning)
confidence: 99%
“…MIL was used in [43] to automatically recognise the affective content of a piece of music using a generative approach based on a hierarchical Bayesian model. Each song is associated with a bag, and the temporal audio segments form the corresponding instances.…”
Section: Related Work (mentioning)
confidence: 99%
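The bag/instance framing in this excerpt can be made concrete with a short sketch. The segment length, the helper names (`song_to_bag`, `bag_score`), and the max-pooling aggregation below are illustrative assumptions only; the cited paper's hierarchical Bayesian generative model is not reproduced here.

```python
import numpy as np

def song_to_bag(waveform, sr, segment_seconds=3.0):
    """Split a mono waveform into fixed-length temporal segments.
    In MIL terms, the whole song is the bag and each segment is an instance.
    The 3-second segment length is an illustrative choice, not a value from the paper."""
    seg_len = int(segment_seconds * sr)
    n_segments = len(waveform) // seg_len
    return [waveform[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]

def bag_score(instance_scores):
    """Aggregate instance-level emotion scores into a bag-level score.
    Max pooling reflects the standard MIL assumption that a bag is positive
    if at least one of its instances is; the cited work instead infers this
    with a generative model."""
    return float(np.max(instance_scores))
```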
“…We then employ Librosa to extract widely-used acoustic features, such as Mel-Frequency Cepstral Coefficients (MFCC) [46], Zero Crossing Rate [47], etc. Finally, we obtain a 512-dimensional feature vector from each audio clip.…”
Section: B. Features in Acoustic Modality (mentioning)
confidence: 99%
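For readers unfamiliar with Librosa, here is a minimal sketch of extracting MFCC and zero-crossing-rate features from a clip and pooling them into a clip-level vector. The function name, the mean/std pooling, and the number of MFCC coefficients are assumptions for illustration; the exact 512-dimensional pipeline of the citing work is not specified in the excerpt and is not reproduced.

```python
import numpy as np
import librosa

def clip_features(path, n_mfcc=20):
    """Load an audio clip, extract frame-level MFCC and zero-crossing-rate
    features with Librosa, then mean/std-pool them into one clip-level vector.
    Pooling scheme and n_mfcc are illustrative choices, not the cited setup."""
    y, sr = librosa.load(path, mono=True)                     # default sr=22050
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # shape (n_mfcc, frames)
    zcr = librosa.feature.zero_crossing_rate(y)               # shape (1, frames)
    return np.concatenate([
        mfcc.mean(axis=1), mfcc.std(axis=1),
        zcr.mean(axis=1), zcr.std(axis=1),
    ])                                                        # shape (2*n_mfcc + 2,)

# Hypothetical usage:
# vec = clip_features("some_clip.wav")
```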
“…Each of us tried to describe the excerpt in a single one-word adjective, and the 14 words we used most frequently were selected as the 14 categories. It turns out that about half of the 14 words were used in our previous related studies [79]–[96] and most of the others were used in studies by other researchers [9,20,21,35,47]. All the categories included in the 4-quadrant model in Figure 1 appear in Figure 4 except Angry.…”
Section: Second Test: Best Word From 14 Categories (mentioning)
confidence: 99%