2020 | DOI: 10.1007/s11042-020-08836-3
Deep learning-based late fusion of multimodal information for emotion classification of music video

Abstract: Affective computing is an emerging area of research that aims to enable intelligent systems to recognize, feel, infer, and interpret human emotions. Widely available online and offline music videos are a rich source for human emotion analysis because they integrate the composer's internal feelings through song lyrics, musical instrument performance, and visual expression. In general, the metadata that music video customers use to choose a product includes high-level semantics such as emotion, so that automati…

Cited by 113 publications (64 citation statements) | References 50 publications
“…A number of strategies have been proposed to combine the learning from multiple representations [24, 45, 55, 56]. Broadly, the methods can be categorized as early-fusion, mid-fusion, and late-fusion [57, 58, 59, 60]. These refer to the classification stage at which the information is combined: combining the inputs to the CNN in early-fusion, combining the weights of the middle layers of the CNN in mid-fusion, and combining the CNN outputs in late-fusion.…”
Section: Literature Review (mentioning)
confidence: 99%
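The late-fusion idea described in this statement can be made concrete with a short sketch. The following is a minimal illustration, assuming PyTorch; the branch architectures (`SmallCNN`), the class `LateFusion`, and the input shapes are hypothetical placeholders, not the architectures of the cited works. The defining property of late fusion is that each unimodal network classifies on its own, and only the output probabilities are combined.

```python
# Minimal decision-level (late) fusion sketch in PyTorch.
# Assumptions: two unimodal CNN branches (e.g., audio spectrogram + video frame),
# six emotion classes, and a simple average of the branch softmax outputs.
import torch
import torch.nn as nn


class SmallCNN(nn.Module):
    """Toy unimodal branch: conv features -> global pooling -> class logits."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # (N, 16, 1, 1) regardless of input size
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, x):
        h = self.features(x).flatten(1)  # (N, 16)
        return self.classifier(h)        # unimodal logits


class LateFusion(nn.Module):
    """Combines only the branch outputs: the hallmark of late fusion."""

    def __init__(self, audio_net: nn.Module, video_net: nn.Module):
        super().__init__()
        self.audio_net = audio_net
        self.video_net = video_net

    def forward(self, audio_x, video_x):
        p_audio = torch.softmax(self.audio_net(audio_x), dim=1)
        p_video = torch.softmax(self.video_net(video_x), dim=1)
        # Simple average of per-branch probabilities; weighted sums,
        # products, or a small meta-classifier are common variants.
        return (p_audio + p_video) / 2


# Usage: 1-channel spectrograms for audio, 3-channel RGB frames for video.
model = LateFusion(SmallCNN(1, 6), SmallCNN(3, 6))
probs = model(torch.randn(4, 1, 64, 64), torch.randn(4, 3, 64, 64))
print(probs.shape)  # torch.Size([4, 6])
```

By contrast, an early-fusion variant would concatenate the raw inputs before the first convolution, and a mid-fusion variant would merge intermediate feature maps; only the combination point changes, not the overall pipeline.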
“…At this time, the music label is essential to the quality of music retrieval. In addition to music retrieval, many recommendation and subscription scenarios also require music category information to provide users with more accurate content [4, 5].…”
Section: Introduction (mentioning)
confidence: 99%
“…This article seeks to enhance and improve a supervised music video dataset [16]. The dataset includes diversified music video samples in six emotional categories and is used in various unimodal and multimodal architectures to analyze music, video, and facial expressions.…”
Section: Introduction (mentioning)
confidence: 99%
“…We conducted an ablation study on unimodal and multimodal architectures from scratch by using a variety of convolution filters. The major contributions of this study are listed below: We extended and improved an existing music video dataset [16] and provided emotional annotation by using multiple annotators of diversified cultures. A detailed description of the dataset and statistical information is provided in Section 3.…”
Section: Introduction (mentioning)
confidence: 99%