Visual and textual sentiment analysis using deep fusion convolutional neural networks

Chen, Xingyue; Wang, Yunhong; Liu, Qingjie

doi:10.1109/icip.2017.8296543

Cited by 30 publications

(19 citation statements)

References 15 publications

(21 reference statements)


“…Remarkable results have been achieved in [74-79], where ensembles of handcrafted features were worked out from images and combined with information provided by text analysis. Those approaches were subsequently outperformed by frameworks that integrated CNNs for extracting features from visual content [12, 78, 80-85].…”

Section: Sentiment Analysis: Other Applications (mentioning)

confidence: 99%

A Survey on Deep Learning in Image Polarity Detection: Balancing Generalization Performances and Computational Costs

et al. 2019


Deep convolutional neural networks (CNNs) provide an effective tool to extract complex information from images. In the area of image polarity detection, CNNs are customarily utilized in combination with transfer learning techniques to tackle a major problem: the unavailability of large sets of labeled data. Thus, polarity predictors in general exploit a pre-trained CNN as the feature extractor that in turn feeds a classification unit. While the latter unit is trained from scratch, the pre-trained CNN is subject to fine-tuning. As a result, the specific CNN architecture employed as the feature extractor strongly affects the overall performance of the model. This paper analyses state-of-the-art literature on image polarity detection and identifies the most reliable CNN architectures. Moreover, the paper provides an experimental protocol that should allow assessing the role played by the baseline architecture in the polarity detection task. Performance is evaluated in terms of both generalization abilities and computational complexity. The latter attribute becomes critical as polarity predictors, in the era of social networks, might need to be updated within hours or even minutes. In this regard, the paper gives practical hints on the advantages and disadvantages of the examined architectures both in terms of generalization and computational cost.


“…We extract µ_z as the target vector v_t (i.e., v_t := µ_z). Hence, a unified representation of a rap song, which involves both prosodic information and semantic information, can be generated by repeating lines 4-10 with the returned hyper-parameters in line 24.…”

Section: In Conclusion the Loss Function Of The VAE Network Is Formulated As (mentioning)

confidence: 99%

“…Dataset. Following [24], we extract the dominant parts (i.e., verses) of rap songs and obtain 16,697 verses in total. The verses are divided into lines to obtain a dataset of 810,567 lines.…”

Section: Nextline Prediction Task (mentioning)

confidence: 99%

“…
- EndRhyme [24], which considers the number of matching vowel phonemes at the end of candidate line c_i and s_κ;
- rhyme2vec, our novel rhyme embedding method, as described in Section 3.2;
- NN5 [24], a character-level neural network for rap line encoding, which takes five previous lines as the query (i.e., {s_(κ−i)} for i = 0, ..., 4);
- doc2vec [30], a popular sentence embedding method, which handles {s_κ; c_i} as a unified paragraph;
- DopeLearning [24], a state-of-the-art rap lyric representation learning method, which concatenates a series of statistical characteristics, including the features of EndRhyme, EndRhyme-1 (number of matching vowel phonemes at the end of c_i and s_(κ−1)), Other-Rhyme (average number of matching vowel phonemes per word), LineLength (line similarity of c_i and s_κ), BOW (Jaccard similarity between the corresponding bags of words of c_i and s_κ), BOW5 (Jaccard similarity between the corresponding bags of words of five previous lines and s_κ), LSA (latent semantic analysis similarity of c_i and s_κ), and NN5 (confidence value generated from the last softmax layer);
- early fusion [6], a widely used multi-modal aggregation method, which concatenates all of the features as a unified representation (i.e., v_t := [v_r, v_s]);
- EF-AE, a variant of HAVAE, which adopts the same learning manipulations as that of HAVAE, but bypasses the sampling strategy and renders [v_r, v_s] as the input of the network; and
- EF-VAE, another variant of HAVAE, which renders [v_r, v_s] as the input of the VAE network instead of the INPUT stage.…”

Section: Nextline Prediction Task (mentioning)

confidence: 99%


A general framework for learning prosodic-enhanced representation of rap lyrics

Liang, Wang, et al. 2019

World Wide Web


Learning and analyzing rap lyrics is a significant basis for many web applications, such as music recommendation, automatic music categorization, and music information retrieval, due to the abundant source of digital music on the World Wide Web. Although numerous studies have explored the topic, knowledge in this field is far from satisfactory, because critical issues, such as prosodic information and its effective representation, as well as appropriate integration of various features, are usually ignored. In this paper, we propose a hierarchical attention variational autoencoder framework (HAVAE), which simultaneously considers semantic and prosodic features for rap lyrics representation learning. Specifically, the representation of the prosodic features is encoded by phonetic transcriptions with a novel and effective strategy (i.e., rhyme2vec). Moreover, a feature aggregation strategy is proposed to appropriately integrate various features and generate prosodic-enhanced representation. A comprehensive empirical evaluation demonstrates that the proposed framework outperforms the state-of-the-art approaches under various metrics in different rap lyrics learning tasks.
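Two of the baselines named in the citation statements for this paper, the BOW feature (Jaccard similarity between bags of words) and early fusion (concatenating feature vectors), are simple enough to sketch. The following is a minimal illustration only, not the authors' code: the function names, the whitespace tokenization, the use of word sets rather than multisets, and the handling of empty lines are all assumptions made here.

```python
def jaccard_similarity(line_a: str, line_b: str) -> float:
    """Jaccard similarity between the word sets of two lyric lines.

    Assumption: words are lowercased and split on whitespace, and each
    "bag" is treated as a set (duplicates ignored).
    """
    words_a = set(line_a.lower().split())
    words_b = set(line_b.lower().split())
    if not words_a and not words_b:
        return 0.0  # assumption: two empty lines count as dissimilar
    return len(words_a & words_b) / len(words_a | words_b)


def early_fusion(v_r: list, v_s: list) -> list:
    """Early fusion: concatenate a rhyme vector v_r and a semantic vector v_s."""
    return list(v_r) + list(v_s)


if __name__ == "__main__":
    # Two of four distinct words are shared between the lines.
    print(jaccard_similarity("the cat sat", "the cat ran"))
    # The fused vector keeps v_r first, then v_s.
    print(early_fusion([0.1, 0.2], [0.3]))
```

Concatenation keeps every input feature but leaves any weighting between the modalities to the downstream model.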


“…Multimodal emotion processing has emerged out as a significant research trend over the last few years. Humans reflect various emotions during their communication via visual, textual, and other modalities [2]. Combining complementary information from images and texts could increase emotion recognition accuracy and help the machines become empathetic [3].…”

Section: Introductionmentioning

confidence: 99%

Hybrid Fusion Based Approach for Multimodal Emotion Recognition with Insufficient Labeled Data

Kumar, Khokher, Gupta, et al. 2021

2021 IEEE International Conference on Image Processing (ICIP)


In this paper, a deep learning based fusion approach has been proposed to classify the emotions portrayed by an image and its corresponding text into discrete emotion classes. The proposed method first implements intermediate fusion on image and text inputs and then applies late fusion on the image, text, and intermediate fusion outputs. We have also come up with a way to handle the unavailability of labeled multimodal emotional data. We have prepared a new dataset, built on the Balanced Twitter for Sentiment Analysis (B-T4SA) dataset, containing images, text, and emotion labels, i.e., 'happy,' 'sad,' 'hate' and 'anger.' An emotion recognition accuracy of 90.20% has been achieved by the proposed method. Along with multi-class emotion recognition, we've also compared the sentiment classification results and found the proposed method to perform better than the benchmark approaches.
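The late fusion step described in this abstract combines the outputs of the image, text, and intermediate-fusion branches. The abstract does not say which combination rule is used, so the sketch below assumes plain averaging of per-class probabilities, one common late fusion rule. The function names and example probabilities are hypothetical; only the four emotion labels come from the abstract.

```python
LABELS = ["happy", "sad", "hate", "anger"]  # emotion classes named in the abstract


def late_fusion(prob_vectors: list) -> list:
    """Average per-class probabilities across branches.

    Assumption: simple averaging; the paper's actual fusion rule is not
    given in the abstract.
    """
    if not prob_vectors:
        raise ValueError("need at least one branch output")
    n_classes = len(prob_vectors[0])
    if any(len(p) != n_classes for p in prob_vectors):
        raise ValueError("all branches must score the same number of classes")
    n_branches = len(prob_vectors)
    return [sum(p[c] for p in prob_vectors) / n_branches for c in range(n_classes)]


def predict(prob_vectors: list) -> str:
    """Return the label with the highest fused probability."""
    fused = late_fusion(prob_vectors)
    best = max(range(len(fused)), key=fused.__getitem__)
    return LABELS[best]


if __name__ == "__main__":
    image_probs = [0.7, 0.1, 0.1, 0.1]         # hypothetical image-branch output
    text_probs = [0.2, 0.6, 0.1, 0.1]          # hypothetical text-branch output
    intermediate_probs = [0.6, 0.2, 0.1, 0.1]  # hypothetical intermediate-fusion output
    print(predict([image_probs, text_probs, intermediate_probs]))
```

In this example the text branch alone would pick 'sad', but the fused scores favor 'happy' because the other two branches agree on it.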
