2022 26th International Conference on Pattern Recognition (ICPR)
DOI: 10.1109/icpr56361.2022.9956589
Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss

Cited by 11 publications (10 citation statements)
References 61 publications
“…Motivated by the successful application of CL in unsupervised learning (Oord, Li, and Vinyals 2018; He et al. 2020), Supervised Contrastive Learning (SCL) (Khosla et al. 2020) is devised to promote a series of supervised tasks. Recently, CL has been applied to multi-modal tasks to strengthen the interaction between features of different modalities (Zheng et al. 2022; Franceschini et al. 2022; Zolfaghari et al. 2021). However, there has been no exploration of contrastive learning on multi-modal tasks in the multi-label scenario.…”
Section: Related Work
confidence: 99%
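The cross-modal contrastive idea referred to above can be illustrated with a minimal InfoNCE-style loss between the embeddings of two modalities. This is only an illustrative sketch, not the paper's exact formulation; the function name `pairwise_contrastive_loss` and the `temperature` value are assumptions made here for demonstration:

```python
import numpy as np

def pairwise_contrastive_loss(za, zb, temperature=0.1):
    """InfoNCE-style loss between two modalities (illustrative sketch).

    za, zb: (N, D) arrays of paired embeddings; row i of za and zb come
    from the same sample. Matching rows act as positives, every other
    row in the batch as a negative.
    """
    # L2-normalize so dot products become cosine similarities.
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature              # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (the matched pairs) as targets.
    return -np.mean(np.diag(log_probs))
```

Minimizing such a term pulls paired embeddings from the two modalities together while pushing mismatched pairs apart; with more than two modalities, one such term per modality pair yields the modality-pairwise scheme the paper's title refers to.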
“…In the last two decades, several researchers proposed models for automatic emotion recognition from nonverbal cues such as voice activity [1], [2], [3], body motions [4], [5], [6], touch [7], as well as their combinations [8], [9]. However, the most often considered indicators of emotional states are facial expressions [10], [11], [12], [13].…”
Section: Introduction
confidence: 99%
“…The success of FER predominantly relies on the supervised learning paradigm, in which data annotation is expensive and laborious. Importantly, obtaining highly reliable emotion labels is tough [8], since the perception of emotional expressions depends on several factors such as gender and culture [31]. There exist a few attempts to perform unsupervised learning: Xiao et al. [32] apply Restricted Boltzmann Machines (RBMs), and Yu et al. [33] use a Cycle Generative Adversarial Network (CycleGAN), for this purpose.…”
Section: Introduction
confidence: 99%