A convolution-transformer dual branch network for head-pose and occlusion facial expression recognition

Liang, Xingcan; Xu, Li; Zhang, Wenxiang; Zhang, Yan; Liu, Jinfu; Liu, Zhipeng

doi:10.1007/s00371-022-02413-5

Cited by 25 publications

(13 citation statements)

References 46 publications

“…Fan et al. [33] modeled a hierarchical scale network (HSNet), in which the scale information of facial expression images was enhanced by a dilation convolution block. In [34], a dual-branch network was projected with one branch using CNN to capture local marginal information and the other applying a visual transformer to obtain compact global representation. Wang et al. constructed an architecture similar to U-Net as an attention branch to highlight subtle local facial expression information [35].…”

Section: Related Work (mentioning)

confidence: 99%

Facial Expression Recognition: One Attention-Modulated Contextual Spatial Information Network

Zhu

Zhou

2022

Entropy

Facial expression recognition (FER) in the wild is a challenging task due to uncontrolled factors such as occlusion, illumination, and pose variation. Current methods perform well in controlled conditions. However, two issues remain for the in-the-wild FER task: (i) insufficient description of long-range dependencies among expression features in the facial information space and (ii) insufficient refinement of subtle inter-class distinctions among multiple expressions in the wild. To overcome these issues, this paper presents an end-to-end model for FER, named attention-modulated contextual spatial information network (ACSI-Net), which embeds coordinate attention (CA) modules into a contextual convolutional residual network (CoResNet). First, CoResNet is built by arranging contextual convolution (CoConv) blocks of different levels to integrate facial expression features with long-range dependencies, which generates a holistic representation of spatial information on facial expression. Then, the CA modules are inserted into different stages of CoResNet; at each stage, the subtle information about facial expression acquired from CoConv blocks is first modulated by the corresponding CA module across channels and spatial locations and then flows into the next layer. Finally, to highlight facial regions related to expression, a CA module located at the end of the whole network produces attentional masks that are multiplied by the input feature maps to focus on salient regions. Unlike other models, ACSI-Net is capable of exploring intrinsic dependencies between features and yielding a discriminative representation for facial expression classification. Extensive experimental results on the AffectNet and RAF-DB datasets demonstrate its effectiveness and competitiveness compared to other FER methods.

“…Facial expression recognition is a hot topic in computer vision, with a wide range of applications including human behaviour analysis, mental disorder identification, and human-computer interaction, to name a few. Most recent research [1], [2], and [3][4][5][6][7] has concentrated on developing deep ANNs to achieve cutting-edge outcomes. Even though handcrafted feature-based artificial neural network models [8] and [9] provide results that are less accurate than deep learning networks, they have attracted less attention.…”

Section: Introduction (mentioning)

confidence: 99%

“…Various methods are employed to identify emotions based on face traits. This manuscript examines many recent investigations into the automatic data-driven technique [3][4][5][6][7] and the handcrafted approach [1][2] to facial emotion recognition. In the most difficult real-world dataset, FER-2013, these approaches have computationally complex solutions that give good accuracy while training and testing on the same datasets.…”

Section: Introduction (mentioning)

confidence: 99%

“…On the FER-2013 Challenge dataset [13], the FERG dataset [15], and the CK+ dataset [16], we compare our proposed models with recent and relevant state-of-the-art approaches [5][6][7][17]. Deep learning has recently been used to extract and train numerous features for a good FER system, notably convolutional neural networks (CNNs) [3][4][5][6][7]. However, many of the signals for facial emotions come from a few parts of the face, such as the eyes and lips, while other parts of the face, such as the hairs and ears, play minor roles in the output.…”

Section: Introduction (mentioning)

confidence: 99%

Enhancing Feature Extraction Technique Through Spatial Deep Learning Model for Facial Emotion Detection

Khan

Singh

Agrawal

2023

AETiC

Automatic facial expression analysis is a fascinating and difficult subject that has implications in a wide range of fields, including human–computer interaction and data-driven approaches. Based on face traits, a variety of techniques are employed to identify emotions. This article examines various recent explorations into automatic data-driven approaches and handcrafted approaches for recognising face emotions. These approaches offer computationally complex solutions that provide good accuracy when training and testing are conducted on the same datasets, but they perform less well on the most difficult realistic dataset, FER-2013. The article's goal is to present a robust model with lower computational complexity that can predict emotion classes more accurately than current methods and aid society in finding a realistic, all-encompassing solution for the facial expression system. A crucial step in good facial expression identification is extracting appropriate features from the face images. In this paper, we examine how well-known deep learning techniques perform when it comes to facial expression recognition and propose a convolutional neural network-based enhanced version of a spatial deep learning model for the most relevant feature extraction with less computational complexity. That gives a significant improvement on the most challenging dataset, FER-2013, which has the problems of occlusions, scale, and illumination variations, resulting in the best feature extraction and classification and maximizing the accuracy, i.e., 74.92%. It also maximizes the correct prediction of emotions at 99.47%, and 98.5% for a large number of samples on the CK+ and FERG datasets, respectively. It is capable of focusing on the major features of the face and achieving greater accuracy over previous fashions.

“…To alleviate the effectiveness of occlusions and head-pose variants, Liang et al. [29] proposed a robust convolution-transformer dual branch network (CT-DBN) which can model local and global facial information for FER. However, these models only use single-modal information.…”

Section: Introduction (mentioning)

confidence: 99%

A joint hierarchical cross‐attention graph convolutional network for multi‐modal facial expression recognition

Xu

Du

Wang et al.

2023

Computational Intelligence

Emotional recognition in conversations (ERC) is increasingly being applied in various IoT devices. Deep learning‐based multimodal ERC has achieved great success by leveraging diverse and complementary modalities. Although most existing methods try to adopt attention mechanisms to fuse different information, these methods ignore the complementarity between modalities. To this end, the joint cross‐attention model is introduced to alleviate this issue. However, multi‐scale feature information on different modalities is not utilized. Moreover, the context relationship plays an important role in feature extraction in the expression recognition task. In this paper, we propose a novel joint hierarchical graph convolution network (JHGCN) which exploits different layer features and context relationships for facial expression recognition based on audio‐visual (A‐V) information. Specifically, we adopt different deep networks to extract features from different modalities individually. For V modality, we construct V graph data based on patch embeddings which are extracted from the transformer encoder. Moreover, we embed the graph convolution which can leverage the intra‐modality relationships with the transformer encoder. Then, the deep feature from different layers is fed to the hierarchical fusion module to enhance feature representation. At last, we use the joint cross‐attention mechanism to exploit the complementary inter‐modality relationships. To validate the proposed model, we have conducted various experiments on the AffWild2 and CMU‐MOSI datasets. All results confirm that our proposed model achieves highly promising performance compared to the joint cross‐attention model and other methods.
