2023
DOI: 10.1109/access.2023.3237817

Fine-Tuning Swin Transformer and Multiple Weights Optimality-Seeking for Facial Expression Recognition

Abstract: Facial expression recognition plays a key role in human-computer emotional interaction. However, human faces in real environments are affected by various unfavorable factors that reduce expression recognition accuracy. In this paper, we propose a novel method that combines Fine-tuning Swin Transformer and Multiple Weights Optimality-seeking (FST-MWOS) to enhance expression recognition performance. FST-MWOS mainly consists of two crucial components: Fine-tuning Swin Transformer (FS…
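The abstract only sketches the approach, so the following is a minimal, hypothetical illustration (not the authors' released code) of the two ingredients it names: fine-tuning a pretrained Swin Transformer for expression classification and combining the predictions of several fine-tuned weight sets. The model name, the seven-class head, and the simple weighted-averaging rule are assumptions; the paper's actual optimality-seeking procedure for the weights is not reproduced here.

```python
# Minimal sketch (assumptions, not the paper's code): fine-tune a pretrained
# Swin Transformer for facial expression recognition with timm/PyTorch, then
# combine the softmax outputs of several fine-tuned weight sets.
import torch
import timm

NUM_CLASSES = 7  # basic expression categories (assumption)

def build_fer_swin():
    # Pretrained Swin-Tiny with a fresh classification head for expressions.
    return timm.create_model(
        "swin_tiny_patch4_window7_224", pretrained=True, num_classes=NUM_CLASSES
    )

def fine_tune_step(model, images, labels, optimizer, criterion):
    # One standard fine-tuning step on a batch of face crops.
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def ensemble_predict(models, weights, images):
    # Weighted combination of softmax outputs from several fine-tuned
    # checkpoints; a simple stand-in for weighting multiple weight sets.
    probs = 0.0
    for model, w in zip(models, weights):
        model.eval()
        probs = probs + w * torch.softmax(model(images), dim=1)
    return probs.argmax(dim=1)
```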

Cited by 11 publications (4 citation statements)
References 34 publications (45 reference statements)
“…For these comparisons, we drew from an array of notable works, namely LibreFace [65], SSA-ICL [87], ECAN [88], A-MobileNet [89], DNFER [90], Muhamad et al [91], Xiaoyu et al [92], and NCCTFER [93]. Furthermore, we considered additional works such as FST-MWOS [94] and Sunyoung et al for the FER2013+ dataset. We utilized accuracy as our primary evaluation metric, with the results showcased in Tables 9 and 10.…”
Section: B. Facial Expression Recognition Results on FER2013+ and RAF-…
Citation type: mentioning
confidence: 99%
“…The classifiers are random forest (RF), logistic regression (LR), support vector machine (SVM), CNN, LSTM, and Bi-LSTM. The reason why we chose these ML and DL models is that they presented significant performance in similar NLP and text mining tasks (Malik et al., 2023; Rehan, Malik & Jamjoom, 2023). The following comparable models are designed:…”
Section: Methods
Citation type: mentioning
confidence: 99%
“…In [25], two Transformer network frameworks are designed to extract facial information and motion information from face images, and the features obtained from both are combined for classification. The study in [26] uses multiple Swin Transformers in parallel, obtaining different weights by modifying the hyperparameters so as to capture different facial information and better discriminate facial expressions. In [27], the same image is divided into patches of two different sizes, each of which is fed into a separate Transformer network to extract features, and the information at the two scales is fused using cross-attention to achieve a competitive result.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
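As a companion to the architectures summarized in this excerpt, here is a minimal, hypothetical sketch of the two-scale idea attributed to [27]: the same face image is embedded with two different patch sizes, and a cross-attention layer fuses the resulting token sequences before classification. The embedding dimension, patch sizes, head count, and class count are illustrative assumptions, not the cited paper's implementation.

```python
# Minimal sketch (assumptions, not the cited paper's code): two branches embed
# the same image with different patch sizes; cross-attention lets the coarse
# tokens attend to the fine tokens before a pooled classification head.
import torch
import torch.nn as nn

class TwoScaleCrossAttentionFER(nn.Module):
    def __init__(self, dim=256, num_classes=7):
        super().__init__()
        # Patch embeddings at two scales (e.g., 16x16 and 32x32 patches).
        self.embed_small = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.embed_large = nn.Conv2d(3, dim, kernel_size=32, stride=32)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):
        # Flatten each scale's feature map into a token sequence.
        small = self.embed_small(x).flatten(2).transpose(1, 2)  # (B, Ns, dim)
        large = self.embed_large(x).flatten(2).transpose(1, 2)  # (B, Nl, dim)
        # Coarse tokens query fine tokens, fusing information across scales.
        fused, _ = self.cross_attn(query=large, key=small, value=small)
        return self.head(fused.mean(dim=1))  # pooled logits per expression

# Usage: logits = TwoScaleCrossAttentionFER()(torch.randn(2, 3, 224, 224))
```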