KFSENet: A Key Frame-Based Skeleton Feature Estimation and Action Recognition Network for Improved Robot Vision with Face and Emotion Recognition

Le, Dinh-Son; Phan, Hai-Hong; Ha, Huy Hung; Tran, Van-An; Nguyen, The-Hung; Nguyen, Dinh Quan

doi:10.3390/app12115455

Cited by 8 publications

(9 citation statements)

References 34 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…It is concluded that key-frame extraction is highly significant in processing video data. In general, existing key-frame extraction methods consist of shot boundary detection (Fei et al , 2017; Mehmood et al , 2016), frame image clustering (Wu et al , 2017; Gharbi et al , 2017), motion analysis (Le et al , 2022; Anderson and McOwan, 2006) and visual content analysis (Panagiotakis et al , 2009).…”

Section: Related Workmentioning

confidence: 99%

“…Key-frame extraction aims to extract a set of images from original video, which are expected to be an approximate representation of the visual contents of the entire video (Huang and Wang, 2019). Traditional key-frame extraction methods consist of shot boundary detection (Fei et al , 2017; Mehmood et al , 2016), frame image clustering (Wu et al , 2017; Gharbi et al , 2017), motion analysis (Le et al , 2022; Anderson and McOwan, 2006) and visual content analysis (Panagiotakis et al , 2009). The shot boundary-based methods are simple and computationally efficient, but they can only select a fixed number of images as key-frames without considering the content complexity (Fei et al , 2017; Mehmood et al , 2016).…”

Section: Introductionmentioning

confidence: 99%

“…Clustering-based methods are very popular, but they need to preestablish the number of clusters and cannot preserve the sequential information of key-frames (Wu et al , 2017; Gharbi et al , 2017). The motion-based methods, which are mainly designed based on the calculation of the optical flow, not only ignore the cumulative changes in content but also suffer from high computation and weak robustness (Le et al , 2022; Anderson and McOwan, 2006). The visual content-based methods mainly depend on the changes of content (such as texture and color) to extract discriminative frames (Panagiotakis et al , 2009), which is simple but effective in extracting key-frames (Yong et al , 2013).…”

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Analyzing audiovisual data for understanding user's emotion in human−computer interaction environment

Yang,

Li,

2023

DTA

View full text Add to dashboard Cite

PurposeAlthough numerous signal modalities are available for emotion recognition, audio and visual modalities are the most common and predominant forms for human beings to express their emotional states in daily communication. Therefore, how to achieve automatic and accurate audiovisual emotion recognition is significantly important for developing engaging and empathetic human–computer interaction environment. However, two major challenges exist in the field of audiovisual emotion recognition: (1) how to effectively capture representations of each single modality and eliminate redundant features and (2) how to efficiently integrate information from these two modalities to generate discriminative representations.Design/methodology/approachA novel key-frame extraction-based attention fusion network (KE-AFN) is proposed for audiovisual emotion recognition. KE-AFN attempts to integrate key-frame extraction with multimodal interaction and fusion to enhance audiovisual representations and reduce redundant computation, filling the research gaps of existing approaches. Specifically, the local maximum–based content analysis is designed to extract key-frames from videos for the purpose of eliminating data redundancy. Two modules, including “Multi-head Attention-based Intra-modality Interaction Module” and “Multi-head Attention-based Cross-modality Interaction Module”, are proposed to mine and capture intra- and cross-modality interactions for further reducing data redundancy and producing more powerful multimodal representations.FindingsExtensive experiments on two benchmark datasets (i.e. RAVDESS and CMU-MOSEI) demonstrate the effectiveness and rationality of KE-AFN. Specifically, (1) KE-AFN is superior to state-of-the-art baselines for audiovisual emotion recognition. (2) Exploring the supplementary and complementary information of different modalities can provide more emotional clues for better emotion recognition. (3) The proposed key-frame extraction strategy can enhance the performance by more than 2.79 per cent on accuracy. (4) Both exploring intra- and cross-modality interactions and employing attention-based audiovisual fusion can lead to better prediction performance.Originality/valueThe proposed KE-AFN can support the development of engaging and empathetic human–computer interaction environment.

show abstract

Section: Related Workmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

Analyzing audiovisual data for understanding user's emotion in human−computer interaction environment

Yang,

Li,

2023

DTA

View full text Add to dashboard Cite

show abstract

“…The keyframe is the frame with the highest histogram correlation from a set of consecutive frames. Choosing the number of consecutive frames (shot) affects the keyframe extraction accuracy [74,75], where If the consecutive frames are small sets of frames the variation of frame histograms will be very small to identify one keyframe, and if we use a large set of consecutive frames there could be more than one keyframe in the shot and we only extract one and neglect others. We explore the keyframe extraction method for a variant video clip length: 5 frames per clip and 16 frames per clip.…”

Section: The Image-based Model (R2d-lstm)mentioning

confidence: 99%

Various frameworks for integrating image and video streams for spatiotemporal information learning employing 2D-3D Residual networks for human action recognition

Yosry,

elrefaei,

Ziedan

2023

Preprint

View full text Add to dashboard Cite

Human action recognition has been identified as an important research topic in computer vision because it is an essential form of communication and interplay between computers and humans. To assist computers in automatically recognizing human behaviors and accurately comprehending human intentions. Inspired by some keyframe extraction and multifeatured fusion research, this paper improved the accuracy of action recognition by utilizing keyframe features and fusing them with video features. In this article, we suggest a novel multi-stream approach architecture made up of two distinct models fused using different fusion techniques. The first model combines convolutional neural networks in two dimensions (2D-CNN) with Long-Short Term Memory (LSTM) networks to glean long-term spatial and temporal features from video keyframe images for human action recognition. The second model is a 3-dimensional convolutional neural network (3D-CNN) that gathers quick spatial-temporal features from video clips. Next, we use Early and Late Fusion techniques for the two different models to recognize human action from video. The HMDB-51 and UCF-101 datasets, two important action recognition benchmarks, were used to test our method. When applied to the HMDB-51 dataset and the UCF-101 dataset, the Early-Fusion (EF) strategy had an accuracy of 70.2% and 95.5%, respectively, while the Late-Fusion (LF) strategy had an accuracy of 77.2% and 97.5%, respectively.

show abstract

“…However, emotion classification is still a challenging task. Convolutional neural networks (CNNs) perform face normalization, facial expressions, and emotional classification using real images as their main functions and are frequently adopted and used in computer vision applications [3][4][5][6]. The accuracy of the CNN-based emotion classification system has been improved through pre-or post-processing [2,5,7] and the development of new algorithms in the architecture.…”

Section: Introductionmentioning

confidence: 99%

Facial Emotion Recognition Analysis Based on Age-Biased Data

et al. 2022

View full text Add to dashboard Cite

This paper aims to analyze the importance of age-biased data in recognizing six emotions using facial expressions. For this purpose, a custom dataset (adults, kids, mixed) was constructed using images that separated the existing datasets (FER2013 and MMA FACILE EXPRESSION) into adults (≥14) and kids (≤13). The convolutional Neural Networks (CNN) algorithm was used to calculate emotion recognition accuracy. Additionally, this study investigated the effect of the characteristics of CNN architecture on emotion recognition accuracy. Based on the variables of Accuracy and FLOP, three types of CNN architectures (MobileNet-V2, SE-ResNeXt50 (32 × 4 d), and ResNeXt-101 (64 × 4 d)) were adopted. As for the experimental result, SE-ResNeXt50 (32 × 4 d) showed the highest accuracy at 79.42%, and the model that learned by age obtained 22.24% higher accuracy than the model that did not learn by age. In the results, the difference in expression between adults and kids was greatest for fear and neutral emotions. This study presented valuable results on age-biased learning data and algorithm type effect on emotion recognition accuracy.

show abstract

KFSENet: A Key Frame-Based Skeleton Feature Estimation and Action Recognition Network for Improved Robot Vision with Face and Emotion Recognition

Cited by 8 publications

References 34 publications

Analyzing audiovisual data for understanding user's emotion in human−computer interaction environment

Analyzing audiovisual data for understanding user's emotion in human−computer interaction environment

Various frameworks for integrating image and video streams for spatiotemporal information learning employing 2D-3D Residual networks for human action recognition

Facial Emotion Recognition Analysis Based on Age-Biased Data

Contact Info

Product

Resources

About