2021
DOI: 10.48550/arxiv.2107.03465
Preprint

An audiovisual and contextual approach for categorical and continuous emotion recognition in-the-wild

Abstract: In this work we tackle the task of video-based audiovisual emotion recognition, within the premises of the 2nd Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW). Poor illumination conditions, head/body orientation and low image resolution are factors that can hinder the performance of methodologies that rely solely on the extraction and analysis of facial features. In order to alleviate this problem, we leverage bodily as well as contextual features, as part of a…
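As a reading aid, here is a minimal sketch of the kind of multi-stream design the abstract alludes to: facial, bodily, and contextual features fused into joint categorical and continuous (valence-arousal) emotion heads. The module names, feature dimensions, and concatenation-based fusion are illustrative assumptions, not the authors' actual architecture.

# Minimal sketch (assumed design, not the paper's architecture): fuse face,
# body, and visual-context features for joint categorical + continuous
# emotion prediction.
import torch
import torch.nn as nn

class MultiStreamEmotionHead(nn.Module):
    def __init__(self, face_dim=512, body_dim=256, ctx_dim=256,
                 num_expressions=7):
        super().__init__()
        fused_dim = face_dim + body_dim + ctx_dim
        self.trunk = nn.Sequential(
            nn.Linear(fused_dim, 256), nn.ReLU(), nn.Dropout(0.3))
        # Categorical branch: basic-expression logits.
        self.expr_head = nn.Linear(256, num_expressions)
        # Continuous branch: valence and arousal in [-1, 1].
        self.va_head = nn.Sequential(nn.Linear(256, 2), nn.Tanh())

    def forward(self, face_feat, body_feat, ctx_feat):
        fused = torch.cat([face_feat, body_feat, ctx_feat], dim=-1)
        h = self.trunk(fused)
        return self.expr_head(h), self.va_head(h)

# Example usage with dummy per-frame features.
face = torch.randn(4, 512)
body = torch.randn(4, 256)
context = torch.randn(4, 256)
expr_logits, valence_arousal = MultiStreamEmotionHead()(face, body, context)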

Cited by 2 publications (2 citation statements)
References: 27 publications
“…Since we have trained the model for a single task and have not used any audio or video features, performance is not as good as that of teams using multi-task learning with video features.

Team                             F1      Accuracy  Total
[27]                             0.29    0.6491    0.4082
NTUA-CVSP [28]                   0.3367  0.6418    0.4374
Morphoboid [29]                  0.3511  0.668     0.4556
FLAB2021 [30]                    0.4079  0.6729    0.4953
STAR [31]                        0.4759  0.7321    0.5604
Maybe Next Time [32]             0.6046  0.7289    0.6456
CPIC-DIR2021 [33]                0.6834  0.7709    0.7123
Netease Fuxi Virtual Human [34]  0.763   0.8059    0.7777
Ours [18]                        0.361   0.675     0.4646

Table 3 shows the influence of the number of networks that are collaboratively trained in CCT. It can be observed that the model with 3 networks performs best in the presence of noise.…”
Section: Performance Comparison With State-of-the-art Methods
confidence: 99%
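For context, the combined score in the table above appears to follow the ABAW expression-challenge criterion of 0.67·F1 + 0.33·accuracy; this is an inference from the listed numbers, which the short check below reproduces for a few rows (values copied from the table).

# Assumption check: third column of the table above looks like
# 0.67 * F1 + 0.33 * accuracy (the ABAW expression-challenge criterion).
rows = {
    "NTUA-CVSP [28]": (0.3367, 0.6418, 0.4374),
    "Maybe Next Time [32]": (0.6046, 0.7289, 0.6456),
    "Ours [18]": (0.361, 0.675, 0.4646),
}
for team, (f1, acc, total) in rows.items():
    combined = 0.67 * f1 + 0.33 * acc
    print(f"{team}: 0.67*F1 + 0.33*Acc = {combined:.4f} (reported {total})")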
“…The third ABAW Competition, to be held in conjunction with the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 2022, is a continuation of the first [24] and second [32] ABAW Competitions, held in conjunction with the IEEE Conference on Face and Gesture Recognition (IEEE FG) 2021 and with the International Conference on Computer Vision (ICCV) 2021, respectively, which targeted dimensional (in terms of valence and arousal) [2-4, 8, 9, 11, 21, 35, 39, 47, 48, 50, 54-56], categorical (in terms of the basic expressions) [12, 15, 16, 33, 36, 37, 51], and facial action unit analysis and recognition [7, 19, 20, 25, 26, 40, 44, 47]. The third ABAW Competition contains four Challenges, all based on the same in-the-wild database: (i) the uni-task Valence-Arousal Estimation Challenge; (ii) the uni-task Expression Classification Challenge (for the 6 basic expressions plus the neutral state plus the 'other' category, which denotes expressions/affective states other than the 6 basic ones); (iii) the uni-task Action Unit Detection Challenge (for 12 action units); and (iv) the Multi-Task Learning Challenge (for joint learning and prediction of valence-arousal, 8 expressions (6 basic plus neutral plus 'other'), and 12 action units).…”
Section: Introduction
confidence: 99%
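To make the Multi-Task Learning Challenge's label space concrete, a minimal sketch of a per-frame annotation record follows. The field names and dataclass layout are illustrative assumptions; only the value ranges (valence/arousal in [-1, 1], 8 expression classes, 12 binary action units) come from the passage above.

# Illustrative per-frame label record for the Multi-Task Learning Challenge.
# Field names and layout are assumptions, not the official annotation format.
from dataclasses import dataclass
from typing import List

EXPRESSIONS = ["neutral", "anger", "disgust", "fear",
               "happiness", "sadness", "surprise", "other"]

@dataclass
class MTLFrameLabel:
    valence: float           # continuous affect, in [-1, 1]
    arousal: float           # continuous affect, in [-1, 1]
    expression: int          # index into EXPRESSIONS (8 classes)
    action_units: List[int]  # 12 binary activations (0/1)

label = MTLFrameLabel(valence=0.4, arousal=-0.1, expression=4,
                      action_units=[0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0])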