2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.648
Supervising Neural Attention Models for Video Captioning by Human Gaze Data

Abstract: The attention mechanisms in deep neural networks are inspired by human attention, which sequentially focuses on the most relevant parts of the information over time to generate the prediction output. The attention parameters in those models are implicitly trained in an end-to-end manner, yet there have been few attempts to explicitly incorporate human gaze tracking to supervise the attention models. In this paper, we investigate whether attention models can benefit from explicit human gaze labels, especially for the…
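The abstract is truncated above, so the paper's exact supervision objective is not reproduced here. As a rough illustration only, a minimal sketch of the general idea of supervising an attention model with human gaze, assuming a PyTorch-style soft-attention decoder and a KL-divergence alignment term (the function name, loss choice, and weight are illustrative assumptions, not the authors' formulation), could look like:

import torch
import torch.nn.functional as F

def gaze_supervised_loss(attn_weights, gaze_map, caption_loss, lam=0.1):
    # attn_weights: (batch, regions) softmax attention over spatial regions
    # gaze_map:     (batch, regions) human gaze distribution (rows sum to 1)
    # caption_loss: scalar cross-entropy loss of the caption decoder
    # lam:          illustrative weight for the gaze-supervision term (assumed)
    # Auxiliary term: KL(gaze || attention), penalizing attention that
    # ignores regions fixated by human viewers.
    gaze_term = F.kl_div(attn_weights.clamp_min(1e-8).log(), gaze_map,
                         reduction="batchmean")
    return caption_loss + lam * gaze_term

# Toy usage with random tensors (7x7 = 49 spatial regions)
attn = torch.softmax(torch.randn(4, 49), dim=-1)
gaze = torch.softmax(torch.randn(4, 49), dim=-1)
total = gaze_supervised_loss(attn, gaze, caption_loss=torch.tensor(2.3))
print(total.item())

This is a sketch under the stated assumptions, not the paper's method; the actual formulation should be taken from the full text.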

Cited by 65 publications (36 citation statements) | References 25 publications
“…On one hand, for VQA [4,40] the authors point out that the attention model does not attend to the same regions as humans and that adding attention supervision barely helps the performance. On the other hand, adding supervision to feature map attention [15,38] was found to be beneficial. We noticed in our preliminary experiments that directly guiding the region attention with supervision [16] does not necessarily lead to improvements in automatic sentence metrics.…”
Section: Related Work
confidence: 99%
“…Recently there has been great interest in joint vision-language tasks, e.g. captioning [63,28,9,25,2,70,62,61,73,68,13,47,55,8,32], visual question answering [3,72,41,69,56,66,59,76,77,19,64,24,60], and cross-domain retrieval [6,5,74,36]. These often rely on learned image-text embeddings.…”
Section: Related Work
confidence: 99%
“…RT-BENE has many possible application areas. Within the computer vision community, gaze has recently been used for visual attention estimation [9], saliency estimation [30] and labelling in the context of video captioning [47]. We argue that incorporation of blinks will further improve performance in these tasks, as it is well known from studies with humans that blinks impact task performance in attention and workload estimation [2] and can be used for user modelling [17].…”
Section: Introduction
confidence: 99%