2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.648
Supervising Neural Attention Models for Video Captioning by Human Gaze Data

Abstract: The attention mechanisms in deep neural networks are inspired by human attention, which sequentially focuses on the most relevant parts of the information over time to generate the prediction output. The attention parameters in those models are implicitly trained in an end-to-end manner, yet there have been few attempts to explicitly incorporate human gaze tracking to supervise the attention models. In this paper, we investigate whether attention models can benefit from explicit human gaze labels, especially for the…
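The abstract is truncated above, so the paper's exact supervision objective is not reproduced here. As a rough illustration only, a minimal sketch of the general idea of supervising an attention model with human gaze, assuming a PyTorch-style soft-attention decoder and a KL-divergence alignment term (the function name, loss choice, and weight are illustrative assumptions, not the authors' formulation), could look like:

import torch
import torch.nn.functional as F

def gaze_supervised_loss(attn_weights, gaze_map, caption_loss, lam=0.1):
    # attn_weights: (batch, regions) softmax attention over spatial regions
    # gaze_map:     (batch, regions) human gaze distribution (rows sum to 1)
    # caption_loss: scalar cross-entropy loss of the caption decoder
    # lam:          illustrative weight for the gaze-supervision term (assumed)
    # Auxiliary term: KL(gaze || attention), penalizing attention that
    # ignores regions fixated by human viewers.
    gaze_term = F.kl_div(attn_weights.clamp_min(1e-8).log(), gaze_map,
                         reduction="batchmean")
    return caption_loss + lam * gaze_term

# Toy usage with random tensors (7x7 = 49 spatial regions)
attn = torch.softmax(torch.randn(4, 49), dim=-1)
gaze = torch.softmax(torch.randn(4, 49), dim=-1)
total = gaze_supervised_loss(attn, gaze, caption_loss=torch.tensor(2.3))
print(total.item())

This is a sketch under the stated assumptions, not the paper's method; the actual formulation should be taken from the full text.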

Cited by 65 publications (36 citation statements) | References 25 publications
“…On one hand, for VQA [4,40] the authors point out that the attention model does not attend to the same regions as humans and that adding attention supervision barely helps the performance. On the other hand, adding supervision to feature map attention [15,38] was found to be beneficial. We noticed in our preliminary experiments that directly guiding the region attention with supervision [16] does not necessarily lead to improvements in automatic sentence metrics.…”
Section: Related Work
confidence: 99%
“…Recently there has been great interest in joint vision-language tasks, e.g. captioning [63,28,9,25,2,70,62,61,73,68,13,47,55,8,32], visual question answering [3,72,41,69,56,66,59,76,77,19,64,24,60], and cross-domain retrieval [6,5,74,36]. These often rely on learned image-text embeddings.…”
Section: Related Work
confidence: 99%
“…RT-BENE has many possible application areas. Within the computer vision community, gaze has recently been used for visual attention estimation [9], saliency estimation [30] and labelling in the context of video captioning [47]. We argue that incorporation of blinks will further improve performance in these tasks, as it is well known from studies with humans that blinks impact task performance in attention and workload estimation [2] and can be used for user modelling [17].…”
Section: Introduction
confidence: 99%