2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01249
Connecting What to Say With Where to Look by Modeling Human Attention Traces

Abstract: We introduce a unified framework to jointly model images, text, and human attention traces. Our work is built on top of the recent Localized Narratives annotation framework [30], where each word of a given caption is paired with a mouse trace segment. We propose two novel tasks: (1) predict a trace given an image and caption (i.e., visual grounding), and (2) predict a caption and a trace given only an image. Learning the grounding of each word is challenging, due to noise in the human-provided traces and the pr…

Cited by 20 publications (8 citation statements)
References 35 publications (66 reference statements)
“…Narrative annotation focuses on describing the relationships between entities, and entity relationships are collected during the annotation phase. Attributes, relationships, and entities in the same image are often closely related (29)(30)(31)(32). Localized Narratives (30) connects vision and language by using manually drawn mouse traces to link actions between entities, making the caption content more hierarchical.…”
Section: Narrative Annotation Model
mentioning confidence: 99%
“…In order to generate the word-to-box alignment from the provided gaze trace points, our model divides the trace into several segments, associates each segment with a word, and generates an axis-aligned bounding box from each segment's gaze points. This trace transformation is inspired by the mouse trace analysis of [17]. After that, we extract three kinds of features: visual, captioning, and gaze features.…”
Section: Preprocessing For Model Training
mentioning confidence: 99%
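The excerpt above converts each word's trace segment into an axis-aligned bounding box. Below is a minimal sketch of that word-to-box step, assuming each caption word is already paired with a list of normalized (x, y) trace points as in the Localized Narratives annotations; the function and variable names (trace_segment_to_box, word_box_alignment) are illustrative, not taken from the cited method.

from typing import Dict, List, Tuple

Point = Tuple[float, float]               # normalized (x, y) trace coordinate
Box = Tuple[float, float, float, float]   # axis-aligned (x_min, y_min, x_max, y_max)

def trace_segment_to_box(points: List[Point]) -> Box:
    """Return the axis-aligned bounding box enclosing one word's trace segment."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def word_box_alignment(segments: Dict[str, List[Point]]) -> Dict[str, Box]:
    """Map each caption word to the bounding box of its associated trace segment."""
    return {word: trace_segment_to_box(pts) for word, pts in segments.items() if pts}

# Toy usage: two caption words with short trace segments in normalized coordinates.
segments = {
    "dog":   [(0.12, 0.40), (0.18, 0.45), (0.22, 0.43)],
    "grass": [(0.05, 0.80), (0.60, 0.85), (0.90, 0.78)],
}
print(word_box_alignment(segments))
# {'dog': (0.12, 0.4, 0.22, 0.45), 'grass': (0.05, 0.78, 0.9, 0.85)}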
“…After that, we extract three kinds of features: visual, captioning, and gaze features. To compute visual features, we use a pre-trained Faster R-CNN [17], provided by detectron2 [18], to obtain the visual features of the detected regions. Next, to compute captioning features, we sum the positional embeddings and the word embeddings, following LXMERT as used in the previous method [19].…”
Section: Preprocessing For Model Training
mentioning confidence: 99%
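The excerpt above obtains captioning features by summing positional and word embeddings. The sketch below illustrates only that embedding-sum step in plain PyTorch; the vocabulary size, hidden size, and maximum caption length are placeholder values rather than the ones used by LXMERT or the cited method, and the detectron2 Faster R-CNN region-feature extraction is not shown.

import torch
import torch.nn as nn

class CaptionEmbedder(nn.Module):
    """Sum of word embeddings and positional embeddings, one vector per token."""

    def __init__(self, vocab_size: int = 30522, hidden_size: int = 768, max_len: int = 40):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden_size)  # lookup by token id
        self.pos_emb = nn.Embedding(max_len, hidden_size)      # lookup by position index

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) integer tensor of caption token ids
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        # Positional embeddings broadcast across the batch dimension.
        return self.word_emb(token_ids) + self.pos_emb(positions)  # (batch, seq_len, hidden)

# Toy usage: a batch of two 5-token captions.
embedder = CaptionEmbedder()
token_ids = torch.randint(0, 30522, (2, 5))
print(embedder(token_ids).shape)  # torch.Size([2, 5, 768])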