Proceedings of the Ninth Biennial ACM Symposium on Eye Tracking Research & Applications 2016
DOI: 10.1145/2857491.2857542

Fusing eye movements and observer narratives for expert-driven image-region annotations

Abstract: Human image understanding is reflected by individuals' visual and linguistic behaviors, but the meaningful computational integration and interpretation of their multimodal representations remain a challenge. In this paper, we expand a framework for capturing image-region annotations in dermatology, a domain in which interpreting an image is influenced by experts' visual perception skills, conceptual domain knowledge, and task-oriented goals. Our work explores the hypothesis that eye movements can help us under…

Cited by 4 publications (14 citation statements). References 35 publications.
“…In this section, we describe our multimodal Spoken Narratives And Gaze (SNAG) dataset (Vaidyanathan et al., 2018) that is used to evaluate the proposed framework. This dataset contains eye movements and spoken narratives co-captured from participants while viewing general domain images (Figure 3) and has been released to the research community.…”
Section: Multimodal Data Collection (mentioning)
Confidence: 99%
“…Reference alignments (ground truth) were prepared using a GUI called RegionLabeler (Vaidyanathan et al., 2018) to allow evaluation of the resulting multimodal alignments. This represented the manual alignments obtained by associating each fixation cluster in the case of MFSC and image segment in the case of image segmentation with its corresponding word tokens (linguistic units).…”
Section: Alignment (mentioning)
Confidence: 99%
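The alignment step quoted above pairs each fixation cluster (or image segment) with the word tokens that describe it. As a rough illustration only, the following Python sketch shows one plausible way such reference alignments could be represented; all class names, fields, and data are hypothetical and are not taken from the cited work or its tools.

```python
# Hypothetical sketch: representing reference alignments that associate
# fixation clusters with word tokens (linguistic units).
# Names and data are illustrative, not from the cited dataset or RegionLabeler.

from dataclasses import dataclass, field


@dataclass
class FixationCluster:
    cluster_id: int
    # (x, y) fixation points in image pixel coordinates
    points: list[tuple[float, float]] = field(default_factory=list)


@dataclass
class Alignment:
    cluster: FixationCluster
    word_tokens: list[str] = field(default_factory=list)


def build_reference_alignments(clusters, token_links):
    """Associate each fixation cluster with its word tokens.

    token_links maps cluster_id -> list of word tokens, e.g. as exported
    from a manual annotation step.
    """
    return [
        Alignment(cluster=c, word_tokens=token_links.get(c.cluster_id, []))
        for c in clusters
    ]


if __name__ == "__main__":
    clusters = [
        FixationCluster(0, [(120.0, 85.0), (132.0, 90.0)]),
        FixationCluster(1, [(300.0, 210.0)]),
    ]
    links = {0: ["red", "lesion"], 1: ["left", "arm"]}
    for a in build_reference_alignments(clusters, links):
        print(a.cluster.cluster_id, a.word_tokens)
```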
“…We examine the usefulness of our general-domain dataset on image-region annotation, adapting the framework given by Vaidyanathan et al. (2016).…”
Section: Application To Multimodal Alignment (mentioning)
Confidence: 99%
“…Ho et al (2015) provide a dataset that consists only of gaze and speech time stamps during dyadic interactions. The closest dataset to ours is the multimodal but non-public data described by Vaidyanathan et al (2016).…”
Section: Related Workmentioning
confidence: 99%