Who’s Waldo? Linking People Across Text and Images

Cui, Claire; Khandelwal, Apoorv; Artzi, Yoav; Snavely, Noah; Averbuch‐Elor, Hadar

doi:10.1109/iccv48922.2021.00141

Cited by 8 publications

(12 citation statements)

References 44 publications


“…We automatically curate data from WC to construct FTT (Figure 2) as follows: (1) The “People by name” category on WC contains 407K distinct people identities. We query each identity's hierarchy of people‐centric subcategories (similar to [CKA*21]) and organize retrieved images by identity. (2) We use a Faster R‐CNN model [RHGS15; JL17; RR17] trained on the WIDER Face dataset [YLLT16] as a face detector.…”

Section: The Faces Through Time Dataset (mentioning)

confidence: 99%

What's in a Decade? Transforming Faces Through Time

Chen, Sun, Khandelwal et al. 2023

Computer Graphics Forum

Self Cite

How can one visually characterize photographs of people over time? In this work, we describe the Faces Through Time dataset, which contains over a thousand portrait images per decade from the 1880s to the present day. Using our new dataset, we devise a framework for resynthesizing portrait images across time, imagining how a portrait taken during a particular decade might have looked had it been taken in other decades. Our framework optimizes a family of per‐decade generators that reveal subtle changes that differentiate decades, such as different hairstyles or makeup, while maintaining the identity of the input portrait. Experiments show that our method can more effectively resynthesize portraits across time compared to state‐of‐the‐art image‐to‐image translation methods, as well as attribute‐based and language‐guided portrait editing models. Our code and data will be available at facesthroughtime.github.io.


“…Person-Centric Vision-Language Task: The person-centric vision-language task (Zellers et al, 2019; Dong et al, 2022; Cui et al, 2021; You et al, 2022) is mainly based on grounding references to a person; therefore, person-centric visual grounding ability is a crucial component. VCR (Zellers et al, 2019) is a task that answers commonsensical questions about the people depicted in an image.…”

Section: Related Work (mentioning)

confidence: 99%

“…The person-centric visual grounding task (Cui et al, 2021) aims to predict a mentioned person, given an image and a contextual textual description. The person-centric commonsense grounding task (You et al, 2022), which extends a person-centric visual grounding task to a commonsense domain, is designed to identify the person mentioned in the commonsense description in the image.…”

Section: Related Work (mentioning)

confidence: 99%

Examining Consistency of Visual Commonsense Reasoning based on Person Grounding

Kim, Kang, Lee 2023

Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacifi

Given an image depicting multiple individuals, humans are capable of inferring each individual's emotions, intentions, and social norms based on commonsense understanding. However, a machine's ability to perform commonsense reasoning about distinct individuals in images remains underexplored. In this study, we examine the consistency of visual commonsense reasoning based on person grounding. We introduce a novel test dataset called Visual Commonsense Reasoning-Contrast Sets (VCR-CS) to evaluate whether models can reason about individual people in an image by changing the person tags in the questions and answers. We benchmark various vision-language models on VCR-CS and observe that they fail in consistent commonsense reasoning about different people in one image, showing a performance decrease of up to 31.5%. To mitigate such failures, we propose a multi-task learning framework called Person-centric groundIng eNhanced Tuning (PINT). Our framework enhances a model's ability to perform person-grounded commonsense reasoning by leveraging two novel person-centric pretraining tasks: Image Person-based Text Matching and Person-Masked Language Modeling. The experimental results revealed the effectiveness of PINT by showing the lowest performance degradation on VCR-CS and the improvements in consistency and sensitivity metrics. Our dataset and code are publicly available.


“…MCR has recently gained increasing attention, with several notable studies (Ramanathan et al, 2014; Huang et al, 2018; Cui et al, 2021; Parcalabescu et al, 2021; Guo et al, 2022; Goel et al, 2022; Hong et al, 2023). However, many of them focus on images with simple short sentences, such as 'A woman is driving a motorcycle.…”

Section: Introduction (mentioning)

confidence: 99%

“…Is she wearing a helmet?' (Parcalabescu et al, 2021), or are limited to identifying movie characters or people (Ramanathan et al, 2014; Cui et al, 2021). More recently, Goel et al (2022) introduced a challenging and unconstrained MCR problem (see Figure 1) including a dataset, Coreferenced Image Narratives (CIN), with both people and objects as referents with long textual descriptions (narrations).…”

Section: Introduction (mentioning)

confidence: 99%

Semi-supervised multimodal coreference resolution in image narrations

Goel, Fernando, Keller et al. 2023

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

In this paper, we study multimodal coreference resolution, specifically where a longer descriptive text, i.e., a narration, is paired with an image. This poses significant challenges due to fine-grained image-text alignment, inherent ambiguity present in narrative language, and unavailability of large annotated training sets. To tackle these challenges, we present a data-efficient semi-supervised approach that utilizes image-narration pairs to resolve coreferences and narrative grounding in a multimodal context. Our approach incorporates losses for both labeled and unlabeled data within a cross-modal framework. Our evaluation shows that the proposed approach outperforms strong baselines, both quantitatively and qualitatively, on the tasks of coreference resolution and narrative grounding.
