Given an image depicting multiple individuals, humans can infer each individual's emotions, intentions, and social norms based on commonsense understanding. However, a machine's ability to perform commonsense reasoning about distinct individuals in an image remains underexplored. In this study, we examine the consistency of visual commonsense reasoning under person grounding. We introduce a novel test dataset, Visual Commonsense Reasoning-Contrast Sets (VCR-CS), which evaluates whether models can reason about individual people in an image by changing the person tags in the questions and answers. We benchmark various vision-language models on VCR-CS and observe that they fail to reason consistently about different people in the same image, showing a performance decrease of up to 31.5%. To mitigate such failures, we propose a multi-task learning framework called Personcentric groundIng eNhanced Tuning (PINT). Our framework enhances a model's person-grounded commonsense reasoning by leveraging two novel person-centric pretraining tasks: Image Person-based Text Matching and Person-Masked Language Modeling. Experimental results demonstrate the effectiveness of PINT, which exhibits the lowest performance degradation on VCR-CS and improves on consistency and sensitivity metrics. Our dataset and code are publicly available.