Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022
DOI: 10.18653/v1/2022.acl-long.66

Premise-based Multimodal Reasoning: Conditional Inference on Joint Textual and Visual Clues

Abstract: It is common practice for recent work in vision-language cross-modal reasoning to adopt a binary or multi-choice classification formulation, taking as input a set of source image(s) and a textual query. In this work, we take a sober look at such an "unconditional" formulation, in the sense that no prior knowledge is specified with respect to the source image(s). Inspired by the designs of both visual commonsense reasoning and natural language inference tasks, we propose a new task termed "Premise-based Multi-modal Reasoning" …
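To make the contrast between the two formulations concrete, here is a minimal sketch of the instance structure as plain data classes. The class and field names (UnconditionalInstance, PremiseBasedInstance, premise, and the example values) are illustrative assumptions for this sketch, not the dataset's actual schema: the point is only that the premise-based variant adds textual prior knowledge that the selected choice must be consistent with, jointly with the visual evidence.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class UnconditionalInstance:
    """Conventional multi-choice formulation: the model picks an
    answer given only the source image(s) and a textual query."""
    image_paths: List[str]
    query: str
    choices: List[str]
    label: int  # index of the gold choice

@dataclass
class PremiseBasedInstance(UnconditionalInstance):
    """Premise-based variant: a textual premise supplies prior
    knowledge about the image(s); the chosen inference must follow
    from the premise and the visual clues jointly."""
    premise: str = ""

# Hypothetical example instance (all field values are illustrative).
example = PremiseBasedInstance(
    image_paths=["scene.jpg"],
    query="What will the man in the apron most likely do next?",
    choices=[
        "He will bring the customers their food.",
        "He will leave the restaurant.",
        "He will sit down and start eating.",
        "He will take a photo of the kitchen.",
    ],
    label=0,
    premise="The man in the apron is a waiter who has just taken an order.",
)
```

Note how the same image and choices could admit a different gold answer under a different premise; that conditioning is what distinguishes the proposed task from the "unconditional" classification setup.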

Cited by 1 publication (2 citation statements)
References 17 publications
“…Person-Centric Vision-Language Task: The person-centric vision-language task (Zellers et al., 2019; Dong et al., 2022; Cui et al., 2021; You et al., 2022) is mainly based on grounding references to a person; therefore, person-centric visual grounding ability is a crucial component. VCR (Zellers et al., 2019) is a task that answers commonsensical questions about the people depicted in an image.…”
Section: Related Work
confidence: 99%
“…In the fields of vision and language, several datasets (Zellers et al., 2019; Park et al., 2020; Lei et al., 2020; Dong et al., 2022; You et al., 2022) that focus on reasoning about individuals using visual commonsense knowledge have been proposed. In one notable dataset, VCR (Zellers et al., 2019), models are required to provide answers with justifications for commonsense questions related to the individuals in the given images.…”
Section: Introduction
confidence: 99%