Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval 2022
DOI: 10.1145/3477495.3531753
ViQuAE, a Dataset for Knowledge-based Visual Question Answering about Named Entities

Abstract: Whether to retrieve, answer, translate, or reason, multimodality opens up new challenges and perspectives. In this context, we are interested in answering questions about named entities grounded in a visual context using a Knowledge Base (KB). To benchmark this task, called KVQAE (Knowledge-based Visual Question Answering about named Entities), we provide ViQuAE, a dataset of 3.7K questions paired with images. This is the first KVQAE dataset to cover a wide range of entity types (e.g. persons, landmarks, and p…

Cited by 11 publications (27 citation statements) | References 44 publications
“…Results on the ViQuAE test set are shown in Table 1. Surprisingly, we find that PaLM can read questions and generate answers with 31.5% accuracy, outperforming the SOTA retrieval-based model [24] (which has access to the image) on this dataset by 9.4%. Although PaLM is a much larger model, this experiment illustrates that it is possible to achieve very good performance on ViQuAE without using information from the image.…”
Section: The Need For a New Visual Information Seeking Benchmark
Citation type: mentioning | Confidence: 75%
“…Early efforts in this area, such as KBQA [50] and FVQA [49], were based on domain-specific knowledge graphs, while recent datasets like OK-VQA [33] and A-OKVQA [45] have improved upon this foundation by incorporating an open-domain approach and highlighting common-sense knowledge. Among the existing benchmarks, K-VQA [44] and ViQuAE [24] are most relevant to our study, but they have limitations in their question generation process, as discussed below. In our analysis, we focus on three crucial aspects when evaluating the performance of pre-trained models on these benchmarks: (1) the level of information-seeking intent, (2) the reliance on visual understanding, and (3) coverage of diverse entities.…”
Section: The Need For a New Visual Information Seeking Benchmark
Citation type: mentioning | Confidence: 99%