Visuo-Linguistic Question Answering (VLQA) Challenge

Sampat, Shailaja Keyur; Yang, Yezhou; Baral, Chitta

doi:10.18653/v1/2020.findings-emnlp.413

Cited by 8 publications

(3 citation statements)

References 29 publications


“…Knowledge-based visual question answering (Wang et al, 2017, 2018; Marino et al, 2019; Sampat et al, 2020) proposed benchmark datasets for knowledge-based visual question answering that requires reasoning about an image on the basis of facts from a large-scale knowledge base (KB) such as Freebase (Bollacker et al, 2008) or DBPedia (Auer et al, 2007). To solve the task, two pioneering studies (Wang et al, 2017, 2018) suggested logical parsing-based methods which convert a question to a KB logic query using predefined query templates and execute the generated query on KB for searching an answer.…”

Section: Related Work (mentioning)

confidence: 99%

Hypergraph Transformer: Weakly-Supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering

Heo, Kim, Choi et al. 2022

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)


Knowledge-based visual question answering (QA) aims to answer a question which requires visually-grounded external knowledge beyond the image content itself. Answering complex questions that require multi-hop reasoning under weak supervision is considered a challenging problem since i) no supervision is given to the reasoning process and ii) high-order semantics of multi-hop knowledge facts need to be captured. In this paper, we introduce the concept of a hypergraph to encode high-level semantics of a question and a knowledge base, and to learn high-order associations between them. The proposed model, Hypergraph Transformer, constructs a question hypergraph and a query-aware knowledge hypergraph, and infers an answer by encoding inter-associations between the two hypergraphs and intra-associations within each hypergraph itself. Extensive experiments on two knowledge-based visual QA and two knowledge-based textual QA datasets demonstrate the effectiveness of our method, especially for the multi-hop reasoning problem. Our source code is available at https://github.com/yujungheo/kbvqa-public.



“…Scientific problem solving has recently been employed to evaluate the multi-hop reasoning capability and interpretability of AI systems (Kembhavi et al 2017; Sampat, Yang, and Baral 2020; Dalvi et al 2021). However, these datasets (Kembhavi et al 2017; Jansen et al 2018) suffer from limited scale.…”

Section: Introduction (mentioning)

confidence: 99%

T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering

Wang, Hu, et al. 2024

AAAI


Large Language Models (LLMs) have recently demonstrated exceptional performance in various Natural Language Processing (NLP) tasks. They have also shown the ability to perform chain-of-thought (CoT) reasoning to solve complex problems. Recent studies have explored CoT reasoning in complex multimodal scenarios, such as the science question answering task, by fine-tuning multimodal models with high-quality human-annotated CoT rationales. However, collecting high-quality CoT rationales is usually time-consuming and costly. Besides, the annotated rationales are often inaccurate because essential external information is missing. To address these issues, we propose a novel method termed T-SciQ that aims at teaching science question answering with LLM signals. The T-SciQ approach generates high-quality CoT rationales as teaching signals and is advanced to train much smaller models to perform CoT reasoning in complex modalities. Additionally, we introduce a novel data mixing strategy to produce more effective teaching data samples for simple and complex science question answering problems. Extensive experimental results show that our T-SciQ method achieves a new state-of-the-art performance on the ScienceQA benchmark, with an accuracy of 96.18%. Moreover, our approach outperforms the most powerful fine-tuned baseline by 4.5%. The code is publicly available at https://github.com/T-SciQ/T-SciQ.


“…Scientific problem solving has recently been employed to evaluate the multi-hop reasoning capability and interpretability of AI systems [6,12,28]. However, these datasets [11,12] suffer from limited scale.…”

Section: Introduction (mentioning)

confidence: 99%

Preface: celebrating the 60th anniversary of the University of Science and Technology of China

Yang 2018

Sci. China Chem.


Most Neural Radiance Fields (NeRFs) have poor generalization ability, limiting their application when representing multiple scenes by a single model. To ameliorate this problem, existing methods simply condition NeRF models on image features, lacking the global understanding and modeling of the entire 3D scene. Inspired by the significant success of mask-based modeling in other research fields, we propose a masked ray and view modeling method for generalizable NeRF (MRVM-NeRF), the first attempt to incorporate mask-based pretraining into 3D implicit representations. Specifically, considering that the core of NeRFs lies in modeling 3D representations along the rays and across the views, we randomly mask a proportion of sampled points along the ray at the fine stage by discarding partial information obtained from multiple viewpoints, with the aim of predicting the corresponding features produced in the coarse branch. In this way, the prior knowledge of 3D scenes learned during pretraining helps the model generalize better to novel scenarios after finetuning. Extensive experiments demonstrate the superiority of our proposed MRVM-NeRF under various synthetic and real-world settings, both qualitatively and quantitatively. Our empirical studies reveal the effectiveness of our proposed MRVM, which is specifically designed for NeRF models.

