2019
DOI: 10.1609/aaai.v33i01.33018076

TallyQA: Answering Complex Counting Questions

Abstract: Most counting questions in visual question answering (VQA) datasets are simple and require no more than object detection. Here, we study algorithms for complex counting questions that involve relationships between objects, attribute identification, reasoning, and more. To do this, we created TallyQA, the world's largest dataset for open-ended counting. We propose a new algorithm for counting that uses relation networks with region proposals. Our method lets relation networks be efficiently used with high-resolution…
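To make the abstract's method concrete, below is a minimal, illustrative PyTorch sketch of a relation network applied to region-proposal features. It is not the authors' released implementation: the class name ProposalRelationNet, all layer sizes, the question-embedding dimension, and the 0-15 count range are assumptions chosen for the example. It follows the general relation-network recipe of Santoro et al. (2017): embed every pair of proposal features together with the question, sum over pairs, then classify the pooled vector into a count.

# Minimal sketch (not the authors' exact model) of a relation network
# over region-proposal features: a shared MLP g scores every
# (proposal_i, proposal_j, question) triple, the pair embeddings are
# summed, and an MLP f maps the pooled vector to count logits.
# All names and layer sizes below are illustrative assumptions.
import torch
import torch.nn as nn

class ProposalRelationNet(nn.Module):
    def __init__(self, feat_dim=2048, q_dim=1024, hidden=512, max_count=15):
        super().__init__()
        # g: embeds one pair of proposals conditioned on the question.
        self.g = nn.Sequential(
            nn.Linear(2 * feat_dim + q_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # f: maps the pooled relation embedding to count logits 0..max_count.
        self.f = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, max_count + 1),
        )

    def forward(self, props, q):
        # props: (B, N, feat_dim) region-proposal features
        # (e.g. from a Faster R-CNN detector); q: (B, q_dim) question encoding.
        B, N, D = props.shape
        a = props.unsqueeze(2).expand(B, N, N, D)   # proposal i
        b = props.unsqueeze(1).expand(B, N, N, D)   # proposal j
        qq = q.unsqueeze(1).unsqueeze(1).expand(B, N, N, q.size(-1))
        pairs = torch.cat([a, b, qq], dim=-1)       # (B, N, N, 2D + q_dim)
        rel = self.g(pairs).sum(dim=(1, 2))         # pool over all N*N pairs
        return self.f(rel)                          # (B, max_count + 1) logits

# Usage sketch: 36 proposals per image, batch of 2.
net = ProposalRelationNet()
logits = net(torch.randn(2, 36, 2048), torch.randn(2, 1024))
count = logits.argmax(dim=-1)  # predicted count per image

Working over N region proposals keeps the pairwise term at O(N^2) for a small N (e.g. 36 boxes) regardless of input resolution, which is one way to read the abstract's claim that region proposals let relation networks be used efficiently with high-resolution imagery.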

Cited by 43 publications (46 citation statements) | References 1 publication
“…Synsets: Figure 4 shows the counts of examples per synset in the training and development sets. Image Pair Reasoning: We use a 200-sentence subset of the sentences analyzed in Table 5: (3) existential and (4) universal quantifiers; (5) coordination; (6) coreference; (7) spatial relations; (8) presupposition; (9) preposition attachment ambiguity.
Datasets and tasks:
- VQA1.0 (Antol et al., 2015), VQA-CP (Agrawal et al., 2017), VQA2.0 (Goyal et al., 2017): Visual Question Answering; Referring Expression Generation
- VQA (Abstract) (Zitnick and Parikh, 2013): Visual Question Answering
- ReferItGame (Kazemzadeh et al., 2014): Referring Expression Resolution
- SHAPES (Andreas et al., 2016): Visual Question Answering
- Bisk et al. (2016): Instruction Following
- MSCOCO (Chen et al., 2016): Caption Generation
- Google RefExp (Mao et al., 2016): Referring Expression Resolution
- ROOM-TO-ROOM (Anderson et al., 2018): Instruction Following
- Visual Dialog (Das et al., 2017): Dialogue; Visual Question Answering
- CLEVR (Johnson et al., 2017a): Visual Question Answering
- CLEVR-Humans (Johnson et al., 2017b): Visual Question Answering
- TDIUC (Kafle and Kanan, 2017): Visual Question Answering
- ShapeWorld (Kuhnle and Copestake, 2017): Binary Sentence Classification
- FigureQA (Kahou et al., 2018): Visual Question Answering
- TVQA (Lei et al., 2018): Video Question Answering
- LANI & CHAI (Misra et al., 2018): Instruction Following
- Talk the Walk (de Vries et al., 2018): Dialogue; Instruction Following
- COG (Yang et al., 2018): Visual Question Answering; Instruction Following
- VCR (Zellers et al., 2019): Visual Question Answering
- TallyQA (Acharya et al., 2019): Visual Question Answering
What to avoid…”
Section: Additional Data Analysis (citation type: mentioning; confidence: 99%)
“…Tally-QA: Very recently, in 2019, the Tally-QA [1] dataset was proposed; it is the largest dataset for object counting in the open-ended setting. The dataset includes both simple and complex question types, which can be seen in Fig.…”
Section: Datasets (citation type: mentioning; confidence: 99%)
“…In this survey, we first cover the major datasets published for validating the Visual Question Answering task, such as the VQA dataset [2], DAQUAR [19], and Visual7W [38], as well as the most recent datasets up to 2019, including Tally-QA [1] and KVQA [25]. Next, we discuss state-of-the-art architectures designed for the task of Visual Question Answering, such as Vanilla VQA [2], Stacked Attention Networks [32], and Pythia v1.0 [10].…”
Section: Introduction (citation type: mentioning; confidence: 99%)
“…Reasoning-based VQA: Reasoning-based VQA datasets aim to measure a system's capability to reason about a set of objects, their attributes, and their relationships. HowManyQA (Trott et al., 2017) and TallyQA (Acharya et al., 2019) contain object-counting questions over images. SNLI-VE (Xie et al., 2019) and VCOPA (Yeo et al., 2018) focus on causal reasoning, whereas CLEVR (Johnson et al., 2017) and NLVR (Suhr et al., 2017) target spatial reasoning.…”
Section: Visual Question Answering (VQA) (citation type: mentioning; confidence: 99%)