Visual Question Answering Using Deep Learning: A Survey and Performance Analysis

Srivastava, Yash; Murali, Vaishnav; Dubey, Shiv Ram; Mukherjee, Snehasis

doi:10.1007/978-981-16-1092-9_7

Cited by 20 publications

(16 citation statements)

References 35 publications

“…NLP must be assisted by multimodal control interfaces, identification and understanding of human behavior, and collaborative decision-making between the system and individuals or groups to understand the requirements of the customer and other stakeholders [77]. Visual question answering is a method that addresses the challenging unimodal aspect of NLP systems [78]. Many other methods are used to integrate multimodality into NLP structures, including declarative learning-based programming [79], multimodal datasets [80], procedural reasoning networks [81], and unified attention networks [82].…”

Section: Current Limitations of NLP in Requirements Elicitation and Requirements Analysis (mentioning)

confidence: 99%

A Bird’s Eye View of Natural Language Processing and Requirements Engineering

Alzayed, Al-Hunaiyyan

2021

IJACSA

Natural Language Processing (NLP) has demonstrated effectiveness in many application domains. NLP can assist software engineering by automating various activities. This paper examines the interaction between software requirements engineering (RE) and NLP. We reviewed the current literature to evaluate how NLP supports RE and to examine research developments. This literature review indicates that NLP is being employed in all the phases of the RE domain. This paper focuses on the phases of elicitation and the analysis of requirements. RE communication issues are primarily associated with the elicitation and analysis phases of the requirements. These issues include ambiguity, inconsistency, and incompleteness. Many of these problems stem from a lack of participation by the stakeholders in both phases. Thus, we address the application of NLP during the process of requirements elicitation and analysis. We discuss the limitations of NLP in these two phases. Potential future directions for the domain are examined. This paper asserts that human involvement with knowledge about the domain and the specific project is still needed in the RE process despite progress in the development of NLP systems.

“…VQA Datasets. Many large-scale VQA datasets have been proposed over the past six years [19,33,34,37]. A key challenge the community has faced in developing such datasets is the language bias problem [12,23,27,30].…”

Section: Related Work (mentioning)

confidence: 99%

Grounding Answers for Visual Questions Asked by Visually Impaired People

Chen, Anjum, Gurari

2022

Preprint

Visual question answering is the task of answering questions about images. We introduce the VizWiz-VQA-Grounding dataset, the first dataset that visually grounds answers to visual questions asked by people with visual impairments. We analyze our dataset and compare it with five VQA-Grounding datasets to demonstrate what makes it similar and different. We then evaluate the SOTA VQA and VQA-Grounding models and demonstrate that current SOTA algorithms often fail to identify the correct visual evidence where the answer is located. These models regularly struggle when the visual evidence occupies a small fraction of the image, for images that are higher quality, as well as for visual questions that require skills in text recognition. The dataset, evaluation server, and leaderboard all can be found at the following link: https://vizwiz.org/tasks-and-datasets/answer-grounding-for-vqa/.

“…This pioneering work was immediately followed by a vigorous worldwide effort aimed at building new datasets and models (Antol et al., 2015; Gao et al., 2015; Geman et al., 2015; Goyal et al., 2016, 2017; Malinowski et al., 2015; M. Ren, Kiros et al., 2015; Yu et al., 2015). This effort has been exhaustively summarized in various surveys (Kafle & Kanan, 2017b; Manmadhan & Kovoor, 2020; Srivastava et al., 2021; Wu et al., 2017), as well as tutorials (Kordjamshidi et al., 2020; Teney et al., 2017). 1 In particular, Srivastava et al.…”

Section: The Recent Revival of VQA (mentioning)

confidence: 99%

“…1 In particular, Srivastava et al. (2021) nicely sketch the timeline of the major breakthroughs in VQA in the last five years, whilst Wu et al. (2017) provide interesting connections with structured knowledge base and an in‐depth description of the question/answer pairs present in VQA datasets.…”

Section: The Recent Revival of VQA (mentioning)

confidence: 99%

“…Building on this early VQA baseline, a plethora of models have been proposed. Since exhaustive overview papers are already available (Kafle & Kanan, 2017b; Manmadhan & Kovoor, 2020; Srivastava et al., 2021; Wu et al., 2017), here we do not review all the approaches and models that have been proposed. Instead, we highlight and explain the major milestones that have been achieved and that we can relate to our desiderata listed in Section 1.…”

Section: The Recent Revival of VQA (mentioning)

confidence: 99%

Linguistic issues behind visual question answering

Bernardi, Pezzelle

2021

Language and Linguist. Compass

Answering a question that is grounded in an image is a crucial ability that requires understanding the question, the visual context, and their interaction at many linguistic levels: among others, semantics, syntax and pragmatics. As such, visually‐grounded questions have long been of interest to theoretical linguists and cognitive scientists. Moreover, they have inspired the first attempts to computationally model natural language understanding, where pioneering systems were faced with the highly challenging task—still unsolved—of jointly dealing with syntax, semantics and inference whilst understanding a visual context. Boosted by impressive advancements in machine learning, the task of answering visually‐grounded questions has experienced a renewed interest in recent years, to the point of becoming a research sub‐field at the intersection of computational linguistics and computer vision. In this paper, we review current approaches to the problem which encompass the development of datasets, models and frameworks. We conduct our investigation from the perspective of the theoretical linguists; we extract from pioneering computational linguistic work a list of desiderata that we use to review current computational achievements. We acknowledge that impressive progress has been made to reconcile the engineering with the theoretical view. At the same time, we claim that further research is needed to get to a unified approach which jointly encompasses all the underlying linguistic problems. We conclude the paper by sharing our own desiderata for the future.
