2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2018.00519
Transparency by Design: Closing the Gap Between Performance and Interpretability in Visual Reasoning

Abstract: Visual question answering requires high-order reasoning about an image, which is a fundamental capability needed by machine systems to follow complex directives. Recently, modular networks have been shown to be an effective framework for performing visual reasoning tasks. While modular networks were initially designed with a degree of model transparency, their performance on complex visual reasoning benchmarks was lacking. Current state-of-the-art approaches do not provide an effective mechanism for understandi…



Cited by 169 publications (173 citation statements)
References 23 publications
“…Very interestingly, we found that the module trained on translating the last module output to a segmentation mask is general, and can produce excellent human-interpretable segmentation masks when attached to intermediate module outputs, revealing the entire reasoning process. We believe ours is the first to show clean visualization of the visual reasoning process carried out by neural module networks, as opposed to gradient norms [16] or soft attention maps [27, 9].…”
Section: Introduction
confidence: 95%
“…Neural Modular Network: A Neural Modular Network (NMN) is a class of models composed of a number of sub-modules, where each sub-module performs a specific sub-task. In NMN (Andreas et al., 2016b), N2NMN (Hu et al., 2017), PG+EE (Johnson et al., 2017), GroundNet (Cirik et al., 2018), and TbD (Mascharka et al., 2018), the entire reasoning procedure starts by analyzing the question and decomposing it into a sequence of sub-tasks, each with a corresponding module. This is done by either a parser (Andreas et al., 2016b; Cirik et al., 2018) or a layout policy (Hu et al., 2017; Johnson et al., 2017; Mascharka et al., 2018) that turns the question into a module layout.…”
Section: Related Work
confidence: 99%
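The question-to-layout pipeline described above can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in: in a real system (e.g. N2NMN or TbD) the layout is predicted by a learned parser or layout policy and each module is a small neural network; here modules are plain functions over toy attention maps, just to show how a predicted layout is executed as a program.

```python
# Minimal sketch of executing a neural module network (NMN) layout.
# Module implementations and the hard-coded layout are illustrative
# stand-ins, not the architecture of any cited paper.
from typing import Dict, List, Tuple

import numpy as np

Attention = np.ndarray  # a soft spatial attention map over image regions

def find(attr: str, image: Dict[str, Attention]) -> Attention:
    """Attend to regions matching an attribute (toy lookup)."""
    return image[attr]

def intersect(a: Attention, b: Attention) -> Attention:
    """Combine two attention maps (soft logical AND)."""
    return np.minimum(a, b)

def count(a: Attention) -> int:
    """Reduce an attention map to a numeric answer."""
    return int(a.round().sum())

def execute(layout: List[Tuple], image: Dict[str, Attention]):
    """Run a module layout in postfix order, like a small program."""
    stack = []
    for op, *args in layout:
        if op == "find":
            stack.append(find(args[0], image))
        elif op == "intersect":
            b, a = stack.pop(), stack.pop()
            stack.append(intersect(a, b))
        elif op == "count":
            stack.append(count(stack.pop()))
    return stack.pop()

# Toy "image": a 2x2 attention map per attribute.
image = {
    "red":   np.array([[1.0, 0.0], [1.0, 0.0]]),
    "large": np.array([[1.0, 1.0], [0.0, 0.0]]),
}
# A layout a parser might emit for "How many large red things are there?"
layout = [("find", "red"), ("find", "large"), ("intersect",), ("count",)]
print(execute(layout, image))  # → 1
```

Because each intermediate stack entry is an attention map, every step of the reasoning chain can be inspected directly, which is the transparency property the cited works aim for.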
“…Object bounding boxes are graph nodes, while edges are learned using an attention model conditioned on the question. Mascharka et al. [27] augment a deep network architecture with an image-space attention mechanism based on a set of composable visual reasoning primitives that help examine the intermediate outputs of each module. Li et al. [23] introduce a captioning model to generate an image's description, reason over the caption and the question to construct an answer, and use the caption to explain the answer.…”
Section: Related Work
confidence: 99%
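The "image-space attention" primitive mentioned above can be sketched in a hedged, simplified form: a module takes spatial image features and a query vector and emits a soft attention map over image locations that a human can inspect directly. The function names, shapes, and dot-product scoring below are assumptions for illustration; the cited papers use learned convolutional modules instead.

```python
# Hedged sketch of an image-space attention primitive: score each
# spatial location against a query vector, then normalize with a
# softmax so the map is a distribution over locations.
import numpy as np

def softmax2d(x: np.ndarray) -> np.ndarray:
    """Softmax over all entries of a 2-D score map."""
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def attend(features: np.ndarray, query: np.ndarray) -> np.ndarray:
    """features: (H, W, C) image features; query: (C,) text embedding.
    Returns an (H, W) attention map via dot-product similarity."""
    scores = features @ query  # (H, W)
    return softmax2d(scores)

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 4, 8))
q = rng.normal(size=8)
attn = attend(feats, q)
print(attn.shape, round(float(attn.sum()), 6))  # (4, 4) 1.0
```

Because the output is a normalized map over image locations, it can be overlaid on the input image to visualize what each reasoning step is looking at, which is exactly why such primitives aid interpretability.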