2021
DOI: 10.1007/978-3-030-86534-4_7
Cognitive Visual Commonsense Reasoning Using Dynamic Working Memory

Cited by 6 publications (11 citation statements)
References 15 publications
“…It focuses on encoding the relationships between sentences using sequence-to-sequence encoding methods. These methods infer rationales by encoding long-range dependencies between sentences (see, e.g., R2C [5], TAB-VCR [10], and DMVCR [13]). However, these models lose reasoning information across long dependency structures, and it is hard for them to infer rationales based on commonsense knowledge about the world.…”
Section: Related Work (mentioning)
confidence: 99%
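
For context, the pattern this excerpt criticizes looks roughly like the following minimal sketch (assuming PyTorch; the class name, sizes, and scoring head are illustrative, not the cited papers' code). The question and rationale are concatenated into one token sequence, so information about early tokens must survive every recurrence step, which is the long-dependency bottleneck being described:

    # Minimal sketch of a seq2seq-style rationale encoder; all names
    # and dimensions are illustrative assumptions.
    import torch
    import torch.nn as nn

    class Seq2SeqRationaleEncoder(nn.Module):
        def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.score = nn.Linear(hidden_dim, 1)  # plausibility of the rationale

        def forward(self, question_ids, rationale_ids):
            # Concatenate both sentences into one long sequence; the LSTM must
            # carry the question through every rationale step via its hidden state.
            tokens = torch.cat([question_ids, rationale_ids], dim=1)
            states, (h_n, _) = self.lstm(self.embed(tokens))
            return self.score(h_n[-1])  # score read off the final hidden state

    enc = Seq2SeqRationaleEncoder()
    q = torch.randint(0, 10000, (2, 12))  # batch of 2 questions, 12 tokens each
    r = torch.randint(0, 10000, (2, 30))  # candidate rationales, 30 tokens each
    print(enc(q, r).shape)                # torch.Size([2, 1])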
“…We theoretically analyze the inadequacy of previous work [13] and introduce a new, effective model that reduces sequential computation. In [13], the number of operations needed to relate signals from two positions grows with the distance between the positions, which makes it difficult to learn dependencies between distant positions in a sequential task. In the newly proposed multimodal feature fusion layer and commonsense encoder submodules, we designed a parallel attention structure to limit the number of operations.…”
Section: Introduction (mentioning)
confidence: 99%
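
The parallel attention idea contrasts with that recurrence: in scaled dot-product attention, all position pairs are related in a single matrix product, so the number of operations linking any two positions is constant in their distance. A minimal, generic sketch of that property (standard attention math, not the paper's exact submodule; shapes are illustrative):

    # Self-attention relates every pair of positions at once; learned
    # query/key/value projections are omitted for brevity.
    import torch
    import torch.nn.functional as F

    def parallel_attention(x):
        # x: (batch, seq_len, dim); queries, keys, values all taken from x.
        d = x.size(-1)
        scores = x @ x.transpose(1, 2) / d ** 0.5  # (B, T, T): all pairs in one product
        return F.softmax(scores, dim=-1) @ x       # every output attends to every input

    x = torch.randn(2, 30, 64)
    print(parallel_attention(x).shape)  # torch.Size([2, 30, 64])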
“…R2C [5] is a representative example of this line of effort, in which an attention-based deep model is used for visual inference. More recently, a framework of dynamic-working-memory cells was proposed to provide prior knowledge for inference [14]. Our model most closely resembles this method, with two distinctions: i) a parallel structure is explicitly designed to relax the dependence on previous cells, alleviating the information loss that long-dependency memory cells suffer on long sequences; and ii) a newly proposed co-attention network replaces the dynamic working memory cell, which not only eases model training but also enhances the ability to capture relationships between sentences and semantic information from surrounding words.…”
Section: Related Work (mentioning)
confidence: 99%
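
A co-attention block of the kind this excerpt describes can be sketched as two sequences attending to each other through a shared affinity matrix, with no recurrent cell on either side. This is a generic co-attention pattern under assumed names and shapes, not the citing paper's exact network:

    # Generic co-attention: each sequence is summarized for every position
    # of the other through one affinity matrix; dimensions are illustrative.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CoAttention(nn.Module):
        def __init__(self, dim=64):
            super().__init__()
            self.affinity = nn.Linear(dim, dim, bias=False)

        def forward(self, a, b):
            # a: (B, Ta, D), b: (B, Tb, D)
            sim = self.affinity(a) @ b.transpose(1, 2)         # (B, Ta, Tb) affinities
            a_ctx = F.softmax(sim, dim=2) @ b                  # b summarized per a-word
            b_ctx = F.softmax(sim, dim=1).transpose(1, 2) @ a  # a summarized per b-word
            return a_ctx, b_ctx

    coatt = CoAttention()
    q, r = torch.randn(2, 12, 64), torch.randn(2, 30, 64)
    q_ctx, r_ctx = coatt(q, r)
    print(q_ctx.shape, r_ctx.shape)  # (2, 12, 64) and (2, 30, 64)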
“…We compare our method with several state-of-the-art visual scene understanding models, using the mean average precision metric on the three tasks Q2A, QA2R, and Q2AR: 1) MUTAN [19] is a multimodal visual question answering approach that parametrizes bilinear interactions between visual and textual representations with a Tucker decomposition; 2) BERT-base [18] is a powerful pretrained language model adapted here for commonsense reasoning; 3) R2C [5] encodes commonsense relations between sentences with an LSTM; 4) DMVCR [14] trains a dynamic working memory that stores commonsense during training and uses it as prior knowledge at inference time. Among them, BERT-base relies on pretraining, while MUTAN, R2C, and DMVCR do not.…”
Section: Understanding Visual Scenes (mentioning)
confidence: 99%
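
Of the baselines listed, MUTAN's fusion is the most self-contained to sketch. The following is a rough, diagonal-core simplification of the Tucker-decomposed bilinear interaction the excerpt describes, so the full bilinear tensor is never materialized (MUTAN's full core tensor is omitted for brevity; all dimensions and names are illustrative assumptions):

    # Tucker-style bilinear fusion, diagonal-core simplification:
    # project each modality into a shared core space, take an elementwise
    # product, then project to the output space.
    import torch
    import torch.nn as nn

    class TuckerFusion(nn.Module):
        def __init__(self, v_dim=2048, t_dim=768, core_dim=160, out_dim=360):
            super().__init__()
            self.Wv = nn.Linear(v_dim, core_dim)    # visual factor
            self.Wt = nn.Linear(t_dim, core_dim)    # textual factor
            self.Wo = nn.Linear(core_dim, out_dim)  # core-to-output factor

        def forward(self, v, t):
            # Elementwise product of the projections realizes a restricted
            # bilinear form between v and t through the low-rank core space.
            return self.Wo(self.Wv(v) * self.Wt(t))

    fusion = TuckerFusion()
    v = torch.randn(2, 2048)   # e.g., pooled CNN image features
    t = torch.randn(2, 768)    # e.g., a sentence embedding
    print(fusion(v, t).shape)  # torch.Size([2, 360])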