2021
DOI: 10.1007/978-3-030-86534-4_7
Cognitive Visual Commonsense Reasoning Using Dynamic Working Memory

Cited by 6 publications (11 citation statements)
References 15 publications
“…It focuses on encoding the relationships between sentences using sequence-to-sequence encoding methods. These methods infer rationales by encoding long-range dependencies between sentences (see, e.g., R2C [5], TAB-VCR [10], and DMVCR [13]). However, these models lose reasoning information across long dependency structures, and it is hard for them to infer rationales based on commonsense knowledge about the world.…”
Section: Related Work (mentioning)
confidence: 99%
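
For context, the pattern this excerpt criticizes looks roughly like the following minimal sketch (assuming PyTorch; the class name, sizes, and scoring head are illustrative, not the cited papers' code). The question and rationale are concatenated into one token sequence, so information about early tokens must survive every recurrence step, which is the long-dependency bottleneck being described:

    # Minimal sketch of a seq2seq-style rationale encoder; all names
    # and dimensions are illustrative assumptions.
    import torch
    import torch.nn as nn

    class Seq2SeqRationaleEncoder(nn.Module):
        def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.score = nn.Linear(hidden_dim, 1)  # plausibility of the rationale

        def forward(self, question_ids, rationale_ids):
            # Concatenate both sentences into one long sequence; the LSTM must
            # carry the question through every rationale step via its hidden state.
            tokens = torch.cat([question_ids, rationale_ids], dim=1)
            states, (h_n, _) = self.lstm(self.embed(tokens))
            return self.score(h_n[-1])  # score read off the final hidden state

    enc = Seq2SeqRationaleEncoder()
    q = torch.randint(0, 10000, (2, 12))  # batch of 2 questions, 12 tokens each
    r = torch.randint(0, 10000, (2, 30))  # candidate rationales, 30 tokens each
    print(enc(q, r).shape)                # torch.Size([2, 1])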
“…We theoretically analyze the inadequacy of previous work [13] and introduce a new, effective model that reduces sequential computation. In [13], the number of operations needed to relate signals from two positions grows with the distance between the positions, which makes it difficult to learn dependencies between distant positions in a sequential task. In the newly proposed multimodal feature fusion layer and commonsense encoder submodules, we designed a parallel attention structure to limit the number of operations.…”
Section: Introduction (mentioning)
confidence: 99%
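
The parallel attention idea contrasts with that recurrence: in scaled dot-product attention, all position pairs are related in a single matrix product, so the number of operations linking any two positions is constant in their distance. A minimal, generic sketch of that property (standard attention math, not the paper's exact submodule; shapes are illustrative):

    # Self-attention relates every pair of positions at once; learned
    # query/key/value projections are omitted for brevity.
    import torch
    import torch.nn.functional as F

    def parallel_attention(x):
        # x: (batch, seq_len, dim); queries, keys, values all taken from x.
        d = x.size(-1)
        scores = x @ x.transpose(1, 2) / d ** 0.5  # (B, T, T): all pairs in one product
        return F.softmax(scores, dim=-1) @ x       # every output attends to every input

    x = torch.randn(2, 30, 64)
    print(parallel_attention(x).shape)  # torch.Size([2, 30, 64])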
“…R2C [5] is a representative example of this line of effort, in which an attention-based deep model is used for visual inference. More recently, a framework of dynamic-working-memory cells was proposed to provide prior knowledge for inference [14]. Our model most closely resembles this method, with two distinctions: i) a parallel structure is explicitly designed to relax the dependence on previous cells, alleviating the information loss that long-dependency memory cells suffer on long sequences; and ii) a newly proposed co-attention network replaces the dynamic working memory cell, which not only eases model training but also enhances the ability to capture relationships between sentences and semantic information from surrounding words.…”
Section: Related Work (mentioning)
confidence: 99%
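
A co-attention block of the kind this excerpt describes can be sketched as two sequences attending to each other through a shared affinity matrix, with no recurrent cell on either side. This is a generic co-attention pattern under assumed names and shapes, not the citing paper's exact network:

    # Generic co-attention: each sequence is summarized for every position
    # of the other through one affinity matrix; dimensions are illustrative.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CoAttention(nn.Module):
        def __init__(self, dim=64):
            super().__init__()
            self.affinity = nn.Linear(dim, dim, bias=False)

        def forward(self, a, b):
            # a: (B, Ta, D), b: (B, Tb, D)
            sim = self.affinity(a) @ b.transpose(1, 2)         # (B, Ta, Tb) affinities
            a_ctx = F.softmax(sim, dim=2) @ b                  # b summarized per a-word
            b_ctx = F.softmax(sim, dim=1).transpose(1, 2) @ a  # a summarized per b-word
            return a_ctx, b_ctx

    coatt = CoAttention()
    q, r = torch.randn(2, 12, 64), torch.randn(2, 30, 64)
    q_ctx, r_ctx = coatt(q, r)
    print(q_ctx.shape, r_ctx.shape)  # (2, 12, 64) and (2, 30, 64)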
“…We compare our method with several state-of-the-art visual scene understanding models, using the mean average precision metric on the three tasks Q2A, QA2R, and Q2AR: 1) MUTAN [19] is a multimodal visual question answering approach that parametrizes bilinear interactions between visual and textual representations with a Tucker decomposition; 2) BERT-base [18] is a powerful pretrained language model adapted here for commonsense reasoning; 3) R2C [5] encodes commonsense relations between sentences with an LSTM; 4) DMVCR [14] trains a dynamic working memory that stores commonsense during training and uses it as prior knowledge at inference time. Among them, BERT-base relies on pretraining, while MUTAN, R2C, and DMVCR do not.…”
Section: Understanding Visual Scenes (mentioning)
confidence: 99%
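
Of the baselines listed, MUTAN's fusion is the most self-contained to sketch. The following is a rough, diagonal-core simplification of the Tucker-decomposed bilinear interaction the excerpt describes, so the full bilinear tensor is never materialized (MUTAN's full core tensor is omitted for brevity; all dimensions and names are illustrative assumptions):

    # Tucker-style bilinear fusion, diagonal-core simplification:
    # project each modality into a shared core space, take an elementwise
    # product, then project to the output space.
    import torch
    import torch.nn as nn

    class TuckerFusion(nn.Module):
        def __init__(self, v_dim=2048, t_dim=768, core_dim=160, out_dim=360):
            super().__init__()
            self.Wv = nn.Linear(v_dim, core_dim)    # visual factor
            self.Wt = nn.Linear(t_dim, core_dim)    # textual factor
            self.Wo = nn.Linear(core_dim, out_dim)  # core-to-output factor

        def forward(self, v, t):
            # Elementwise product of the projections realizes a restricted
            # bilinear form between v and t through the low-rank core space.
            return self.Wo(self.Wv(v) * self.Wt(t))

    fusion = TuckerFusion()
    v = torch.randn(2, 2048)   # e.g., pooled CNN image features
    t = torch.randn(2, 768)    # e.g., a sentence embedding
    print(fusion(v, t).shape)  # torch.Size([2, 360])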