2019
DOI: 10.48550/arxiv.1908.05054
Preprint

Fusion of Detected Objects in Text for Visual Question Answering

Cited by 36 publications (52 citation statements)
References 0 publications
“…Moreover, various pre-training objectives have also been proposed to utilize these datasets effectively. The most widely used objectives are image-text retrieval [2,37,47,55,63,64,65], masked language modeling with image clues [2,37,47,62,63,64,65], and masked region modeling [14,47,62,63,65]. Among them, masked region modeling requires regional features extracted by off-the-shelf object detectors.…”
Section: Related Work (mentioning)
confidence: 99%
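To make the masked-language-modeling-with-image-clues objective concrete, here is a minimal sketch, not taken from any of the cited papers: text tokens are masked BERT-style while regional image features stay visible, so the encoder can use visual context to recover the masked words. All function and variable names (mask_text_tokens, region_feats, etc.) are hypothetical.

import torch

def mask_text_tokens(token_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """BERT-style masking of the text side only; regional image features
    are left untouched so the model can use them as clues."""
    labels = token_ids.clone()
    masked = torch.bernoulli(torch.full(token_ids.shape, mask_prob)).bool()
    labels[~masked] = -100  # unmasked positions are ignored by the loss

    token_ids = token_ids.clone()
    # 80% of masked positions -> [MASK], 10% -> random token, 10% -> unchanged
    replace = torch.bernoulli(torch.full(token_ids.shape, 0.8)).bool() & masked
    token_ids[replace] = mask_token_id
    randomize = torch.bernoulli(torch.full(token_ids.shape, 0.5)).bool() & masked & ~replace
    token_ids[randomize] = torch.randint(vocab_size, token_ids.shape)[randomize]
    return token_ids, labels

# Hypothetical usage: regional features from an off-the-shelf detector are
# projected and prepended to the (masked) text embeddings before the encoder.
# masked_ids, labels = mask_text_tokens(text_ids, mask_id, vocab_size)
# encoder_input = torch.cat([project(region_feats), embed(masked_ids)], dim=1)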
“…Besides, neural networks have attracted more attention for fusion, especially since the appearance of RNNs and LSTMs [36,47]. More recently, transformer-based [51] fusion has attracted growing attention [1,48,37,16,21], especially after its application to vision [7]. In addition to that, there are also some model-agnostic fusion methods, including simple concatenation [27,6,58] and element-wise operations [8,50].…”
Section: Related Work (mentioning)
confidence: 99%
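As a concrete illustration of the model-agnostic fusion methods mentioned in the excerpt above, the sketch below contrasts simple concatenation with element-wise (Hadamard) fusion. This is an illustrative PyTorch example; the module names and feature dimensions are assumptions, not taken from the cited works.

import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Fuse two modality vectors by concatenation followed by a projection."""
    def __init__(self, dim_v, dim_t, dim_out):
        super().__init__()
        self.proj = nn.Linear(dim_v + dim_t, dim_out)

    def forward(self, v, t):  # v: [batch, dim_v], t: [batch, dim_t]
        return self.proj(torch.cat([v, t], dim=-1))

class ElementwiseFusion(nn.Module):
    """Project both modalities to a shared space, then combine them
    with an element-wise product."""
    def __init__(self, dim_v, dim_t, dim_out):
        super().__init__()
        self.proj_v = nn.Linear(dim_v, dim_out)
        self.proj_t = nn.Linear(dim_t, dim_out)

    def forward(self, v, t):
        return self.proj_v(v) * self.proj_t(t)

# Hypothetical usage with 2048-d visual and 768-d text features:
# fuse = ConcatFusion(2048, 768, 512)
# joint = fuse(visual_feat, text_feat)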
“…For example, Gan et al [10] proposed a multi-step reasoning approach to answer a series of questions about an image with a recurrent dual attention mechanism. Recently, vision-and-language pre-training that aims to build joint cross-modal representations has attracted considerable attention from researchers [1,25,28,37,51,52]. Models based on Transformer encoders are designed for visually grounded tasks and yield prominent improvements, mainly on vision-language understanding.…”
Section: Visual Dialogue (mentioning)
confidence: 99%
“…Besides, almost every post-response pair in an open-domain dialogue has the following two features: (1) the post and the response do not share the same semantic space, and topic transitions often occur; (2) rather than word-level alignments, utterance-level semantic dependency exists in each pair. Therefore, when integrating visual impressions into open-domain dialogue generation, we need to take advantage of both post visual impressions (PVIs) and response visual impressions (RVIs).…”
(mentioning)
confidence: 99%