2019
DOI: 10.48550/arxiv.1908.05054
Preprint

Fusion of Detected Objects in Text for Visual Question Answering

Cited by 36 publications (52 citation statements)
References 0 publications
“…Moreover, various pre-training objectives have also been proposed to utilize these datasets effectively. The most widely used objectives are image-text retrieval [2,37,47,55,63,64,65], masked language modeling with image clues [2,37,47,62,63,64,65], and masked region modeling [14,47,62,63,65]. Among them, masked region modeling requires regional features extracted by off-the-shelf object detectors.…”
Section: Related Work (mentioning)
confidence: 99%
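To make the masked-language-modeling-with-image-clues objective concrete, here is a minimal sketch, not taken from any of the cited papers: text tokens are masked BERT-style while regional image features stay visible, so the encoder can use visual context to recover the masked words. All function and variable names (mask_text_tokens, region_feats, etc.) are hypothetical.

import torch

def mask_text_tokens(token_ids, mask_token_id, vocab_size, mask_prob=0.15):
    """BERT-style masking of the text side only; regional image features
    are left untouched so the model can use them as clues."""
    labels = token_ids.clone()
    masked = torch.bernoulli(torch.full(token_ids.shape, mask_prob)).bool()
    labels[~masked] = -100  # unmasked positions are ignored by the loss

    token_ids = token_ids.clone()
    # 80% of masked positions -> [MASK], 10% -> random token, 10% -> unchanged
    replace = torch.bernoulli(torch.full(token_ids.shape, 0.8)).bool() & masked
    token_ids[replace] = mask_token_id
    randomize = torch.bernoulli(torch.full(token_ids.shape, 0.5)).bool() & masked & ~replace
    token_ids[randomize] = torch.randint(vocab_size, token_ids.shape)[randomize]
    return token_ids, labels

# Hypothetical usage: regional features from an off-the-shelf detector are
# projected and prepended to the (masked) text embeddings before the encoder.
# masked_ids, labels = mask_text_tokens(text_ids, mask_id, vocab_size)
# encoder_input = torch.cat([project(region_feats), embed(masked_ids)], dim=1)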
“…Besides, neural networks have attracted more attention for fusion, especially since the appearance of RNNs and LSTMs [36,47]. More recently, transformer-based [51] fusion has attracted growing attention [1,48,37,16,21], especially after its application to vision [7]. In addition to that, there are also some model-agnostic fusion methods, including simple concatenation [27,6,58] and element-wise operations [8,50].…”
Section: Related Work (mentioning)
confidence: 99%
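As a concrete illustration of the model-agnostic fusion methods mentioned in the excerpt above, the sketch below contrasts simple concatenation with element-wise (Hadamard) fusion. This is an illustrative PyTorch example; the module names and feature dimensions are assumptions, not taken from the cited works.

import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Fuse two modality vectors by concatenation followed by a projection."""
    def __init__(self, dim_v, dim_t, dim_out):
        super().__init__()
        self.proj = nn.Linear(dim_v + dim_t, dim_out)

    def forward(self, v, t):  # v: [batch, dim_v], t: [batch, dim_t]
        return self.proj(torch.cat([v, t], dim=-1))

class ElementwiseFusion(nn.Module):
    """Project both modalities to a shared space, then combine them
    with an element-wise product."""
    def __init__(self, dim_v, dim_t, dim_out):
        super().__init__()
        self.proj_v = nn.Linear(dim_v, dim_out)
        self.proj_t = nn.Linear(dim_t, dim_out)

    def forward(self, v, t):
        return self.proj_v(v) * self.proj_t(t)

# Hypothetical usage with 2048-d visual and 768-d text features:
# fuse = ConcatFusion(2048, 768, 512)
# joint = fuse(visual_feat, text_feat)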
“…For example, Gan et al [10] proposed a multi-step reasoning approach to answer a series of questions about an image with a recurrent dual attention mechanism. Recently, vision-and-language pre-training that aims to build joint cross-modal representations has attracted considerable attention from researchers [1,25,28,37,51,52]. Models based on Transformer encoders are designed for visually grounded tasks and yield prominent improvements, mainly on vision-language understanding.…”
Section: Visual Dialogue (mentioning)
confidence: 99%
“…Besides, almost every post-response pair in an open-domain dialogue has the following two features: (1) the post and the response do not share the same semantic space, and topic transitions often occur; (2) rather than word-level alignments, utterance-level semantic dependency exists in each pair. Therefore, when integrating visual impressions into open-domain dialogue generation, we need to take advantage of both post visual impressions (PVIs) and response visual impressions (RVIs).…”
(mentioning)
confidence: 99%