2020
DOI: 10.1007/978-3-030-58545-7_41

Spatially Aware Multimodal Transformers for TextVQA

Cited by 69 publications (63 citation statements)
References 29 publications

“…Attention models, introduced in Xu, Ba, Kiros, Cho, et al (2015), further improved performance in image captioning and were refined in bottom-up and top-down attention models (Anderson et al, 2018a). Transformer models (Vaswani et al, 2017) have been adapted to multimodal scenarios, such as image captioning and visual question answering (VQA), in works like Kant et al (2020) and Luo et al (2019), which won the conceptual captions challenge on the GCC dataset in 2019 (Sharma et al, 2018). Generic image captioning systems were trained on the MS-COCO or GCC benchmark using cross-entropy training.…”
Section: Related Work (mentioning)
confidence: 99%
“…Inspired by [14,20,43], we calculate the similarity between each pair of regions by their Intersection over Union (IoU) score. The region pairs with IoU scores larger than zero are considered to have edges in E, and their IoU scores are regarded as their similarities in S. For the text graph, we use an off-the-shelf scene graph parser provided by [1] to obtain a text scene graph from a text.…”
Section: Knowledge Extraction (mentioning)
confidence: 99%
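
The IoU-based graph construction described in this excerpt can be illustrated with a short sketch: compute pairwise IoU over region bounding boxes, keep an edge wherever the score is positive, and reuse the score as the edge similarity. This is a minimal illustration under assumed conventions (boxes given as [x1, y1, x2, y2] lists; helper names iou and build_visual_graph), not code from the cited work.

# Minimal sketch of IoU-based visual graph construction, as described above.
# Assumptions (not from the cited paper): boxes are [x1, y1, x2, y2] lists,
# and the graph is returned as an edge list E with parallel similarities S.

def iou(box_a, box_b):
    # Intersection over Union of two axis-aligned boxes.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def build_visual_graph(boxes):
    # An edge exists iff IoU > 0; the IoU score itself is the edge similarity.
    edges, sims = [], []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            score = iou(boxes[i], boxes[j])
            if score > 0:
                edges.append((i, j))
                sims.append(score)
    return edges, sims

# Example: regions 0 and 1 overlap, region 2 is disjoint.
boxes = [[0, 0, 10, 10], [5, 5, 15, 15], [20, 20, 30, 30]]
E, S = build_visual_graph(boxes)
print(E, S)  # [(0, 1)] [0.14285714285714285]
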
“…Some recent works construct graphs for visual or linguistic elements in V+L tasks, such as VQA [16,27,43], VideoQA [28,30,78], Image Captioning [23,69,75], and Visual Grounding [31,47,68], to reveal relationships between these elements and obtain fine-grained semantic representations. These constructed graphs can be broadly grouped into three types: visual graphs between image objects/regions (e.g., [69]), linguistic graphs between sentence elements/tokens (e.g., [33]), and crossmodal graphs among visual and linguistic elements (e.g., [47]). In this work, we construct the visual graph for X-GGM.…”
Section: Graph Construction in V+L Tasks (mentioning)
confidence: 99%