Iterative Answer Prediction With Pointer-Augmented Multimodal Transformers for TextVQA

Hu, Ronghang; Singh, Amanpreet; Darrell, Trevor; Rohrbach, Marcus

doi:10.1109/cvpr42600.2020.01001

Cited by 148 publications (168 citation statements, 168 of them mentioning)

References 28 publications

“…However, this common embedding space has difficulty utilizing the image object features. We observe this by training the M4C (Hu et al, 2020) network without the image object modality. The accuracy is almost unaffected.…”

Section: Context-enriched OCR Representation (mentioning)

confidence: 94%

“…The generated answer could be selected from a fixed answer vocabulary or one of the OCR tokens by the copy module. The copy module is further improved by M4C (Hu et al, 2020) using dynamic pointer network. The M4C also proposes a transformer based network with 3 multi-modal input (question, image object features and OCR features).…”

Section: Text Visual Question Answering (mentioning)

confidence: 99%
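The selection step described in the statement above (choosing the answer from a fixed vocabulary or from the OCR tokens) can be illustrated with a minimal sketch. It is not taken from M4C or LaAP-Net: the function `predict_token`, the plain dot-product scoring, and the toy vectors are all assumptions made for illustration only.

```python
# Hypothetical sketch: pick one answer token from a fixed vocabulary
# or from the OCR tokens detected in an image.
# The names and the dot-product scoring are illustrative assumptions,
# not the implementation used by any of the cited papers.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def predict_token(decoder_state, vocab, vocab_weights, ocr_tokens, ocr_features):
    # One score per fixed-vocabulary word (one weight vector per word).
    vocab_scores = [dot(w, decoder_state) for w in vocab_weights]
    # One pointer score per OCR token; the number of candidates
    # changes with the image, which is why a fixed classifier is not enough.
    ocr_scores = [dot(f, decoder_state) for f in ocr_features]
    # Compare both groups of candidates in a single argmax.
    scores = vocab_scores + ocr_scores
    best = max(range(len(scores)), key=scores.__getitem__)
    if best < len(vocab):
        return vocab[best], "vocab"
    return ocr_tokens[best - len(vocab)], "ocr"

vocab = ["yes", "no"]
vocab_weights = [[0.2, 0.9], [0.1, 0.3]]
ocr_tokens = ["STOP", "Main St"]
ocr_features = [[0.5, 0.1], [0.9, 0.4]]

token, source = predict_token([1.0, 0.0], vocab, vocab_weights, ocr_tokens, ocr_features)
print(token, source)
```

With these toy values the highest score belongs to an OCR token, so the answer is copied from the image text rather than taken from the vocabulary.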

“…Existing work (Hu et al, 2020) builds a common embedding space for all modalities. However, this common embedding space has difficulty utilizing the image object features.…”

Section: Context-enriched OCR Representation (mentioning)

confidence: 99%

“…However, lacking the ability to generate answers based on texts in the image limits its applications. Recently, many new datasets (Biten et al, 2019a; and new methods Hu et al, 2020) are proposed to tackle this challenge and refer it as text VQA.…”

Section: Introduction (mentioning)

confidence: 99%

“…The earliest method for text VQA is LoRRA, which provides an optical character recognition (OCR) module for the VQA input and proposes a dynamic copy mechanism to select the answer from both fixed vocabulary and OCR words. The following work M4C (Hu et al, 2020) inspired by LoRRA, uses rich representations of OCR as input and utilizes dynamic pointer network to deal with out-of-vocabulary answers, leading to state-of-the-art performance. However, M4C simply concatenates all modalities as transformer input and does not consider the high-level interaction among modalities of text VQA.…”

Section: Introduction (mentioning)

confidence: 99%


Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering

Han; Huang; Han (2020)

Proceedings of the 28th International Conference on Computational Linguistics


Image text carries essential information to understand the scene and perform reasoning. Text-based visual question answering (text VQA) task focuses on visual questions that require reading text in images. Existing text VQA systems generate an answer by selecting from optical character recognition (OCR) texts or a fixed vocabulary. Positional information of text is underused and there is a lack of evidence for the generated answer. As such, this paper proposes a localization-aware answer prediction network (LaAP-Net) to address this challenge. Our LaAP-Net not only generates the answer to the question but also predicts a bounding box as evidence of the generated answer. Moreover, a context-enriched OCR representation (COR) for multimodal fusion is proposed to facilitate the localization task. Our proposed LaAP-Net outperforms existing approaches on three benchmark datasets for the text VQA task by a noticeable margin.

