2020
DOI: 10.1007/978-3-030-58536-5_44
TextCaps: A Dataset for Image Captioning with Reading Comprehension

Cited by 141 publications (146 citation statements)
References 25 publications
“…Dognin et al (2020) recently discussed their winning entry to the VizWiz Grand Challenge. In addition, Sidorov et al (2020) introduced a model that has been shown to gain significant performance improvements by using OCR tokens. We intend to compare our model with these and improve our work based on the observations made.…”
Section: Discussion (confidence: 99%)
“…It has also been used in image captioning to aid learning novel objects (Yao et al, 2017; Li et al, 2019). Also, Sidorov et al (2020) introduced an M4C model that recognizes text, relates it to its visual context, and decides what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities such as objects.…”
Section: Created (confidence: 99%)
“…(3) Text-based reading comprehension. TextCaps [49] and text-based VQA [50,3] introduce new vision-and-language tasks, which require recognizing text, relating it to its visual context, and performing semantic and visual reasoning between multiple text tokens and visual entities, such as objects. Similarly, there are many application demands for video text understanding across various industries and in our daily lives.…”
Section: Link To Other Video-and-Language Applications (confidence: 99%)
“…The key interest of this dataset is detecting and annotating text generation errors from PLMs. Therefore, it is different from conventional text generation datasets (e.g., Multi-News (Fabbri et al, 2019), TextCaps (Sidorov et al, 2020)) that are constructed to train models to learn text generation (e.g., generating texts from images or long documents). It is also different from grammatical error correction (GEC) datasets (Zhao et al, 2018; Flachs et al, 2020) that are built from human-written texts, usually by second language learners.…”
Section: Introduction (confidence: 99%)