ICDAR 2019 Competition on Scene Text Visual Question Answering

Biten, Ali Furkan; Tito, Rubèn; Mafla, Andrés; Gómez, Lluís; Rusiñol, Marçal; Mathew, Minesh; Jawahar, C. V.; Valveny, Ernest; Karatzas, Dìmosthenis

doi:10.1109/icdar.2019.00251

Cited by 34 publications

(22 citation statements)

References 23 publications


“…If not observe carefully, it's rather easy to obtain the wrong answer 2 instead of 3. The reasons for this error include object occlusion, near and far degrees, and the limitation (Biten et al, 2019), which requires to recognize the numbers, symbols and proper nouns in a scene. In Figure 7(c), subjective judgment is needed to answer the question is this man happy.…”

Section: Qualitative Analysis

mentioning

confidence: 99%

“…In Figure 7(b), the question what time should you pay can be answered by recognizing the text semantic understanding in the image. Text semantic understanding belongs to another task, namely text visual question answering (Biten et al, 2019), which requires to recognize the numbers, symbols and proper nouns in a scene. In Figure 7(c), subjective judgment is needed to answer the question is this man happy.…”

mentioning

confidence: 99%


Aligned Dual Channel Graph Convolutional Network for Visual Question Answering

Huang, Ji, Cai et al. 2020

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics


Visual question answering aims to answer the natural language question about a given image. Existing graph-based methods only focus on the relations between objects in an image and neglect the importance of the syntactic dependency relations between words in a question. To simultaneously capture the relations between objects in an image and the syntactic dependency relations between words in a question, we propose a novel dual channel graph convolutional network (DC-GCN) for better combining visual and textual advantages. The DC-GCN model consists of three parts: an I-GCN module to capture the relations between objects in an image, a Q-GCN module to capture the syntactic dependency relations between words in a question, and an attention alignment module to align image representations and question representations. Experimental results show that our model achieves comparable performance with the state-of-the-art approaches.


“…Scene text image recognition aims to recognize the text characters from the input image, which is an important computer vision task that involves text information processing. It has been widely used in text retrieval [25], sign recognition [17], license plate recognition [35] and other scene-text-based image understanding tasks [6,34]. However, due to the various issues such as low sensor resolution, blurring, poor illumination, etc., the quality of captured scene text images may not be good enough, which brings many difficulties to scene text recognition in practice.…”

Section: Introduction

mentioning

confidence: 99%

Text Prior Guided Scene Text Image Super-resolution

Ma, Guo, Zhang 2021

Preprint


Scene text image super-resolution (STISR) aims to improve the resolution and visual quality of low-resolution (LR) scene text images, and consequently boost the performance of text recognition. However, most existing STISR methods regard text images as natural scene images, ignoring the categorical information of text. In this paper, we make an inspiring attempt to embed categorical text prior into STISR model training. Specifically, we adopt the character probability sequence as the text prior, which can be obtained conveniently from a text recognition model. The text prior provides categorical guidance to recover high-resolution (HR) text images. On the other hand, the reconstructed HR image can refine the text prior in return. Finally, we present a multi-stage text prior guided super-resolution (TPGSR) framework for STISR. Our experiments on the benchmark TextZoom dataset show that TPGSR can not only effectively improve the visual quality of scene text images, but also significantly improve the text recognition accuracy over existing STISR methods. Our model trained on TextZoom also demonstrates certain generalization capability to the LR images in other datasets.


“…In recent years, research topics around scene text have been very active [37,43,47]. Scene text-related research plays a very important role in many computer vision tasks [3,48]. However, imperfect imaging conditions often hinder the progress of these fields.…”

Section: Introduction

mentioning

confidence: 99%

Scene Text Image Super-Resolution via Parallelly Contextual Attention Network

Zhao, Feng, Zhao et al. 2021

Proceedings of the 29th ACM International Conference on Multimedia


Optical degradation blurs text shapes and edges, so existing scene text recognition methods have difficulties in achieving desirable results on low-resolution (LR) scene text images acquired in real-world environments. The above problem can be solved by efficiently extracting sequential information to reconstruct super-resolution (SR) text images, which remains a challenging task. In this paper, we propose a Parallelly Contextual Attention Network (PCAN), which effectively learns sequence-dependent features and focuses more on high-frequency information of the reconstruction in text images. Firstly, we explore the importance of sequence-dependent features in horizontal and vertical directions parallelly for text SR, and then design a parallelly contextual attention block to adaptively select the key information in the text sequence that contributes to image super-resolution. Secondly, we propose a hierarchically orthogonal texture-aware attention module and an edge guidance loss function, which can help to reconstruct high-frequency information in text images. Finally, we conduct extensive experiments on TextZoom dataset, and the results can be easily incorporated into mainstream text recognition algorithms to further improve their performance in LR image recognition. Besides, our approach exhibits great robustness in defending against adversarial attacks on seven mainstream scene text recognition datasets, which means it can also improve the security of the text recognition pipeline. Compared with directly recognizing LR images, our method can respectively improve the recognition accuracy of ASTER, MORAN,

