2022
DOI: 10.48550/arxiv.2205.12029
Preprint

VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification

Abstract: Multimodal learning from document data has achieved great success lately, as it allows pre-training semantically meaningful features as a prior for a learnable downstream approach. In this paper, we approach the document classification problem by learning cross-modal representations through language and vision cues, considering intra- and inter-modality relationships.

Cited by 1 publication (1 citation statement)
References 28 publications
“…Image-to-image translation for document images is different from natural scene images due to the presence of textual content in addition to the visual structure in the images [3,8]. Effectively enhancing these images requires not only the elimination of the background noise but doing so without losing the textual content.…”
Section: Introduction
Confidence: 99%