2022
DOI: 10.48550/arxiv.2205.12029
Preprint

VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification

Abstract: Multimodal learning from document data has achieved great success lately, as it allows pre-training semantically meaningful features as a prior for a learnable downstream approach. In this paper, we approach the document classification problem by learning cross-modal representations through language and vision cues, considering intra- and inter-modality relationships.

Cited by 1 publication (1 citation statement)
References 28 publications
“…Image-to-image translation for document images is different from natural scene images due to the presence of textual content in addition to the visual structure in the images [3,8]. Effectively enhancing these images requires not only the elimination of the background noise but doing so without losing the textual content.…”
Section: Introduction
Confidence: 99%