2021
DOI: 10.48550/arxiv.2109.06264
Preprint

Post-OCR Document Correction with large Ensembles of Character Sequence-to-Sequence Models

Abstract: In this paper, we propose a novel method based on character sequence-to-sequence models to correct documents already processed with Optical Character Recognition (OCR) systems. The main contribution of this paper is a set of strategies to accurately process strings much longer than the ones used to train the sequence model while being sample- and resource-efficient, supported by thorough experimentation. The strategy with the best performance involves splitting the input document into character n-grams and combin…
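
As a rough illustration of the splitting step mentioned in the abstract, the following Python sketch cuts a document into overlapping character windows. The function name `char_ngrams`, the `stride` parameter, and the rule for covering the tail of the document are assumptions made for illustration, not the authors' implementation. How the per-window corrections are combined is not shown, because the abstract is cut off at that point.

```python
def char_ngrams(text: str, n: int, stride: int = 1) -> list[tuple[int, str]]:
    """Split text into character windows of length n.

    Returns (start_offset, window) pairs. The offsets are kept so that
    per-window outputs could later be aligned back to the document.
    This is a hypothetical helper, not code from the paper.
    """
    if n <= 0 or stride <= 0:
        raise ValueError("n and stride must be positive")
    # A document no longer than one window is returned as a single piece.
    if len(text) <= n:
        return [(0, text)]
    starts = list(range(0, len(text) - n + 1, stride))
    # Assumption: add a final window so the end of the document is always
    # covered, even when stride does not divide the length evenly.
    if starts[-1] != len(text) - n:
        starts.append(len(text) - n)
    return [(s, text[s:s + n]) for s in starts]


if __name__ == "__main__":
    for offset, window in char_ngrams("Tbe quick hrown fox", n=8, stride=4):
        print(offset, repr(window))
```

With `stride` smaller than `n`, most characters fall inside several windows, which is the property an ensemble-style combination over windows would rely on.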

Cited by 1 publication
References 8 publications