2022
DOI: 10.48550/arxiv.2203.08504
Preprint

A Survey of Historical Document Image Datasets

Abstract: This paper presents a systematic literature review of image datasets for document image analysis, focusing on historical documents, such as handwritten manuscripts and early prints. Finding appropriate datasets for historical document analysis is a crucial prerequisite to facilitate research using different machine learning algorithms. However, because of the very large variety of the actual data (e.g., scripts, tasks, dates, support systems, and amount of deterioration), the different formats for data and lab…

Cited by 3 publications (2 citation statements)
References 129 publications
“…Consequently, developing a single system for analyzing and recognizing historical documents is improbable. The work in [223] has classified over 60 historic documents into the Classification, Structure and Analysis dataset.
The [227]; Marriage licence books (Hw) (Spanish); HR; -
READ-BAD [228]; European archives (Roman) (Pr); LD; 1470-1930
DIVA-HisDB [229]; 3 Roman Medival scripts (Hw); DLA; 11-14th cen
VML-HD [230]; Arabic books (Hw); Rec; 1088-1451
Pinkas; Hebrew scripts; PS; 1500-1800
Kuzushiji-MNIST; Kuzushiji characters (Pr); CR; -
GRK-Papyri [231]; Papyri scripts; WI; -
Lontar Sunda; Sudanese Palm scripts (Hw); Binz, HR, LD; 15th cen Sunda
AMADI LontarSet [232]; Balinese Palm scripts (Hw); Binz, HR; -
Muscima++ [233]; Music; HR
IAM-HistDB (St-Gall) [234]; Roman scripts (Hw); DLA, Rec; 9th cen
IAM-HistDB (Parzival) [234]; German scripts (Hw); DLA, Rec; 13th cen
IAM-HistDB (Washington) [234]; English scripts (Hw); DLA, Rec; 18th cen
HJ Dataset [235]; Japanese Biography scans (Hw, Pr); IR
ARDIS [236]; Swedish Digit (Hw); DR; 18-19th
documents like business letters or government PDFs, or medical documents like Diagnostic reports, pathology papers etc.…”
Section: A Historical Document Datasets (mentioning)
confidence: 99%
“…Several word embeddings are believed to be true in classifying the document [4]. As words cannot be relevant in all contexts, except they are perceived with dictionary meaning and related transitives, a word in a sentence will have the exact meaning in the context.…”
Section: Sentence Vector (mentioning)
confidence: 99%