Findings of the Association for Computational Linguistics: EMNLP 2021 (2021)
DOI: 10.18653/v1/2021.findings-emnlp.356

Cleaning Dirty Books: Post-OCR Processing for Previously Scanned Texts

Abstract: Substantial amounts of work are required to clean large collections of digitized books for NLP analysis, both because of the presence of errors in the scanned text and the presence of duplicate volumes in the corpora. In this paper, we consider the issue of deduplication in the presence of optical character recognition (OCR) errors. We present methods to handle these errors, evaluated on a collection of 19,347 texts from the Project Gutenberg dataset and 96,635 texts from the HathiTrust Library. We demonstrate…

Citations: cited by 2 publications (2 citation statements)
References: 27 publications
“…There is a growing interest in computational narrative analysis, ranging from analyzing the structure of narratives (Kim et al, , 2021, identifying important events in stories Keller, 2020, 2021; Papalampidi et al, 2020; Otake et al, 2020) to analyzing the relationship between characters in novels (Iyyer et al, 2016; Xanthos et al, 2016; Skorinkin, 2017; Azab et al, 2019; Labatut and Bost, 2019; Kubis, 2021; Brahman et al, 2021). The most relevant work to ours is Azab et al (2019), who apply word2vec (Mikolov et al, 2013) to learn character embeddings from movie scripts.…”
Section: Related Work (mentioning)
confidence: 99%
“…Analysis of books have been streamlined through pipelines like BookNLP (Bamman et al, 2014) as well as datasets of entities (Bamman et al, 2019). We also find other book-related works that improve the quality of books as well as understanding aspects such as time (Kim et al, , 2021. In particular, we focus on chapters and require proper chapter segmentation .…”
Section: Related Work (mentioning)
confidence: 99%