Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, 2019
DOI: 10.18653/v1/w19-2513

Correcting Whitespace Errors in Digitized Historical Texts

Abstract: Whitespace errors are common to digitized archives. This paper describes a lightweight unsupervised technique for recovering the original whitespace. Our approach is based on count statistics from Google n-grams, which are converted into a likelihood ratio test computed from interpolated trigram and bigram probabilities. To evaluate this approach, we annotate a small corpus of whitespace errors in a digitized corpus of newspapers from the 19th century United States. Our technique identifies and corrects most w…
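
The abstract describes the method at a high level: Google n-gram counts are turned into a likelihood ratio test over interpolated trigram and bigram probabilities, and a whitespace repair is accepted when the repaired reading is sufficiently more likely than the original. The sketch below is an illustrative reconstruction of that idea, not the authors' code: the count tables (unigrams, bigrams, trigrams, keyed by token tuples), the interpolation weight, the smoothing floor, and the decision threshold are all assumptions.

```python
import math

LAMBDA = 0.7      # weight on the trigram estimate (assumed, not from the paper)
THRESHOLD = 0.0   # accept a repair when the log-likelihood ratio exceeds this (assumed)

def relfreq(ngram, counts, history_counts, history):
    """Relative frequency count(ngram) / count(history), with a small floor
    so the logarithm stays finite on unseen n-grams."""
    return max(counts.get(ngram, 0), 0.5) / max(history_counts.get(history, 0), 1.0)

def interp_logprob(tokens, unigrams, bigrams, trigrams):
    """Log-probability of a token sequence under an interpolation of trigram
    and bigram relative frequencies; assumes at least one token of left context."""
    logp = 0.0
    for i in range(1, len(tokens)):
        p_bi = relfreq(tuple(tokens[i - 1:i + 1]), bigrams, unigrams, (tokens[i - 1],))
        if i >= 2:
            p_tri = relfreq(tuple(tokens[i - 2:i + 1]), trigrams, bigrams,
                            tuple(tokens[i - 2:i]))
            logp += math.log(LAMBDA * p_tri + (1 - LAMBDA) * p_bi)
        else:
            logp += math.log(p_bi)
    return logp

def prefer_repair(original, candidate, unigrams, bigrams, trigrams):
    """Return True when the candidate whitespace repair is more likely than
    the original token sequence."""
    ratio = (interp_logprob(candidate, unigrams, bigrams, trigrams)
             - interp_logprob(original, unigrams, bigrams, trigrams))
    return ratio > THRESHOLD
```

For example, prefer_repair(["of", "thepeople"], ["of", "the", "people"], uni, bi, tri) compares the merged and split readings in context; the interpolation weights, smoothing, and threshold actually used are specified in the paper itself, not here.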

Cited by 7 publications (11 citation statements). References 6 publications.

“…Some methods of this type rely on the Google Web 1T n-gram corpus [18] for fixing errors, e.g., [9, 21, 147]. Bassil et al. [9] first identify non-word errors using word unigram frequency.…”
Section: Context-dependent Approaches
confidence: 99%
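
As a rough illustration of the unigram-frequency step attributed to Bassil et al. in the statement above, a token can be flagged as a likely non-word when its Google Web 1T unigram count falls below a cutoff. The table name and the cutoff value below are placeholders, not details taken from that paper.

```python
def flag_nonword_positions(tokens, unigram_counts, min_count=40):
    """Return indices of tokens rare enough in the unigram table to be
    treated as probable OCR non-words (the cutoff is an assumed placeholder)."""
    return [i for i, tok in enumerate(tokens)
            if unigram_counts.get(tok.lower(), 0) < min_count]

# Example: flag_nonword_positions(["the", "peop1e", "voted"], counts) -> [1]
```
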
“…Last, they choose the best alternative for each detected error relying on word 5-gram frequency. Soni et al. [147] concentrate on handling segmentation errors via Google Web 1T n-grams. They determine whether a token should be segmented based on word unigram and bigram probabilities.…”
Section: Context-dependent Approaches
confidence: 99%
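
The segmentation decision summarized here (compare the probability of the unsplit token against the probability of a candidate two-word split) can be sketched as below. The count tables, add-0.5 smoothing, and normalization are illustrative assumptions, not the exact procedure of Soni et al.

```python
def best_split(token, unigram_counts, bigram_counts, total):
    """Return (left, right) if some split point makes the two-word reading
    more probable than keeping the token whole, else None."""
    p_whole = (unigram_counts.get(token, 0) + 0.5) / total
    best, best_p = None, p_whole
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        p_split = (bigram_counts.get((left, right), 0) + 0.5) / total
        if p_split > best_p:
            best, best_p = (left, right), p_split
    return best

# Example: best_split("tobe", uni, bi, total) would typically return ("to", "be").
```
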
“…Our approach to correcting these errors is described in prior work.79 Deduplication. The collection also contains a number of articles that were reprinted verbatim from other newspapers (e.g.…”
Section: Data Processing
confidence: 99%
“…Tokenization errors are typical in texts that are digitized by Optical Character Recognition (OCR) techniques. For example, tokenization errors are known to be frequent in the ACL Anthology corpus (Nastase and Hitschler, 2018) and in digitized newspapers (Soni et al., 2019; Adesam et al., 2019). Many OCR error correction methods cannot deal with tokenization errors, and it is stated in Hämäläinen and Hengchen (2019) that: "A limitation of our approach is that it cannot do word segmentation in case multiple words have been merged together as a result of the OCR process.…
Section: Sources of Tokenization Errors
confidence: 99%