Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, 2019
DOI: 10.18653/v1/w19-2513

Correcting Whitespace Errors in Digitized Historical Texts

Abstract: Whitespace errors are common to digitized archives. This paper describes a lightweight unsupervised technique for recovering the original whitespace. Our approach is based on count statistics from Google n-grams, which are converted into a likelihood ratio test computed from interpolated trigram and bigram probabilities. To evaluate this approach, we annotate a small corpus of whitespace errors in a digitized corpus of newspapers from the 19th century United States. Our technique identifies and corrects most w…
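
The abstract describes the method at a high level: Google n-gram counts are turned into a likelihood ratio test over interpolated trigram and bigram probabilities, and a whitespace repair is accepted when the repaired reading is sufficiently more likely than the original. The sketch below is an illustrative reconstruction of that idea, not the authors' code: the count tables (unigrams, bigrams, trigrams, keyed by token tuples), the interpolation weight, the smoothing floor, and the decision threshold are all assumptions.

```python
import math

LAMBDA = 0.7      # weight on the trigram estimate (assumed, not from the paper)
THRESHOLD = 0.0   # accept a repair when the log-likelihood ratio exceeds this (assumed)

def relfreq(ngram, counts, history_counts, history):
    """Relative frequency count(ngram) / count(history), with a small floor
    so the logarithm stays finite on unseen n-grams."""
    return max(counts.get(ngram, 0), 0.5) / max(history_counts.get(history, 0), 1.0)

def interp_logprob(tokens, unigrams, bigrams, trigrams):
    """Log-probability of a token sequence under an interpolation of trigram
    and bigram relative frequencies; assumes at least one token of left context."""
    logp = 0.0
    for i in range(1, len(tokens)):
        p_bi = relfreq(tuple(tokens[i - 1:i + 1]), bigrams, unigrams, (tokens[i - 1],))
        if i >= 2:
            p_tri = relfreq(tuple(tokens[i - 2:i + 1]), trigrams, bigrams,
                            tuple(tokens[i - 2:i]))
            logp += math.log(LAMBDA * p_tri + (1 - LAMBDA) * p_bi)
        else:
            logp += math.log(p_bi)
    return logp

def prefer_repair(original, candidate, unigrams, bigrams, trigrams):
    """Return True when the candidate whitespace repair is more likely than
    the original token sequence."""
    ratio = (interp_logprob(candidate, unigrams, bigrams, trigrams)
             - interp_logprob(original, unigrams, bigrams, trigrams))
    return ratio > THRESHOLD
```

For example, prefer_repair(["of", "thepeople"], ["of", "the", "people"], uni, bi, tri) compares the merged and split readings in context; the interpolation weights, smoothing, and threshold actually used are specified in the paper itself, not here.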

Cited by 7 publications (11 citation statements). References 6 publications.

“…Some methods of this type rely on the Google Web 1T n-gram corpus [18] for fixing errors, e.g., [9, 21, 147]. Bassil et al. [9] first identify non-word errors using word unigram frequency.…”
Section: Context-dependent Approaches
confidence: 99%
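
As a rough illustration of the unigram-frequency step attributed to Bassil et al. in the statement above, a token can be flagged as a likely non-word when its Google Web 1T unigram count falls below a cutoff. The table name and the cutoff value below are placeholders, not details taken from that paper.

```python
def flag_nonword_positions(tokens, unigram_counts, min_count=40):
    """Return indices of tokens rare enough in the unigram table to be
    treated as probable OCR non-words (the cutoff is an assumed placeholder)."""
    return [i for i, tok in enumerate(tokens)
            if unigram_counts.get(tok.lower(), 0) < min_count]

# Example: flag_nonword_positions(["the", "peop1e", "voted"], counts) -> [1]
```
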
“…Last, they choose the best alternative for each detected error relying on word 5-gram frequency. Soni et al. [147] concentrate on handling segmentation errors via Google Web 1T n-grams. They determine whether a token should be segmented based on word unigram and bigram probabilities.…”
Section: Context-dependent Approaches
confidence: 99%
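
The segmentation decision summarized here (compare the probability of the unsplit token against the probability of a candidate two-word split) can be sketched as below. The count tables, add-0.5 smoothing, and normalization are illustrative assumptions, not the exact procedure of Soni et al.

```python
def best_split(token, unigram_counts, bigram_counts, total):
    """Return (left, right) if some split point makes the two-word reading
    more probable than keeping the token whole, else None."""
    p_whole = (unigram_counts.get(token, 0) + 0.5) / total
    best, best_p = None, p_whole
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        p_split = (bigram_counts.get((left, right), 0) + 0.5) / total
        if p_split > best_p:
            best, best_p = (left, right), p_split
    return best

# Example: best_split("tobe", uni, bi, total) would typically return ("to", "be").
```
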
“…Our approach to correcting these errors is described in prior work.79 Deduplication. The collection also contains a number of articles that were reprinted verbatim from other newspapers (e.g.…”
Section: Data Processing
confidence: 99%
“…Tokenization errors are typical in texts that are digitized by Optical Character Recognition (OCR) techniques. For example, tokenization errors are known to be frequent in the ACL Anthology corpus (Nastase and Hitschler, 2018) and in digitized newspapers (Soni et al., 2019; Adesam et al., 2019). Many OCR error correction methods cannot deal with tokenization errors, and it is stated in Hämäläinen and Hengchen (2019) that: "A limitation of our approach is that it cannot do word segmentation in case multiple words have been merged together as a result of the OCR process.…
Section: Sources of Tokenization Errors
confidence: 99%