Proceedings - Natural Language Processing in a Deep Learning World 2019
DOI: 10.26615/978-954-452-056-4_051

From the Paft to the Fiiture: a Fully Automatic NMT andWord Embeddings Method for OCR Post-Correction

Abstract: Many historical corpora suffer from errors introduced by the OCR (optical character recognition) methods used in the digitization process. Correcting these errors manually is time-consuming, and most automatic approaches to date have relied on rules or supervised machine learning. We present a fully automatic, unsupervised way of extracting parallel data for training a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction.
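The core idea in the abstract, pairing noisy OCR tokens with their clean counterparts to build character-level training data without manual annotation, can be sketched roughly as follows. This is a minimal illustration, not the paper's actual pipeline: the full method also uses word-embedding similarity to propose candidates, while this sketch keeps only an edit-distance filter, and all names are illustrative.

```python
# Hypothetical sketch: pair each noisy OCR token with a clean vocabulary
# word within a small edit distance, yielding (noisy, clean) pairs that
# could train a character-level seq2seq correction model.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def extract_pairs(noisy_tokens, clean_vocab, max_dist=2):
    """Return (noisy, clean) pairs for training a char-level seq2seq model."""
    pairs = []
    for tok in noisy_tokens:
        best = min(clean_vocab, key=lambda w: levenshtein(tok, w))
        if 0 < levenshtein(tok, best) <= max_dist:
            pairs.append((tok, best))
    return pairs

print(extract_pairs(["paft", "fiiture", "past"], ["past", "future", "present"]))
# → [('paft', 'past'), ('fiiture', 'future')]
```

Tokens already in the clean vocabulary (edit distance 0) are skipped, since they carry no correction signal for the model.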

Cited by 22 publications (26 citation statements). References 15 publications.
“…Their findings suggest that post-processing is the most effective way of improving a character level NMT normalization model. The same method has been successfully applied in OCR post-correction as well (Hämäläinen and Hengchen, 2019).…”
Section: Related Work
Mentioning confidence: 99%
“…However, LMs directly trained on large OCR'd corpora may still yield robust word vectors. They may even be able to position a word and its badly OCR'd variants nearby in the vector space (Hämäläinen and Hengchen, 2019). In such cases, LMs can be used to identify OCR errors and possibly provide a way to correct systematic OCR errors in a large corpus.…”
Section: Language Models
Mentioning confidence: 99%
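The observation in the statement above, that an embedding model trained on OCR'd text may place a word and its garbled variants near each other in vector space, suggests a simple error-detection heuristic: flag neighbours that are very close in embedding space yet spelled differently. The toy example below illustrates that idea with hand-made vectors; in practice the vectors would come from a word2vec-style model trained on the corpus, and the threshold is an assumption.

```python
# Illustrative toy example (hand-made vectors, not a trained model):
# a garbled OCR variant is detected as a near-identical embedding
# neighbour of the correctly spelled word.
import math

toy_vectors = {
    "past":    [0.90, 0.10, 0.00],
    "paft":    [0.88, 0.12, 0.02],  # garbled variant, near "past"
    "future":  [0.10, 0.90, 0.10],
    "fiiture": [0.12, 0.88, 0.11],  # garbled variant, near "future"
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def ocr_variant_candidates(word, vectors, sim_threshold=0.99):
    """Neighbours close in embedding space but spelled differently."""
    return [w for w, v in vectors.items()
            if w != word and cosine(vectors[word], v) >= sim_threshold]

print(ocr_variant_candidates("past", toy_vectors))    # → ['paft']
print(ocr_variant_candidates("future", toy_vectors))  # → ['fiiture']
```

A real system would additionally filter candidates by character overlap or frequency, since semantically related words (not only OCR variants) can also have high cosine similarity.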
“…Since the models span several decades, they present an interesting view of words over time, useful for researchers interested in diachronic studies such as culturomics (Michel et al., 2011), semantic change (see Tahmasebi et al. (2018) and Kutuzov et al. (2018) for overviews), and historical research (van Eijnatten & Ros, 2019; Hengchen et al., 2021a; Marjanen et al., 2020). They can also be fed as input to more complex neural networks tackling downstream tasks aimed at historical data, such as OCR post-correction (Hämäläinen & Hengchen, 2019; Duong et al., 2020), or more linguistics-oriented problems (Budts, 2020). Since we release the whole models and not solely the learned vectors, these models can be further trained and specialised, or used by NLP researchers to compare different space-alignment procedures.…”
Section: Reuse Potential
Mentioning confidence: 99%