“…Over the years, researchers have proposed normalization methods based on rules and/or edit distances (Baron and Rayson, 2008; Bollmann, 2012; Hauser and Schulz, 2007; Bollmann et al., 2011; Pettersson et al., 2013a; Mitankin et al., 2014; Pettersson et al., 2014), statistical machine translation (Pettersson et al., 2013b; Scherrer and Erjavec, 2013), and most recently neural network models (Bollmann and Søgaard, 2016; Bollmann et al., 2017; Korchagina, 2017). However, most of these systems have been developed and tested on a single language (or even a single corpus), and many have not been compared to the naïve but strong baseline that only changes words seen in the training data, normalizing each to its most frequent modern form observed during training.…”
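The memorization baseline described above is simple enough to sketch directly. The snippet below is a minimal illustration, assuming the training data comes as (historical, modern) word pairs; the function and variable names are ours, not from any of the cited systems.

```python
from collections import Counter, defaultdict

def train_baseline(pairs):
    """Learn the most frequent modern form for each historical word.

    pairs: iterable of (historical_word, modern_word) training pairs.
    """
    counts = defaultdict(Counter)
    for hist, modern in pairs:
        counts[hist][modern] += 1
    # Keep only the single most frequent normalization per seen word.
    return {hist: c.most_common(1)[0][0] for hist, c in counts.items()}

def normalize(lookup, tokens):
    """Replace seen words by their learned form; leave unseen words unchanged."""
    return [lookup.get(tok, tok) for tok in tokens]

# Illustrative toy data (not from any real corpus):
lookup = train_baseline([("vnto", "unto"), ("vnto", "unto"),
                         ("vnto", "vnt"), ("ye", "the")])
print(normalize(lookup, ["vnto", "ye", "king"]))
```

Because out-of-vocabulary words pass through untouched, the baseline is strong whenever the test vocabulary overlaps heavily with training, which is exactly why comparisons against it are informative.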