Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2019)
DOI: 10.18653/v1/n19-1389

A Large-Scale Comparison of Historical Text Normalization Systems

Abstract: There is no consensus on the state-of-the-art approach to historical text normalization. Many techniques have been proposed, including rule-based methods, distance metrics, character-based statistical machine translation, and neural encoder-decoder models, but studies have used different datasets, different evaluation methods, and have come to different conclusions. This paper presents the largest study of historical text normalization done so far. We critically survey the existing literature and report experim…

Cited by 51 publications (47 citation statements) · References 23 publications (26 reference statements)
“…First, inputs can be extremely noisy, with errors that do not resemble tweet misspellings or speech-transcription hesitations, for which adapted approaches have already been devised [5,27]. Second, the language under study is mostly of earlier stage(s), which renders the usual external and internal evidence less effective (e.g., the use of different naming conventions and the presence of historical spelling variations) [2,3]. Further, besides historical VIPs, texts from the past contain rare entities which have undergone significant changes (esp.…”
Section: Motivation and Objectives
Confidence: 99%
“…In order to adapt text written in standard Finnish to dialects, we train several different models on the data set. As a character-level sequence-to-sequence neural machine translation (NMT) approach has proven successful in the past for the opposite problem of normalizing a dialectal or historical language variant to the standard language (see Bollmann 2019; Veliz, De Clercq, and Hoste 2019; Hämäläinen and Hengchen 2019), we approach the problem from a similar character-based methodology. The advantage of character-level models over word-level models is their adaptability to out-of-vocabulary words; a requirement which must be satisfied for our experiments to be successful.…”
Section: Automatic Dialect Adaptation
Confidence: 99%
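The character-level setup the quote describes treats each word as a sequence of characters rather than an atomic token, which is what lets the model handle out-of-vocabulary spellings. A minimal sketch of the preprocessing side only (the spelling pairs, special tokens, and padding scheme below are illustrative assumptions, not the cited authors' implementation):

```python
# Sketch of character-level preprocessing for a seq2seq normalization model.
# The (variant, standard) word pairs are invented examples, not data from
# the cited papers.
pairs = [("vnto", "unto"), ("loue", "love"), ("olde", "old")]

# Shared character vocabulary with special tokens (an assumed convention).
PAD, SOS, EOS = "<pad>", "<s>", "</s>"
chars = sorted({c for src, tgt in pairs for c in src + tgt})
vocab = {tok: i for i, tok in enumerate([PAD, SOS, EOS] + chars)}

def encode(word, max_len):
    """Map a word to a fixed-length list of character ids."""
    ids = [vocab[SOS]] + [vocab[c] for c in word] + [vocab[EOS]]
    return ids + [vocab[PAD]] * (max_len - len(ids))

# Longest word plus room for SOS/EOS markers.
max_len = max(len(w) for p in pairs for w in p) + 2
encoded = [(encode(s, max_len), encode(t, max_len)) for s, t in pairs]
print(len(vocab), encoded[0][0])
```

These id sequences would then feed an encoder-decoder network; because the vocabulary is over characters, an unseen spelling like "vppon" still encodes without any `<unk>` token, which is the adaptability advantage the quote refers to.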
“…Most important for our task is dialectal text normalization, but for the sake of thoroughness we discuss the related work in a somewhat wider context. Bollmann [5] has provided a meta-analysis in which contemporary approaches are divided into five categories: substitution lists like VARD [22] and Norma [4], rule-based methods [2,21], edit-distance-based approaches [1,12], statistical methods, and, most recently, neural methods.…”
Section: Related Work
Confidence: 99%
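Of the five categories listed, the edit-distance family is the simplest to make concrete: normalize a historical spelling by choosing the closest entry in a modern lexicon. A minimal sketch under stated assumptions (the tiny lexicon and example spellings are invented for illustration; real systems like those surveyed use much larger lexicons and weighted edit costs):

```python
# Sketch of edit-distance-based normalization: map a historical spelling
# to its nearest neighbour in a modern lexicon. Lexicon entries and test
# words are illustrative, not from the surveyed datasets.

def levenshtein(a, b):
    """Classic dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def normalize(word, lexicon):
    """Return the lexicon entry with the smallest edit distance."""
    return min(lexicon, key=lambda w: levenshtein(word, w))

lexicon = ["love", "unto", "old", "world"]
print(normalize("loue", lexicon))   # -> "love"
print(normalize("vnto", lexicon))   # -> "unto"
```

Unweighted distance ties are broken arbitrarily here (first lexicon entry wins); the cited approaches [1,12] refine exactly this step, e.g. by learning character-specific edit weights from training pairs.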