Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2019)
DOI: 10.18653/v1/n19-1389

A Large-Scale Comparison of Historical Text Normalization Systems

Abstract: There is no consensus on the state-of-the-art approach to historical text normalization. Many techniques have been proposed, including rule-based methods, distance metrics, character-based statistical machine translation, and neural encoder-decoder models, but studies have used different datasets, different evaluation methods, and have come to different conclusions. This paper presents the largest study of historical text normalization done so far. We critically survey the existing literature and report experim…

Cited by 51 publications (47 citation statements) · References 23 publications (26 reference statements)
“…First, inputs can be extremely noisy, with errors that do not resemble tweet misspellings or speech-transcription hesitations, for which adapted approaches have already been devised [5,27]. Second, the language under study is mostly of earlier stage(s), which renders the usual external and internal evidence less effective (e.g., the use of different naming conventions and the presence of historical spelling variations) [2,3]. Further, besides historical VIPs, texts from the past contain rare entities which have undergone significant changes (esp.…”
Section: Motivation and Objectives
Confidence: 99%
“…In order to adapt text written in standard Finnish to dialects, we train several different models on the data set. As a character-level sequence-to-sequence neural machine translation (NMT) approach has proven successful in the past for the opposite problem of normalizing a dialectal or historical language variant to the standard language (see Bollmann 2019; Veliz, De Clercq, and Hoste 2019; Hämäläinen and Hengchen 2019), we approach the problem from a similar character-based methodology. The advantage of character-level models over word-level models is their adaptability to out-of-vocabulary words; a requirement which must be satisfied for our experiments to be successful.…”
Section: Automatic Dialect Adaptation
Confidence: 99%
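The character-level setup the quote describes treats each word as a sequence of characters rather than an atomic token, which is what lets the model handle out-of-vocabulary spellings. A minimal sketch of the preprocessing side only (the spelling pairs, special tokens, and padding scheme below are illustrative assumptions, not the cited authors' implementation):

```python
# Sketch of character-level preprocessing for a seq2seq normalization model.
# The (variant, standard) word pairs are invented examples, not data from
# the cited papers.
pairs = [("vnto", "unto"), ("loue", "love"), ("olde", "old")]

# Shared character vocabulary with special tokens (an assumed convention).
PAD, SOS, EOS = "<pad>", "<s>", "</s>"
chars = sorted({c for src, tgt in pairs for c in src + tgt})
vocab = {tok: i for i, tok in enumerate([PAD, SOS, EOS] + chars)}

def encode(word, max_len):
    """Map a word to a fixed-length list of character ids."""
    ids = [vocab[SOS]] + [vocab[c] for c in word] + [vocab[EOS]]
    return ids + [vocab[PAD]] * (max_len - len(ids))

# Longest word plus room for SOS/EOS markers.
max_len = max(len(w) for p in pairs for w in p) + 2
encoded = [(encode(s, max_len), encode(t, max_len)) for s, t in pairs]
print(len(vocab), encoded[0][0])
```

These id sequences would then feed an encoder-decoder network; because the vocabulary is over characters, an unseen spelling like "vppon" still encodes without any `<unk>` token, which is the adaptability advantage the quote refers to.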
“…Most important for our task is dialectal text normalization, but for the sake of thoroughness we discuss the related work in a somewhat wider context. Bollmann [5] has provided a meta-analysis in which contemporary approaches are divided into five categories: substitution lists like VARD [22] and Norma [4], rule-based methods [2,21], edit-distance-based approaches [1,12], statistical methods, and, most recently, neural methods.…”
Section: Related Work
Confidence: 99%
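Of the five categories listed, the edit-distance family is the simplest to make concrete: normalize a historical spelling by choosing the closest entry in a modern lexicon. A minimal sketch under stated assumptions (the tiny lexicon and example spellings are invented for illustration; real systems like those surveyed use much larger lexicons and weighted edit costs):

```python
# Sketch of edit-distance-based normalization: map a historical spelling
# to its nearest neighbour in a modern lexicon. Lexicon entries and test
# words are illustrative, not from the surveyed datasets.

def levenshtein(a, b):
    """Classic dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def normalize(word, lexicon):
    """Return the lexicon entry with the smallest edit distance."""
    return min(lexicon, key=lambda w: levenshtein(word, w))

lexicon = ["love", "unto", "old", "world"]
print(normalize("loue", lexicon))   # -> "love"
print(normalize("vnto", lexicon))   # -> "unto"
```

Unweighted distance ties are broken arbitrarily here (first lexicon entry wins); the cited approaches [1,12] refine exactly this step, e.g. by learning character-specific edit weights from training pairs.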