Modernising historical Slovene words

Scherrer, Yves; Erjavec, Tomaž

doi:10.1017/s1351324915000236

Cited by 30 publications

(37 citation statements)

References 28 publications

“…A more recent approach is based on character-based statistical machine translation applied to historical text (Pettersson et al., 2013; Sánchez-Martínez et al., 2013; Scherrer and Erjavec, 2013) or dialectal data (Scherrer and Ljubešić, 2016). This is conceptually very similar to our approach, except that we substitute the classical SMT algorithms for neural networks.…”

Section: Related Work (mentioning)

confidence: 99%

Learning attention for historical text normalization by learning to pronounce

Bollmann, Bingel, Søgaard, 2017

Proceedings of the 55th Annual Meeting of the Association For Computational Linguistics (Volume 1: Long Papers)

Automated processing of historical texts often relies on pre-normalization to modern word forms. Training encoder-decoder architectures to solve such problems typically requires a lot of training data, which is not available for the named task. We address this problem by using several novel encoder-decoder architectures, including a multi-task learning (MTL) architecture using a grapheme-to-phoneme dictionary as auxiliary data, pushing the state of the art by an absolute 2% increase in performance. We analyze the induced models across 44 different texts from Early New High German. Interestingly, we observe that, as previously conjectured, multi-task learning can learn to focus attention during decoding, in ways remarkably similar to recently proposed attention mechanisms. This, we believe, is an important step toward understanding how MTL works.

“…We experiment with both SMT and NMT implementations as contrastive methods. For our SMT pipeline, we employ a fairly standard array of tools, and set their parameters similarly to Scherrer and Erjavec (2013) and Scherrer and Ljubešić (2016). For alignment, we use MGIZA (Gao and Vogel, 2008) with grow-diag-final-and symmetrization.…”

Section: Experiments and Results (mentioning)

confidence: 99%

Normalizing Non-canonical Turkish Texts Using Machine Translation Approaches

Çolakoğlu, Sulubacak, Tantuğ, 2019

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

With the growth of the social web, user-generated text data has reached unprecedented sizes. Non-canonical text normalization provides a way to exploit this as a practical source of training data for language processing systems. The state of the art in Turkish text normalization is composed of a token-level pipeline of modules, heavily dependent on external linguistic resources and manually-defined rules. Instead, we propose a fully-automated, context-aware machine translation approach with fewer stages of processing. Experiments with various implementations of our approach show that we are able to surpass the current best-performing system by a large margin.

“…Internal variation in the data is only dealt with indirectly by mapping the non-standard types to a corresponding standard type. Hence, it resembles a translation task, a framework in which normalization has been approached (Kobus et al., 2008; Scherrer and Erjavec, 2016). The task of detecting spelling variants shifts the attention towards the internal variation and resembles an information retrieval task where the aim is to detect unordered pairs of types like GML {jc, ik} which are used to realize the same morphological word.…”

Section: Related Work (mentioning)

confidence: 99%

Detecting spelling variants in non-standard texts

Barteld, 2017

Proceedings of the Student Research Workshop at the 15th Conference Of the European Chapter of the Association for Co

Spelling variation in non-standard language, e.g. computer-mediated communication and historical texts, is usually treated as a deviation from a standard spelling, e.g. 2mr as a non-standard spelling for tomorrow. Consequently, in normalization (the standard approach to dealing with spelling variation), so-called non-standard words are mapped to their corresponding standard words. However, there is not always a corresponding standard word. This can be the case for single types (like emoticons in computer-mediated communication) or a complete language, e.g. texts from historical languages that did not develop a standard variety. The approach presented in this thesis proposal deals with spelling variation in the absence of reference to a standard. The task is to detect pairs of types that are variants of the same morphological word. An approach for spelling-variant detection is presented, where pairs of potential spelling variants are generated with Levenshtein distance and subsequently filtered by supervised machine learning. The approach is evaluated on historical Low German texts. Finally, further perspectives are discussed.
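
The candidate-generation step mentioned in this abstract can be illustrated with a minimal sketch. It is not the author's implementation: the function names, the distance threshold of 1, and the example word types are assumptions chosen for illustration, and the supervised filtering step is not shown.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def candidate_pairs(types, max_distance=1):
    """Return unordered pairs of distinct types within max_distance edits.

    max_distance is a hypothetical threshold; the abstract does not state one.
    """
    unique_types = sorted(set(types))
    return {
        frozenset((first, second))
        for first, second in combinations(unique_types, 2)
        if levenshtein(first, second) <= max_distance
    }


# Hypothetical word types, used only to exercise the functions.
pairs = candidate_pairs(["vnde", "unde", "hus"], max_distance=1)
```

Comparing all pairs is quadratic in the number of types, so a real vocabulary would likely need an index or a pruning step before this comparison. The pairs produced here would then be passed to a classifier, as the abstract describes.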
