Proceedings of the 10th Linguistic Annotation Workshop, Held in Conjunction with ACL 2016 (LAW-X 2016)
DOI: 10.18653/v1/w16-1711
Evaluating Inter-Annotator Agreement on Historical Spelling Normalization

Abstract: This paper deals with means of evaluating inter-annotator agreement for a normalization task. This task differs from common annotation tasks in two important aspects: (i) the class of labels (the normalized wordforms) is open, and (ii) annotations can match to different degrees. We propose a new method to measure inter-annotator agreement for the normalization task. It integrates common chance-corrected agreement measures, such as Fleiss's κ or Krippendorff's α. The novelty of our proposed method lies in the wa…
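The abstract builds on chance-corrected agreement measures such as Fleiss's κ. As background, here is a minimal sketch of the standard Fleiss's κ computation (not the paper's extended method); the rating matrix is hypothetical, with rows as items (tokens), columns as candidate normalized forms, and cells counting how many annotators chose that form.

```python
def fleiss_kappa(matrix):
    """matrix[i][j] = number of raters assigning item i to category j.
    Assumes the same number of raters for every item."""
    n_items = len(matrix)
    n_raters = sum(matrix[0])
    total = n_items * n_raters

    # Mean per-item observed agreement P_i
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in matrix
    ) / n_items

    # Expected agreement by chance, from marginal category proportions
    p_e = sum(
        (sum(row[j] for row in matrix) / total) ** 2
        for j in range(len(matrix[0]))
    )
    return (p_bar - p_e) / (1 - p_e)

# Three annotators normalizing four historical tokens into one of
# three candidate modern wordforms (hypothetical counts):
ratings = [
    [3, 0, 0],  # full agreement
    [2, 1, 0],
    [0, 3, 0],
    [1, 1, 1],  # full disagreement
]
print(round(fleiss_kappa(ratings), 3))  # → 0.268
```

Note that plain κ treats the label set as closed and agreement as all-or-nothing; the paper's contribution is precisely to relax these assumptions for open-class, partially matching normalizations.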


Cited by 3 publications (4 citation statements)
References 7 publications
“…The issues related to normalization and annotation are equally applicable to the use of corpora in historical linguistics, sociolinguistics, dialectology, and, in a somewhat different way, language typology. In historical linguistics, token normalization (Azawi, Afzal, & Breuel, ; Bollmann, Dipper, & Petran, ; Bollmann, Petran, & Dipper, ; Jurish, ), sentence segmentation (Petran, ), and extensions of POS tagsets (Dipper et al., ) are actively discussed, which should support fruitful cross‐disciplinary insight for the analysis of learner corpora.…”
mentioning
confidence: 99%
“…For building block applications, adequate text preprocessing is necessary to leverage these NLP building blocks to their full potential (Thanaki 2017;Sarkar 2019), while improper choices in text preprocessing can hinder their performance (Reber 2019). For example, the accuracy of POS tagging can generally be improved through spelling normalization (Schuur 2020), especially in historical texts where archaic word forms are mapped to modern ones in the POS training database (Bollmann 2013). NER can benefit from the detection of multiword expressions, since an entity often contains more than one word (Tan and Pal 2014;Nayel et al 2019).…”
Section: NLP Application Types
mentioning
confidence: 99%
“…This shows that punctuation provides grammatical information to POS tagging (Olde et al 1999). Note that inconsistent use of punctuation can be worse than no punctuation (Bollmann 2013), and in this case, discarding punctuation is preferable. Furthermore, using punctuation to separate text into shorter strings is helpful in machine translation, especially for long and complicated sentences (Yin et al 2007).…”
Section: Separating Punctuation From Strings
mentioning
confidence: 99%
“…Bollmann (2013) showed that even a very small amount of training data (250 manually normalised tokens) significantly raises the accuracy of PoS tagging (approximately 46% on a 15th-century German manuscript), indicating that the approach is especially useful for less-resourced language variants and that the process may be quite cost-effective.…”
Section: Introduction
mentioning
confidence: 99%