Spelling variation in non-standard language, e.g. in computer-mediated communication and historical texts, is usually treated as a deviation from a standard spelling, e.g. 2mr as a non-standard spelling of tomorrow. Consequently, in normalization, the standard approach to dealing with spelling variation, so-called non-standard words are mapped to their corresponding standard words. However, there is not always a corresponding standard word. This can be the case for single types (like emoticons in computer-mediated communication) or for a whole language, e.g. texts from historical languages that did not develop into a standard variety. The approach presented in this thesis proposal deals with spelling variation in the absence of a standard reference. The task is to detect pairs of types that are variants of the same morphological word. An approach for spelling-variant detection is presented in which pairs of potential spelling variants are generated with Levenshtein distance and subsequently filtered by supervised machine learning. The approach is evaluated on historical Low German texts. Finally, further perspectives are discussed.
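A minimal sketch of the candidate-generation step described in this abstract: pairs of types within a small Levenshtein distance of each other are proposed as potential spelling variants, which a trained classifier would then filter. The toy vocabulary, the distance threshold, and the omitted classifier are assumptions for illustration, not the thesis's actual setup.

```python
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance (insert/delete/substitute, cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def candidate_pairs(vocabulary, max_dist=2):
    """Propose potential spelling-variant pairs: all type pairs within max_dist edits."""
    for a, b in combinations(sorted(vocabulary), 2):
        if abs(len(a) - len(b)) <= max_dist and levenshtein(a, b) <= max_dist:
            yield a, b

# Toy vocabulary of (hypothetical) Low German-like spelling variants.
vocab = {"unde", "vnde", "vnnde", "stadt", "stat", "boek"}
pairs = list(candidate_pairs(vocab))
# In the approach sketched above, a supervised classifier trained on labelled
# pairs would now filter these candidates; here we only print them.
print(pairs)
```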
This paper describes our contribution to two challenges in data-driven lemmatization. We approach lemmatization as a two-stage process in which lemma candidates are first generated and a ranker then chooses the most probable lemma among them. The first challenge is that morphologically rich languages such as Modern German can feature morphological changes of different kinds, in particular word-internal modification. This makes generating the correct lemma a harder task than merely removing suffixes (stemming). The second challenge we address is spelling variation as it appears in non-standard texts. We experiment with different generators specifically tailored to these two challenges. In an oracle setting, we show that our methods for generating lemma candidates allow a possible increase in lemmatization accuracy of 14% on Middle Low German, a group of historical German dialects (1200–1650 AD). Using a log-linear model to choose the correct lemma from the candidate set, we obtain an actual increase of 5.56%.
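A schematic illustration of the generate-and-rank setup described in this abstract. The suffix rules, the lemma lexicon, the features, and the hand-set weights are invented for illustration; the paper's actual generators and its trained log-linear ranker are considerably more elaborate.

```python
# Hypothetical suffix-replacement rules (a strong simplification of the paper's generators):
# each rule maps a word-final string to possible lemma endings.
SUFFIX_RULES = {
    "en": ["en", "e", ""],
    "es": ["", "e"],
    "e":  ["e", ""],
}

LEMMA_LEXICON = {"dag", "bok", "stad"}  # toy list of known lemmas (an assumption)

def generate_candidates(word: str) -> set[str]:
    """Stage 1: propose lemma candidates, including the word itself (identity candidate)."""
    candidates = {word}
    for suffix, replacements in SUFFIX_RULES.items():
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            candidates.update(stem + r for r in replacements)
    return candidates

def score(word: str, candidate: str, weights: dict[str, float]) -> float:
    """Stage 2: a toy linear score over three simple features; the actual ranker
    uses a log-linear model with many more features and learned weights."""
    features = {
        "identity": float(word == candidate),
        "length_diff": float(abs(len(word) - len(candidate))),
        "in_lexicon": float(candidate in LEMMA_LEXICON),
    }
    return sum(weights[name] * value for name, value in features.items())

def lemmatize(word: str, weights: dict[str, float]) -> str:
    """Pick the highest-scoring candidate as the lemma."""
    return max(generate_candidates(word), key=lambda c: score(word, c, weights))

# Hand-set weights standing in for trained parameters.
weights = {"identity": 0.5, "length_diff": -0.2, "in_lexicon": 2.0}
print(lemmatize("dages", weights))  # -> "dag" with this toy rule set and lexicon
```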
Sentence-internal capitalization of nouns is a characteristic of written Standard German. The sixteenth and seventeenth centuries have been identified as the crucial period for the development of this graphemic convention. Previous studies have shown that animacy played a major role in the spread of sentence-internal capitalization. On the basis of the transregional SiGS corpus, consisting of 18 witch-trial protocols (hand)written between 1588 and 1630, we propose word frequency as an additional factor and test for its interaction with animacy. Our data reveal that the proportion of capitalized words denoting humans and animate concepts increases rapidly, while the capitalization of lexemes referring to concrete and abstract concepts remains stable at a lower level. A binomial mixed-effects model shows a highly significant effect of frequency and a significant interaction between frequency and animacy. In sum, our data show how cognitive, pragmatic, and usage factors conspire in the gradual emergence of a graphemic convention. We therefore argue that the previously neglected graphemic dimension can add important insights to an empirically based theory of the language-cognition interface.
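A minimal sketch of the model structure named in this abstract (capitalization as a function of frequency, animacy, and their interaction). The study fits a binomial mixed-effects model; the sketch below fits only the fixed-effects part with ordinary logistic regression, and the data file and column names are assumptions for illustration.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data frame, one row per noun token, with assumed columns:
# capitalized: 1 if the token is written with an initial capital, else 0
# log_freq:    log-transformed corpus frequency of the lexeme
# animacy:     animacy class, e.g. "human", "animate", "concrete", "abstract"
df = pd.read_csv("sigs_tokens.csv")

# Fixed-effects structure of the model described in the abstract:
# capitalization ~ frequency * animacy. The original study additionally includes
# random effects (e.g. per lexeme or protocol), omitted here for brevity.
model = smf.logit("capitalized ~ log_freq * C(animacy)", data=df).fit()
print(model.summary())
```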
The Referenzkorpus Mittelniederdeutsch / Niederrheinisch (1200–1650) (ReN for short) contains Middle Low German and Lower Rhenish manuscripts, prints, and inscriptions that have been diplomatically transcribed, lemmatized, and grammatically annotated, and that can be accessed, among other ways, through the search and visualization tool ANNIS. With these data, the ReN provides the basis for analyses at different linguistic levels and thus makes a decisive contribution to the compilation of a new scholarly grammar of Middle Low German. To gain insight into the actual linguistic facts and the distribution of syntactic phenomena, analyses in a large, structured corpus such as the ReN are indispensable. Using two examples, the wēsen periphrasis and the tô-infinitive, we show how the ReN can be used for syntactic analyses by means of search queries in ANNIS.