In this study, we use the General Regionally Annotated Corpus of Ukrainian (GRAC, www.uacorpus.org) as a testing ground for stylometric approaches to variationist analysis. In recent years, quantitative methods such as binomial mixed-effects regression models and machine-learning methods such as random forests have gained considerable popularity in corpus linguistics, whereas methods from stylometry have rarely been used for variationist analysis. Using data from GRAC, we show that a stylometric approach can be useful for analyzing the diachronic development of Standard Ukrainian in the 20th century. Our point of departure is the two main variants of Standard Ukrainian used in the interwar period: that of Soviet Ukraine, on the one hand, and that of Western Ukraine, then part of the Polish republic, on the other. We ask: what can stylometry tell us about how these standards differed and about their subsequent fate in the enlarged Soviet Ukraine after WWII? Our analysis shows that certain specifically Western Ukrainian features common during the first decades of the 20th century did not find their way into the post-WWII standard, while others were retained. Moreover, we show that, by and large, stylometry indicates a stronger continuity of the Eastern than of the Western standard. Methodologically, we demonstrate that stylometry can be used as a tool to start corpus-linguistic research from a bird's-eye view and in an inductive manner, without formulating any hypotheses about particular variables, and later to zoom in on hitherto unknown variables representing regional or diachronic differences.
Quantitative, corpus-based research on spontaneous spoken Carpathian Rusyn raises several data-related problems: speakers use ambivalent forms at different rates, which results in a biased data set, whereas a stricter data-cleaning process would lead to large-scale data loss. On top of that, polytomous categorical dependent variables are hard to analyze because of methodological limitations. This paper presents several approaches to handling unbalanced and biased data sets that contain variation in the conjugational forms of the verbs maty ‘to have’ and (po-)znaty ‘to know’ in Carpathian Rusyn. Using resampling-based methods such as cross-validation, bootstrapping and random forests, we provide a strategy for circumventing possible methodological pitfalls and extracting as much information as possible from scarce data without p-hacking the results. By calculating the predictive power of several sociolinguistic factors for linguistic variation, we can make valid statements about the (sociolinguistic) status of Rusyn and the stability of the old dialect continuum of Rusyn varieties.
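As a rough illustration of the resampling idea mentioned in this abstract, the following minimal sketch computes a percentile bootstrap confidence interval for the share of one variant in an unbalanced sample. It is not the authors' procedure: the function name `bootstrap_proportion_ci`, the labels `form_a`, `form_b` and `form_c`, and the token counts are all hypothetical and chosen only for demonstration.

```python
import random


def bootstrap_proportion_ci(observations, target, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the share of `target`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(observations)
    estimates = []
    for _ in range(n_resamples):
        # Draw n observations with replacement from the original sample.
        resample = [observations[rng.randrange(n)] for _ in range(n)]
        estimates.append(sum(1 for form in resample if form == target) / n)
    estimates.sort()
    # Read the interval bounds off the sorted bootstrap estimates.
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical toy data: three competing forms with very unequal frequencies.
tokens = ["form_a"] * 70 + ["form_b"] * 25 + ["form_c"] * 5
point = tokens.count("form_a") / len(tokens)
low, high = bootstrap_proportion_ci(tokens, "form_a")
print(f"share of form_a: {point:.2f}, 95% CI: [{low:.2f}, {high:.2f}]")
```

The interval is read directly from the resampled estimates, so no distributional assumption about the variant frequencies is needed. With real data, resampling would more plausibly be done per speaker rather than per token, since tokens from the same speaker are not independent.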