Using data from a 100-million-word representative corpus and a large-scale acceptability survey, we have investigated the relationship between corpus data and acceptability judgments. We conclude that the relative proportions of morphosyntactic variants in a corpus are the most significant predictor of a variant's acceptability to native speakers, and that in particular high relative proportions of one variant in a corpus are reliable indicators of high acceptability to native speakers. At the same time we note the limits of this predictability: low-frequency items, as noted elsewhere in the literature, often enjoy high levels of acceptability. Statistical preemption thus appears as a more limited phenomenon than had heretofore been posited.
Data from the Czech National Corpus and a large-scale survey of acceptability judgments are used to investigate the scope of morphosyntactic variation in two cases (genitive singular and locative singular) of a Czech declension pattern. The syntactic construction in which a form is found is shown to have a significant interaction with its frequency in the corpus and with its acceptability rating. We conclude that the pattern of acceptability preferences lends support to the entrenchment hypothesis and in general to emergentist approaches to language.

Abstract (translated from the Russian): This article examines the relationship between data from the Czech National Corpus and a large-scale survey of acceptability judgments. The aim of the work is to examine the scope of morphosyntactic variation in two Czech cases (the genitive singular and the locative singular). According to the results of our analysis, the syntactic construction in which a given form occurs interacts closely with its frequency in the corpus and with its acceptability rating. The overall pattern of acceptability ratings thus supports the hypothesis of the "entrenchment" of more frequent forms and is broadly consistent with so-called "emergentist" approaches to language, i.e. approaches according to which linguistic structures are created in the course of language acquisition.
Abstract: If we can operationalize corpus frequency in multiple ways, using absolute values and proportional values, which of them is more closely connected with the behaviour of language users? In this contribution, we examine overabundant cells in morphological paradigms, and look at the contribution that frequency of occurrence can make to understanding the choices speakers make due to this richness. We look at ways of operationalizing the term frequency in data from corpora and native speakers: the proportional frequency of forms (i.e. the percentage of time that a variant is found in corpus data, considered as a proportion of all variants) and several interpretations of absolute frequency (i.e. the raw frequency of variants in data from the same corpus). Working with data from unmotivated morphological variation in Czech case forms, we show that different instantiations of frequency help interpret the way variation is perceived and maintained by native speakers. Proportional frequency seems most salient for speakers in forming their judgements, while certain types of absolute frequency seem to have a dominant role in production tasks.

Key words: corpus linguistics, frequency, morphology, empirical research, surveys, questionnaires, Czech, overabundance

Introduction

Frequency data are familiar territory for any linguist who works with corpora. We cite the number of times a feature appears, or its normalized frequency if we are comparing corpora; we cite percentages to show structure within categories or to demonstrate change over time. Hidden behind the way we deal with these data is an implicit operationalization of our questions about language. We have chosen to let the corpus stand in for a particular language, type of language, genre, etc., but at the same time we have also chosen representations of frequency that give us the best chance of answering our research questions.
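The distinction between absolute, normalized and proportional frequency can be illustrated with a minimal sketch. The counts below are invented for illustration only and are not taken from the Czech National Corpus; the function names and the variant labels `-a` and `-u` are likewise assumptions made for this example, not the authors' method.

```python
from typing import Dict


def proportional_frequency(counts: Dict[str, int]) -> Dict[str, float]:
    """Share of each variant among all variants competing for one paradigm cell."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no occurrences of any variant")
    return {variant: count / total for variant, count in counts.items()}


def per_million(count: int, corpus_size: int) -> float:
    """Normalized frequency: occurrences per million corpus tokens."""
    return count * 1_000_000 / corpus_size


# Invented raw (absolute) counts for two hypothetical nouns, each with two
# competing endings in the same case slot.
frequent_noun = {"-a": 9000, "-u": 1000}
rare_noun = {"-a": 9, "-u": 1}

# Assumed corpus size for the normalization step (hypothetical round figure).
CORPUS_SIZE = 100_000_000

for label, counts in (("frequent", frequent_noun), ("rare", rare_noun)):
    shares = proportional_frequency(counts)
    for variant, count in counts.items():
        print(label, variant, count, per_million(count, CORPUS_SIZE), shares[variant])
```

Under these invented counts the two nouns have identical proportional frequencies for each ending, while their absolute and normalized frequencies differ by three orders of magnitude. That is the kind of contrast the different operationalizations are meant to separate.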
It is worth interrogating these differing operationalizations of frequency to see how the same data, approached in different ways, can shed a different light on the way native speakers apprehend and use language.

The term frequency is elastic, and once we start looking at frequency data there are few limits to the number of ways we can treat it. Divjak (2016) considers, among other meanings, the traditional relative frequency (incidence per million), construction frequency (which itself covers various ways of relating the frequencies between related items), family frequency (incorporating various ways of looking at the size and composition of a class of words) and measures of probability and association. These will largely be beyond the scope of this study, which is focused on how we understand and manipulate the numbers that arise from simple counts of individual forms.

Our material comes from three sources. We have data from the Czech National Corpus (CNC) on the frequencies of forms occupying a single morphological "slot". We selected items
This is a prepublication draft of an article published in Russian Linguistics 2015 (3). Please cite from the published version, which differs from this one in some respects.

Abstract: This article looks at inter-speaker variation in two environments: the genitive and locative singular cases of masculine "hard inanimate" nouns in Czech, using a large-scale survey of native speakers that tested their preferences for certain forms and their choices. Our hypothesis that such variation exists was upheld, but only within limited parameters. Most biographical data (age, gender, education) played no role in respondents' choices or preferences. Their region of origin played a small but significant role, although not the one expected. Relating the two types of tasks to each other, we found that respondents' use of the ratings scale did not correlate with their choice of forms, but their overall strength of preference for one form over another did correlate with their choices. Inter-speaker variation does thus go some way to explaining the persistent diversity in this paradigm and arguably may contribute to its maintenance.
In our contribution, we consider how corpus data can be used as a proxy for the written language environment around us in constructing offline studies of native-speaker intuition and usage. We assume a broadly emergent perspective on language: in other words, the linguistic competence of individuals is not identical or hard-wired, but forms gradually through exposure and coalescence of patterns of production and reaction. We hypothesize that while users presumably all in theory have access to the same linguistic material, their actual exposure to it and their ability to interpret it may differ, which will result in differing judgements and outputs. Our study looks at the interaction between corpus frequency and two possible indicators of individual difference: attitude towards reading tasks and performance on reading tasks. We find a small but consistent effect of task performance on respondents' judgements, but do not confirm any effects on respondents' production tasks.

1. Introduction

Considerable attention has been devoted to whether all native speakers of a language access the same linguistic structures and material in similar ways, and whether, having accessed it, their use of and reaction to language (what we will call linguistic behavior) differ as well in predictable ways. There is accumulating evidence that intra-speaker variation can point to differences in linguistic behavior that are not random or insignificant.