Abstract: This paper is concerned with sketching future directions for corpus-based dialectology. We advocate a holistic approach to the study of geographically conditioned linguistic variability, and we present a suitable methodology, 'corpus-based dialectometry', in exactly this spirit. Specifically, we argue that in order to live up to the potential of the corpus-based method, practitioners need to (i) abandon their exclusive focus on individual linguistic features in favor of the study of feature aggregates, (ii) draw…
“…The resulting boxy shapes are in biology often interpreted as being indicative of horizontal gene transfer and in linguistics as suggesting language contact. We skip further technicalities and refer the reader to the introduction in Szmrecsanyi and Wolk (2011:574–577). Suffice it to say that we present neighbor-net diagrams without insisting on a strictly phylogenetic interpretation.…”
This study explores variability in particle placement across nine varieties of English around the globe, utilizing data from the International Corpus of English and the Global Corpus of Web-based English. We introduce a quantitative approach for comparative sociolinguistics that integrates linguistic distance metrics and predictive modeling, and use these methods to examine the development of regional patterns in grammatical constraints on particle placement in World Englishes. We find a high degree of uniformity among the conditioning factors influencing particle placement in native varieties (e.g., British, Canadian, and New Zealand English), while English as a second language varieties (e.g., Indian and Singaporean English) exhibit a high degree of dissimilarity with the native varieties and with each other. We attribute the greater heterogeneity among second language varieties to the interaction between general L2 acquisition processes and the varying sociolinguistic contexts of the individual regions. We argue that the similarities in constraint effects represent compelling evidence for the existence of a shared variable grammar, and that variation among grammatical systems is more appropriately analyzed and interpreted as a continuum rather than as multiple distinct grammars.
“…However, given that, as we have seen, language grouping in classificatory linguistics is intended to reflect systematic, pervasive change, researchers have increasingly questioned classifications that rely on linguistic traits selected a priori. While being a practical necessity in traditional comparative dialectology, the selection of a limited number of specific traits necessarily involves subjective judgements ([…] 2005; Starostin, 2010; Szmrecsanyi & Wolk, 2011), and may result in erroneous classifications as the pre-selected traits become overly influential in the final analysis. In keeping with this view, this paper aims to contribute to the development of an empirically-based classification of Gallo-Italic through the use of dialectometry applied to atlas corpora, and specifically through the measurement of Levenshtein distance.…”
While Gallo-Italic varieties clearly belong to the Romance language family, their subgrouping as either Gallo-Romance or Italo-Romance has been the source of disagreement in the classificatory literature. While earlier analyses tended to classify Gallo-Italic as Gallo-Romance (notably Schmid, 1956; Bec, 1970-1971), later work has either argued for or tacitly assumed a classification of Gallo-Italic as part of the Italo-Romance branch, a view that is both different from and irreconcilable with the earlier Gallo-Romance classifications. In this paper we aim to contribute to the development of an empirically-based classification of Gallo-Italic through the use of dialectometry applied to atlas corpora, and specifically through the measurement of Levenshtein distance. Using three wordlists (Swadesh 100, Swadesh 200, Leipzig-Jakarta) and comparing twenty-six linguistic varieties across Italy and southeastern France, we show that Gallo-Italic is best classified as a third subgroup within the Gallo-Romance branch. Our results also clearly identify all the major bundles of isoglosses established through traditional dialectological methods and confirm Gallo-Italic as a relatively homogeneous group distinct from Italo-Romance.
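The Levenshtein-based measurement mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the normalization by the length of the longer form, and the averaging over shared concepts are all assumptions made for illustration, and the toy strings are placeholders rather than dialect transcriptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer form's length
    (an assumed normalization; other schemes exist)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def variety_distance(words_a: dict, words_b: dict) -> float:
    """Mean normalized distance over the concepts attested in both
    varieties (hypothetical aggregation step)."""
    shared = [concept for concept in words_a if concept in words_b]
    if not shared:
        raise ValueError("no shared concepts to compare")
    total = sum(normalized_distance(words_a[c], words_b[c]) for c in shared)
    return total / len(shared)


# Toy placeholder data: concept -> form, one dict per variety.
variety_1 = {"c1": "kitten", "c2": "flaw"}
variety_2 = {"c1": "sitting", "c2": "lawn"}
distance = variety_distance(variety_1, variety_2)
```

Computing `variety_distance` for every pair of varieties would yield a symmetric distance matrix, which is the kind of input that clustering or neighbor-net visualizations typically take.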
“…Corpus-based dialectometry (henceforth: CBDM), then, combines the study of dialectometric research questions with corpus-linguistic methodologies. CBDM utilizes aggregation methodologies to explore quantitative and distributional usage patterns extracted from dialect corpora (see Szmrecsanyi, 2008, 2011, 2013; Szmrecsanyi & Wolk, 2011; Wolk, 2014; Wolk & Szmrecsanyi, 2016). Turning to corpora enables analysts to address questions about usage versus knowledge, production/comprehension versus intuition, chaos versus orderliness, and so on.…”
Researchers in dialectometry have begun to explore measurements based on
fundamentally quantitative metrics, often sourced from dialect corpora, as an
alternative to the traditional signals derived from dialect atlases. This change
of data type amplifies an existing issue in the classical paradigm, namely that
locations may vary in coverage and that this affects the distance measurements:
pairs involving a location with lower coverage suffer from greater noise and
therefore imprecision. We propose a method for increasing robustness using
generalized additive modeling, a statistical technique that allows leveraging
the spatial arrangement of the data. The technique is applied to data from the
British English dialect corpus FRED; the results are evaluated regarding their
interpretability and according to several quantitative metrics. We conclude that
data availability is an influential covariate in corpus-based dialectometry and
beyond, and recommend that researchers be aware of this issue and of methods to
alleviate it.