Abstract: In this paper we apply various clustering algorithms to dialect pronunciation data. At the same time we propose several evaluation techniques that should be used to deal with the instability of the clustering techniques. The results show that three hierarchical clustering algorithms are not suitable for the data we are working with. The rest of the tested algorithms successfully detected a two-way split of the data into the Eastern and Western dialects. At the aggregate level that we used…
“…see Prokić & Nerbonne, 2008; Nerbonne & Heeringa, 2009; Wieling & Nerbonne, 2010; Grieve et al., 2011). In particular, Ward's method for hierarchical clustering (Ward, 1963) was used because it tends to identify the clearest dialect regions and because it is one of the most common methods for hierarchical clustering in dialectometry.…”
This paper presents the results of a multivariate spatial analysis of thirty-eight vowel formant variables measured in 236 cities from across the contiguous United States, based on the acoustic data from the Atlas of North American English. The results of the analysis both confirm and challenge the results of the Atlas. Most notably, while the analysis identifies similar patterns as the Atlas in the West and the Southeast, the analysis finds that the Midwest and the Northeast are distinct dialect regions that are considerably stronger than the traditional Midland dialect region identified in the Atlas. The analysis also finds evidence that a vowel shift is actively shaping the language of the Western United States.
“…However, previous studies have noted that a three-dimensional MDS representation usually accounts for about 90% of the variation in the distance matrix and can thus be considered reliable (Heeringa, 2004; Prokić & Nerbonne, 2008) … in an MDS analysis, this will be obvious in a low correlation between distances in the input matrix and distances in the inferred two- or three-dimensional solution.…”
The calculation of aggregate linguistic distances can compensate for some of the drawbacks inherent to the isogloss bundling method used in traditional dialectology to identify dialect areas. Synchronic aggregate analysis can also point out differences with respect to a diachronically based classification of dialects. In this study the Levenshtein algorithm is applied for the first time to obtain an aggregate analysis of the linguistic distances among 88 diatopic varieties of Croatian spoken along the Eastern Adriatic coast and in the Italian province of Molise. We also measured lexical differences among these varieties, which are traditionally grouped into Čakavian, Štokavian, and transitional Čakavian-Štokavian varieties. The lexical and pronunciational distances are subsequently projected onto multidimensional cartographic representations. Both kinds of analyses confirmed that linguistic discontinuity is characteristic of the whole region, and that discontinuities are more pronounced in the northern Adriatic area than in the south. We also show that the geographic lines are in many cases the most decisive factor contributing to linguistic cohesion, and that the internal heterogeneity within Čakavian is often greater than the differences between Čakavian and Štokavian varieties. This holds both for pronunciation and lexicon.
Introduction: One of the most popular methods applied in traditional geolinguistics (dialectology) is the method of isoglosses, in which areas characterized by different realizations of a single feature are separated by a line, an isogloss. Bundles of such lines were traditionally considered the most important criterion for the division of geolinguistic space into linguistic areas. Despite the tendency to rely on this method in traditional dialectology, even there it has long been recognized that isoglosses do not determine dialectal areas unambiguously because they rarely coincide completely. The isogloss method needs additional assumptions to account for transitional zones and/or dialect continua, even though these are widely recognized to be as common as tightly knit and readily definable linguistic areas (Chambers & Trudgill, 1998:97). Brozović, who is aware of the problem, argues that in the case of Croatian, because of specific features of the dialectological make-up of this language, the use of the traditional isogloss method is nevertheless sometimes justified: "In our linguistic territory we often find the kind of clear-cut dialectal boundaries that older dialectologists could only dream of; these boundaries occur with intense, clear and dense bundles of isoglosses, whereas it has long been clear to dialectologists that such 'ideal' dialectal boundaries are not a common occurrence in language." (1970:9). It is our opinion, however, that the division of the Croatian language area into dialect groups is still problematic. This is because although clear-cut dialectal boundaries might be found often in Croatia, they are by no means the rule as Brozović (1970...
“…Before generating a linguistic distance matrix, Cronbach's alpha was used to measure the internal consistency of the linguistic variables in the regional linguistic data matrix (Nerbonne and Heeringa, 1997;Heeringa et al, 2002;Heeringa, 2004;Nerbonne, 2008;Szmrecsanyi, 2008;Spruit et al, 2009). Cronbach's alpha was originally developed to assess if a set of items in a psychometric test measure the same underlying construct based on the scores on the test items for a sample of test takers (Cronbach, 1951).…”
Section: Cronbach's Alpha (mentioning)
confidence: 99%
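The internal-consistency check described in the statement above can be illustrated with a minimal sketch of the standard Cronbach's alpha formula, alpha = k/(k-1) * (1 - sum of item variances / variance of the summed scores). This is not the cited authors' code: the function name `cronbach_alpha` and the row layout (one row per location, one column per linguistic variable) are assumptions made for illustration.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a table of scores.

    rows: list of observations (assumed here: locations), each a list of
    k item scores (assumed here: linguistic variables).
    """
    if len(rows) < 2:
        raise ValueError("need at least two observations")
    k = len(rows[0])
    if k < 2:
        raise ValueError("need at least two items")

    # Sample variance of each item (column) across observations.
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    # Sample variance of the per-observation total score.
    total_var = variance([sum(row) for row in rows])
    if total_var == 0:
        raise ValueError("total scores have zero variance")

    return k / (k - 1) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    # Hypothetical toy data: three locations, two variables.
    print(cronbach_alpha([[1, 2], [2, 4], [3, 6]]))
```

Values close to 1 indicate that the variables vary together across observations; the toy data here are invented and carry no linguistic meaning.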
“…While the multidimensional scaling identifies continuous patterns of aggregated regional linguistic variation, a cluster analysis can be used to produce a discrete classification of the locations, which can then be mapped in order to identify absolute patterns of aggregated regional linguistic variation. In this analysis, the linguistic distance matrix was subjected to a hierarchical cluster analysis (Shackleton, 2005, 2007; Goebl, 2007; Prokić & Nerbonne, 2008; Wieling & Nerbonne, 2010). A hierarchical cluster analysis identifies clusters of similar objects in a distance matrix by initially assigning each observation to its own cluster and by then repeatedly combining the two most similar clusters to form larger and larger clusters until all of the objects have been combined to form one large cluster.…”
Section: Linguistic Distance Maps (mentioning)
confidence: 99%
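The agglomerative procedure described in the statement above (start with one cluster per observation, then repeatedly merge the two most similar clusters) can be sketched as follows. This is an illustrative sketch, not the cited authors' implementation: it assumes each location is given as a numeric feature vector under Euclidean distance rather than as a precomputed linguistic distance matrix, it uses Ward's criterion (merge the pair whose union least increases the within-cluster sum of squares), and the name `ward_clusters` is hypothetical.

```python
from itertools import combinations


def ward_clusters(points, n_clusters=1):
    """Naive agglomerative clustering with Ward's merge criterion.

    points: list of equal-length numeric vectors (assumed Euclidean space).
    Returns (clusters, merges): clusters is a list of sorted index lists;
    merges records (members_a, members_b, cost) in merge order, which is
    the information a dendrogram would display.
    """
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")

    # Each cluster is (member indices, centroid); start with singletons.
    clusters = [([i], [float(x) for x in p]) for i, p in enumerate(points)]
    merges = []

    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            (ma, ca), (mb, cb) = clusters[a], clusters[b]
            na, nb = len(ma), len(mb)
            sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
            # Increase in within-cluster sum of squares if a and b merge.
            cost = na * nb / (na + nb) * sq_dist
            if best is None or cost < best[0]:
                best = (cost, a, b)

        cost, a, b = best
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        na, nb = len(ma), len(mb)
        # Size-weighted mean of the two centroids.
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        merges.append((sorted(ma), sorted(mb), cost))
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append((ma + mb, centroid))

    return [sorted(members) for members, _ in clusters], merges


if __name__ == "__main__":
    # Hypothetical toy data: two well-separated pairs of points.
    groups, history = ward_clusters([(0, 0), (0, 1), (10, 0), (10, 1)], 2)
    print(groups)
```

Calling the function with `n_clusters=1` runs the merging to completion, so `merges` then holds the full merge order. The pairwise search is cubic in the number of observations, which is acceptable only for small illustrative inputs.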
“…A hierarchical cluster analysis identifies clusters of similar objects in a distance matrix by initially assigning each observation to its own cluster and by then repeatedly combining the two most similar clusters to form larger and larger clusters until all of the objects have been combined to form one large cluster. Various methods exist for measuring the similarity between clusters consisting of multiple observations, but Ward's method (Ward, 1963) was used here because it is a common approach to clustering, which has been found to perform well in dialectometry (Prokic & Nerbonne, 2008) and which tends to produce clear and compact clusters. The results of the cluster analysis are represented by a tree diagram called a dendrogram, which shows the order in which the clusters were formed, and which can be used to identify clusters and sub-clusters of observations in the dataset.…”
This paper compares two statistical approaches to the analysis of aggregated regional linguistic variation. In the standard approach to dialectometry, common patterns of regional variation are identified by analyzing the distance between a set of locations based on the values of a set of linguistic variables using statistics such as multivariate scaling. Alternatively, in a multivariate spatial analysis, common patterns of regional variation are identified by using a factor analysis to analyze correlations between linguistic variables that are first smoothed using a local spatial autocorrelation analysis that identifies underlying spatial patterns in the values of each variable. To compare these approaches, both methods are used to analyze the acoustic vowel data from the Atlas of North American English. It is concluded that the multivariate spatial analysis identifies clearer and more detailed patterns of aggregated regional linguistic variation than the standard approach to dialectometry.