Abstract: Researchers in dialectometry have begun to explore measurements based on
fundamentally quantitative metrics, often sourced from dialect corpora, as an
alternative to the traditional signals derived from dialect atlases. This change
of data type amplifies an existing issue in the classical paradigm, namely that
locations may vary in coverage and that this affects the distance measurements:
pairs involving a locat…
“…Although, in comparison to atlases, they reveal more about the context and magnitude in which linguistic features are used, they come with their own issues. One problem is that the frequencies of the collected features typically need to be normalized to be comparable enough for dialectometrical analysis (Wolk & Szmrecsanyi, 2018). In the current work, we aim to surpass the issue by using transcribed interview data directly, without explicitly defining a list of features beforehand.…”
This paper presents a topic modeling approach to corpus-based dialectometry. Topic models are most often used in text mining to find latent structure in a collection of documents. They are based on the idea that frequently co-occurring words represent the same underlying topic. In this study, topic models are applied directly to interview transcriptions containing dialectal speech, without any annotations or preselected features. The transcriptions are modeled on complete words, on character n-grams, and after automatic segmentation. Data from three languages, Finnish, Norwegian, and Swiss German, are examined. The proposed method discovers clear dialectal differences in all three datasets while reflecting the differences between them. The method significantly simplifies the dialectometric workflow, simultaneously saving time and increasing objectivity. Applying the method to non-normalized data could also benefit text mining, the traditional field of topic modeling.
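The abstract above describes representing transcriptions as character n-grams so that a topic model can be fit without a predefined feature list. A minimal sketch of that preprocessing step, using a toy corpus and an illustrative n-gram size (both are assumptions, not the paper's actual data or parameters):

```python
# Sketch of the n-gram preprocessing described above: turn raw
# transcriptions into character n-gram counts, the document-term
# input a topic model (e.g. LDA) would then factor into
# document-topic and topic-term distributions.
from collections import Counter

def char_ngrams(text, n=3):
    """Return overlapping character n-grams, padding word boundaries with '_'."""
    padded = "_" + text.strip().replace(" ", "_") + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

# Toy "transcriptions" standing in for dialect interview data.
docs = ["mie menen kotiin", "minä menen kotiin"]

# One bag-of-n-grams per document; no feature list is defined in advance.
doc_term = [Counter(char_ngrams(d)) for d in docs]
```

Working on boundary-padded character n-grams rather than whole words is what lets the model pick up sub-word dialectal variation (inflection, phonology) without any manual feature selection.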
This article uses the Q-learning algorithm to investigate the development of college students' ecological language use in selected cities and analyzes the survey results, covering students' language ability, the impact of the language environment on that ability, and differences in language use and language behavior. On this basis, it summarizes students' usage habits and behaviors and proposes solutions. With respect to social factors, it analyzes students' current situation and college students' patterns of Mandarin use. It examines the causes of college students' "bilingualism" problems from sociolinguistic and psycholinguistic perspectives and offers targeted solutions for improving language proficiency at the school, family, and individual levels. The results show that only 9.9% of respondents rate their own Mandarin as "very good"; among respondents whose contacts can speak a little Mandarin, 19% rate their Mandarin as "very good"; and among those whose contacts' Mandarin is very fluent, 32.1% do. This indicates that the Mandarin level of the people respondents regularly interact with has a strong, positively correlated influence on their own Mandarin level.