Lects in Helsinki Finnish - a probabilistic component modeling approach

Kuparinen, Olli; Peltonen, Jaakko; Mustanoja, Liisa; Leino, Antti; Santaharju, Jenni

doi:10.1017/s0954394521000041

Cited by 1 publication (2 citation statements)

References 57 publications (79 reference statements)


“…The latter would anyway be hard to detect in the examined dataset of US-American tweets. Kuparinen et al (2021) used a Latent Dirichlet Allocation model (see Section 3.3) to discover lects in Helsinki Finnish. Although the aim of the study was similar to the current one, the data was differently pre-processed.…”

Section: Topic Models and Dimensionality Reduction In Dialectometry (mentioning)

confidence: 99%

“…Secondly, the dialectologically meaningful features are tied to the words, which means we are not actually calculating the frequency of the variants themselves (cf. Kuparinen et al, 2021), but the combinations of words and variants. If we modify the example from before, the occurrences talosa "in a house," koulusa "in a school," and kirkosa "in a church" would all end up as different tokens in the corpus, although they all have the same dialectal variant -sa of the inessive case.…”

Section: Applying Topic Models To Phonetically Transcribed Dialect Co... (mentioning)

confidence: 99%
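
The tokenization point in the second citation statement can be illustrated with a minimal sketch. It takes the three inessive forms quoted there and compares whole-word tokens with character trigrams. The helper name `char_ngrams`, the `#` boundary marker, and the choice of n = 3 are assumptions made for illustration only; they are not taken from the cited or the citing paper's preprocessing.

```python
from collections import Counter


def char_ngrams(word, n=3):
    """Return the character n-grams of a word, padded with '#' at both ends.

    The '#' boundary marker is an illustrative assumption, not the
    preprocessing used in the cited studies.
    """
    padded = f"#{word}#"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


# The three forms quoted in the citation statement, all with the -sa variant.
words = ["talosa", "koulusa", "kirkosa"]

# Whole-word tokenization: every form is a separate token type.
word_counts = Counter(words)

# Character trigram tokenization: the shared ending surfaces as one token type.
ngram_counts = Counter(g for w in words for g in char_ngrams(w))

# Trigrams that occur in every one of the three words.
shared = set(char_ngrams(words[0]))
for w in words[1:]:
    shared &= set(char_ngrams(w))

print(len(word_counts), "distinct word tokens")
print("trigrams shared by all words:", sorted(shared))
```

On the word level the three forms stay unrelated, while on the trigram level the word-final `sa#` is counted once per word, so the variant itself becomes visible to a frequency-based model.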

Corpus-based dialectometry with topic models

Kuparinen, Scherrer

2024, J. of Ling. Geography

Self Cite


This paper presents a topic modeling approach to corpus-based dialectometry. Topic models are most often used in text mining to find latent structure in a collection of documents. They are based on the idea that frequently co-occurring words represent the same underlying topic. In this study, topic models are applied directly to interview transcriptions containing dialectal speech, without any annotations or preselected features. The transcriptions are modeled on complete words, on character n-grams, and after automatic segmentation. Data from three languages, Finnish, Norwegian, and Swiss German, are scrutinized. The proposed method is capable of discovering clear dialectal differences in all three datasets, while reflecting the differences between them. The method provides a significant simplification of the dialectometric workflow, simultaneously saving time and increasing objectivity. Using the method on non-normalized data could also benefit text mining, which is the traditional field of topic modeling.
