Borders and boundaries in Bosnian, Croatian, Montenegrin and Serbian: Twitter data to the rescue
2018 | DOI: 10.1017/jlg.2018.9

Abstract: In this paper we deal with the spatial distribution of 16 linguistic features known to vary between Bosnian, Croatian, Montenegrin, and Serbian. We perform our analyses on a dataset of geo-encoded Twitter status messages collected from mid-2013 to the end of 2016. We perform two types of analyses. The first finds boundaries in the spatial distribution of the linguistic variable levels through the kernel density estimation smoothing technique. These boundaries are then plotted over the state borders…
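The abstract describes locating boundaries by smoothing the spatial distribution of each feature's variants with kernel density estimation. A minimal sketch of that idea, using scipy's gaussian_kde on placeholder coordinates (the paper's exact estimator and bandwidth choices are not reproduced here):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical input: (lon, lat) of tweets using each variant of one
# linguistic variable, e.g. the ije/je vs. e reflexes of Old Slavic ě.
coords_ije = np.random.uniform([13, 42], [20, 47], size=(500, 2))  # placeholder data
coords_e = np.random.uniform([18, 42], [23, 46], size=(500, 2))    # placeholder data

# Smooth each variant's spatial distribution with a Gaussian kernel.
kde_ije = gaussian_kde(coords_ije.T)
kde_e = gaussian_kde(coords_e.T)

# Evaluate both densities on a regular grid over the region.
lon, lat = np.meshgrid(np.linspace(13, 23, 200), np.linspace(42, 47, 100))
grid = np.vstack([lon.ravel(), lat.ravel()])
dens_ije = kde_ije(grid).reshape(lon.shape)
dens_e = kde_e(grid).reshape(lon.shape)

# The boundary between variants lies where the smoothed densities are
# (approximately) equal; its contour can then be plotted over state borders.
boundary = np.isclose(dens_ije, dens_e, rtol=0.05)
```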

Cited by 9 publications (9 citation statements) | References 21 publications
“…In the first subtask (Phon), we test whether BERTić can select the correct variant for a phonological variable, specifically the reflex of the Old Slavic vowel ě. This feature exhibits geographic variation in BCMS: In the (north-)west, the reflexes ije and je are predominantly used, whereas the (south-)east mostly uses e (Ljubešić et al., 2018), e.g., lijepo vs. lepo ('nice'). Drawing upon words for which both ije/je and e variants exist in the BERTić vocabulary, we filter out words that appear in fewer than 10 posts in the merged VarDial dev and test data, resulting in a set of 64 words (i.e., 32 pairs).…”
Section: Zero-shot Dialect Feature Prediction (ZS-Dialect) | mentioning
confidence: 99%
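A rough sketch of the filtering step this excerpt describes, with hypothetical inputs vocab (the model's word list) and posts (the merged VarDial dev and test posts); the cited paper's actual code is not shown here:

```python
from collections import Counter

def build_pairs(vocab, posts, min_posts=10):
    """Pair ije-reflex words with their e-reflex counterparts, keeping only
    words attested in at least `min_posts` posts (illustrative helper;
    je/e pairs would be handled analogously)."""
    # Count in how many posts each word occurs.
    post_counts = Counter(w for post in posts for w in set(post.split()))
    pairs = []
    for word in vocab:
        if "ije" not in word:
            continue
        e_variant = word.replace("ije", "e", 1)  # lijepo -> lepo
        if (e_variant in vocab
                and post_counts[word] >= min_posts
                and post_counts[e_variant] >= min_posts):
            pairs.append((word, e_variant))
    return pairs
```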
“…We perform evaluation of the models on this task on four datasets: the Croatian standard language dataset hr500k (Ljubešić et al., 2018), the Croatian non-standard Twitter language dataset ReLDI-hr (Ljubešić et al., 2019a), the Serbian standard language dataset SETimes.SR (Batanović et al., 2018) and the Serbian non-standard Twitter language dataset ReLDI-sr (Ljubešić et al., 2019b).…”
Section: Morphosyntactic Tagging | mentioning
confidence: 99%
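These four datasets are, to my knowledge, distributed in CoNLL-U format, so tagging evaluation reduces to comparing tag columns token by token. An illustrative sketch assuming aligned gold and predicted files (not the evaluation code of the cited work):

```python
def read_xpos(conllu_path):
    """Collect XPOS tags (5th column) from a CoNLL-U file."""
    tags = []
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            if line.strip() and not line.startswith("#"):
                cols = line.rstrip("\n").split("\t")
                if cols[0].isdigit():  # skip multiword-token and empty-node lines
                    tags.append(cols[4])
    return tags

def tagging_accuracy(gold_path, pred_path):
    gold, pred = read_xpos(gold_path), read_xpos(pred_path)
    assert len(gold) == len(pred), "token counts must match"
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)
```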
“…The Twitter user dataset (Twitter-HBS, Table 2) consists of tweets and their language tag (Bosnian, Croatian, Montenegrin, or Serbian). The main goal of creating this corpus is discrimination between closely related languages at the level of Twitter users (Ljubešić and Rupnik, 2022). The PE2rr corpus includes source language texts from many fields, as well as automatically produced translations into a number of morphologically rich languages, post-edited versions of those texts, and error annotations of the post-editing processes that were carried out.…”
Section: Multilingual Corpora | mentioning
confidence: 99%
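User-level discrimination between closely related languages is commonly approached with character n-gram features over each user's concatenated tweets. A minimal illustrative baseline on hypothetical placeholder data, and not the setup of Ljubešić and Rupnik (2022):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical input: one concatenated string of tweets per user, plus
# that user's language label (bs/hr/me/sr), as in Twitter-HBS.
users = ["kako je lijepo danas ...", "kako je lepo danas ..."]  # placeholder
labels = ["hr", "sr"]                                           # placeholder

# Character n-grams are a standard baseline for closely related languages.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
    LogisticRegression(max_iter=1000),
)
model.fit(users, labels)
print(model.predict(["sutra cemo se vidjeti"]))
```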
“…The transformer-based model was also introduced for several tasks in Serbian, Croatian, and Slovene, including NER (Ljubešić and Lauc, 2021). The model was pre-trained on web-crawled texts in Serbian, Bosnian, Croatian, and Slovene comprising 8 billion tokens, and then fine-tuned for NER on several openly available datasets, such as SETimes.SR, a corpus of news articles, or ReLDI-sr (Ljubešić et al., 2017), a corpus of annotated tweets. For reference, the authors compared this model with CroSloEngual BERT (Ulčar and Robnik-Šikonja, 2020) and multilingual BERT (Devlin et al., 2018), where the language-specific BERT-based models significantly outperformed multilingual BERT.…”
Section: Named Entity Recognition | mentioning
confidence: 99%
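A minimal sketch of applying such a fine-tuned model for NER through the HuggingFace transformers API; the model identifier classla/bcms-bertic-ner is an assumption about where the discussed model is published and should be verified before use:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed hub identifier for the fine-tuned NER model discussed above.
name = "classla/bcms-bertic-ner"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name)

sentence = "Nikola Tesla je rođen u Smiljanu."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map each subword token to its highest-scoring entity label.
pred_ids = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, pid in zip(tokens, pred_ids):
    print(tok, model.config.id2label[int(pid)])
```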