Borders and boundaries in Bosnian, Croatian, Montenegrin and Serbian: Twitter data to the rescue
2018 | DOI: 10.1017/jlg.2018.9

Abstract: In this paper we deal with the spatial distribution of 16 linguistic features known to vary between Bosnian, Croatian, Montenegrin, and Serbian. We perform our analyses on a dataset of geo-encoded Twitter status messages collected from mid-2013 to the end of 2016. We perform two types of analyses. The first finds boundaries in the spatial distribution of the linguistic variable levels through the kernel density estimation smoothing technique. These boundaries are then plotted over the state borders…
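The abstract describes locating boundaries by smoothing the spatial distribution of each feature's variants with kernel density estimation. A minimal sketch of that idea, using scipy's gaussian_kde on placeholder coordinates (the paper's exact estimator and bandwidth choices are not reproduced here):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical input: (lon, lat) of tweets using each variant of one
# linguistic variable, e.g. the ije/je vs. e reflexes of Old Slavic ě.
coords_ije = np.random.uniform([13, 42], [20, 47], size=(500, 2))  # placeholder data
coords_e = np.random.uniform([18, 42], [23, 46], size=(500, 2))    # placeholder data

# Smooth each variant's spatial distribution with a Gaussian kernel.
kde_ije = gaussian_kde(coords_ije.T)
kde_e = gaussian_kde(coords_e.T)

# Evaluate both densities on a regular grid over the region.
lon, lat = np.meshgrid(np.linspace(13, 23, 200), np.linspace(42, 47, 100))
grid = np.vstack([lon.ravel(), lat.ravel()])
dens_ije = kde_ije(grid).reshape(lon.shape)
dens_e = kde_e(grid).reshape(lon.shape)

# The boundary between variants lies where the smoothed densities are
# (approximately) equal; its contour can then be plotted over state borders.
boundary = np.isclose(dens_ije, dens_e, rtol=0.05)
```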

Cited by 9 publications (9 citation statements) | References 21 publications
“…In the first subtask (Phon), we test whether BERTić can select the correct variant for a phonological variable, specifically the reflex of the Old Slavic vowel ě. This feature exhibits geographic variation in BCMS: In the (north-)west, the reflexes ije and je are predominantly used, whereas the (south-)east mostly uses e (Ljubešić et al., 2018), e.g., lijepo vs. lepo ('nice'). Drawing upon words for which both ije/je and e variants exist in the BERTić vocabulary, we filter out words that appear in fewer than 10 posts in the merged VarDial dev and test data, resulting in a set of 64 words (i.e., 32 pairs).…”
Section: Zero-shot Dialect Feature Prediction (ZS-Dialect) | mentioning
confidence: 99%
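A rough sketch of the filtering step this excerpt describes, with hypothetical inputs vocab (the model's word list) and posts (the merged VarDial dev and test posts); the cited paper's actual code is not shown here:

```python
from collections import Counter

def build_pairs(vocab, posts, min_posts=10):
    """Pair ije-reflex words with their e-reflex counterparts, keeping only
    words attested in at least `min_posts` posts (illustrative helper;
    je/e pairs would be handled analogously)."""
    # Count in how many posts each word occurs.
    post_counts = Counter(w for post in posts for w in set(post.split()))
    pairs = []
    for word in vocab:
        if "ije" not in word:
            continue
        e_variant = word.replace("ije", "e", 1)  # lijepo -> lepo
        if (e_variant in vocab
                and post_counts[word] >= min_posts
                and post_counts[e_variant] >= min_posts):
            pairs.append((word, e_variant))
    return pairs
```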
“…We perform evaluation of the models on this task on four datasets: the Croatian standard language dataset hr500k (Ljubešić et al., 2018), the Croatian non-standard Twitter language dataset ReLDI-hr (Ljubešić et al., 2019a), the Serbian standard language dataset SETimes.SR (Batanović et al., 2018) and the Serbian non-standard Twitter language dataset ReLDI-sr (Ljubešić et al., 2019b).…”
Section: Morphosyntactic Tagging | mentioning
confidence: 99%
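These four datasets are, to my knowledge, distributed in CoNLL-U format, so tagging evaluation reduces to comparing tag columns token by token. An illustrative sketch assuming aligned gold and predicted files (not the evaluation code of the cited work):

```python
def read_xpos(conllu_path):
    """Collect XPOS tags (5th column) from a CoNLL-U file."""
    tags = []
    with open(conllu_path, encoding="utf-8") as f:
        for line in f:
            if line.strip() and not line.startswith("#"):
                cols = line.rstrip("\n").split("\t")
                if cols[0].isdigit():  # skip multiword-token and empty-node lines
                    tags.append(cols[4])
    return tags

def tagging_accuracy(gold_path, pred_path):
    gold, pred = read_xpos(gold_path), read_xpos(pred_path)
    assert len(gold) == len(pred), "token counts must match"
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)
```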
“…The Twitter user dataset (Twitter-HBS, Table 2) consists of tweets and their language tag (Bosnian, Croatian, Montenegrin, or Serbian). The main goal of creating this corpus is discrimination between closely related languages at the level of Twitter users (Ljubešić and Rupnik, 2022). The PE2rr corpus includes source language texts from many fields, as well as automatically produced translations into a number of morphologically rich languages, post-edited versions of those texts, and error annotations of the post-editing processes that were carried out.…”
Section: Multilingual Corpora | mentioning
confidence: 99%
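User-level discrimination between closely related languages is commonly approached with character n-gram features over each user's concatenated tweets. A minimal illustrative baseline on hypothetical placeholder data, and not the setup of Ljubešić and Rupnik (2022):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical input: one concatenated string of tweets per user, plus
# that user's language label (bs/hr/me/sr), as in Twitter-HBS.
users = ["kako je lijepo danas ...", "kako je lepo danas ..."]  # placeholder
labels = ["hr", "sr"]                                           # placeholder

# Character n-grams are a standard baseline for closely related languages.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
    LogisticRegression(max_iter=1000),
)
model.fit(users, labels)
print(model.predict(["sutra cemo se vidjeti"]))
```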
“…The transformer-based model was also introduced for several tasks in Serbian, Croatian, and Slovene, including NER (Ljubešić and Lauc, 2021). The model was pre-trained on web-crawled texts in Serbian, Bosnian, Croatian, and Slovene comprising 8 billion tokens, and then fine-tuned for NER on several openly available datasets, such as SETimes.SR, a corpus of news articles, or ReLDI-sr (Ljubešić et al., 2017), a corpus of annotated tweets. For reference, the authors compared this model with CroSloEngual BERT (Ulčar and Robnik-Šikonja, 2020) and multilingual BERT (Devlin et al., 2018), where the language-specific BERT-based models significantly outperformed multilingual BERT.…”
Section: Named Entity Recognition | mentioning
confidence: 99%
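A minimal sketch of applying such a fine-tuned model for NER through the HuggingFace transformers API; the model identifier classla/bcms-bertic-ner is an assumption about where the discussed model is published and should be verified before use:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed hub identifier for the fine-tuned NER model discussed above.
name = "classla/bcms-bertic-ner"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name)

sentence = "Nikola Tesla je rođen u Smiljanu."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map each subword token to its highest-scoring entity label.
pred_ids = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, pid in zip(tokens, pred_ids):
    print(tok, model.config.id2label[int(pid)])
```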