Characterisation of COVID-19-Related Tweets in the Croatian Language: Framework Based on the Cro-CoV-cseBERT Model

Babić, Karlo; Petrović, Milan; Beliga, Slobodan; Martinčić-Ipšić, Sanda; Matešić, Mihaela; Meštrović, Ana

doi:10.3390/app112110442

Cited by 19 publications

(30 citation statements)

References 47 publications

“…The seminal work of [41] contributed to the emergence of numerous variants of text representation models in terms of low-dimensional vectors in continuous space-embeddings, where embeddings allow semantically related linguistic units to be represented with similar vector representations. As described in [8] the first generation was characterised by shallow language models, such as Word2Vec [41], Doc2Vec [42], GloVe [43] and fastText [44]. They have some shortcomings, such as static embeddings in which multiple concepts (i.e., different meanings of the same entity, polysemy) are not represented by different embedding vectors, or poor performance in new domains.…”

Section: Text Features (mentioning)

confidence: 99%

“…For example, the outbreak of the COVID-19 disease caused a significant increase in social media usage among the public and it seriously affected the public's understanding of the COVID-19 risk [7]. In some countries there were many negative attitudes toward vaccines and anti-pandemic measures promoted on social networks [8]. Therefore, information spreading analysis during the global crisis is of great importance as one step of social media monitoring (infoveillance).…”

Section: Introduction (mentioning)

confidence: 99%

“…Consequently, it is one of the most studied social networks. Recently, especially for the monitoring and tracking different aspects of healthcare information and public disease [8,10,11]. Among all the user behavior in social media, the retweet is considered one of the primary functions for spreading information on Twitter [12,13].…”

Section: Introduction (mentioning)

confidence: 99%

“…The text features are represented as a low dimensional vector (embedding) that captures its semantic and structure. More specifically, we adopt a BERT-based language model, namely Cro-CoV-cseBERT [8] for tweets' representation as embeddings, which we use as the set of text features.…”

Section: Introduction (mentioning)

confidence: 99%

Retweet Prediction based on Heterogeneous Data Sources: The Combination of Text and Multilayer Network Features

Meštrović, Petrović, Beliga. 2022. Preprint. Self Cite

Retweet prediction is an important task related to different problems such as information spreading analysis, the automatic detection of fake news, social media monitoring, etc. In this study we explore the possibilities of retweet prediction based on heterogeneous data sources. In order to classify the tweet according to the amount of retweets, we combine features extracted from the multilayer network and the text. More specifically, we introduce a multilayer framework that proposes the multilayer network representation of Twitter. This formalism captures different users' actions and complex relationships as well as other key properties of communication on Twitter. We select a set of local network measures from each layer and construct a set of multilayer network features. In addition, we adopt a BERT-based language model, namely Cro-CoV-cseBERT to capture high-level semantics and structure of tweets as a set of text features. Then, we train six machine learning (ML) algorithms: random forest, multilayer perceptron, light gradient boosting machine, category embedding model, neural oblivious decision ensembles and attentive interpretable tabular learning model in the task of retweet prediction. We compare the performance of all six algorithms in three different setups (i) using only text features, (ii) using only multilayer network features and (iii) using both sets of features. We evaluate all setups in terms of standard evaluation measures i.e. precision, recall, F1-score and accuracy. For this task, we first prepare and use an empirical dataset of 199,431 tweets in the Croatian language posted during the period between January 1, 2020 and May 31, 2021. Our results indicate that by integrating multilayer network features with text features the prediction model would perform better than using just one set of features.
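The evaluation measures named in the abstract above (precision, recall, F1-score and accuracy) can be illustrated with a minimal sketch. It assumes a binary retweet label for simplicity; the abstract does not state how many classes were used, and the function name `binary_metrics` and the toy labels are hypothetical, not taken from the paper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, F1 and accuracy for a binary task.

    Assumption: labels are binary; `positive` marks the class of interest.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Toy labels (hypothetical): 1 = retweeted, 0 = not retweeted.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

With these toy labels there are three true positives, one false positive, one false negative and three true negatives, so all four measures equal 0.75.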

“…Most studies employed different NLP techniques for capturing specific aspects of the COVID-19 content published online. For discovering public perceptions, opinions, and attitudes toward specific COVID-19-related topics, researchers commonly combine topic modeling and sentiment analysis [21,26–28], which are also occasionally combined with named entity recognition (NER) [29].…”

Section: Prior Work (mentioning)

confidence: 99%

Infoveillance of the Croatian Online Media During the COVID-19 Pandemic: One-Year Longitudinal Study Using Natural Language Processing

Beliga, Martinčić-Ipšić, Matešić et al. 2021. JMIR Public Health Surveill. Self Cite

Background Online media play an important role in public health emergencies and serve as essential communication platforms. Infoveillance of online media during the COVID-19 pandemic is an important step toward gaining a better understanding of crisis communication. Objective The goal of this study was to perform a longitudinal analysis of the COVID-19–related content on online media based on natural language processing. Methods We collected a data set of news articles published by Croatian online media during the first 13 months of the pandemic. First, we tested the correlations between the number of articles and the number of new daily COVID-19 cases. Second, we analyzed the content by extracting the most frequent terms and applied the Jaccard similarity coefficient. Third, we compared the occurrence of the pandemic-related terms during the two waves of the pandemic. Finally, we applied named entity recognition to extract the most frequent entities and tracked the dynamics of changes during the observation period. Results The results showed no significant correlation between the number of articles and the number of new daily COVID-19 cases. Furthermore, there were high overlaps in the terminology used in all articles published during the pandemic with a slight shift in the pandemic-related terms between the first and the second waves. Finally, the findings indicate that the most influential entities have lower overlaps for the identified people and higher overlaps for locations and institutions. Conclusions Our study shows that online media have a prompt response to the pandemic with a large number of COVID-19–related articles. There was a high overlap in the frequently used terms across the first 13 months, which may indicate the narrow focus of reporting in certain periods. However, the pandemic-related terminology is well-covered.
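The Jaccard similarity coefficient mentioned in the abstract above measures the overlap between two term sets as the size of their intersection divided by the size of their union. The sketch below is illustrative only: the function name `jaccard` and the example term lists are hypothetical and do not come from the study's data.

```python
def jaccard(terms_a, terms_b):
    """Jaccard similarity: |A ∩ B| / |A ∪ B| over two collections of terms."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    if not union:
        # Convention assumed here: two empty sets have similarity 0.0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical frequent terms from two periods (not the study's data).
wave_one = ["virus", "mask", "lockdown", "vaccine"]
wave_two = ["virus", "vaccine", "booster", "mask", "variant"]
score = jaccard(wave_one, wave_two)
```

Here three terms are shared out of six distinct terms overall, so `score` is 0.5. A value close to 1 would correspond to the "high overlap" in terminology that the abstract reports.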

Topic Modeling for Tracking COVID-19 Communication on Twitter

Bogović, Meštrović, Martinčić-Ipšić. 2022. Communications in Computer and Information Science
