ArCOV-19: The First Arabic COVID-19 Twitter Dataset with Propagation Networks

Haouari, Fatima; Hasanain, Maram; Suwaileh, Reem; Elsayed, Tamer

doi:10.48550/arxiv.2004.05861

Cited by 15 publications

(18 citation statements)

References 0 publications


“…Most of these datasets are generic, and lack annotations or labels. Examples include multilingual corpus on a wide variety of topics related to COVID-19 [CLF20, AMEP+20, HJB+20], longitudinal Twitter chatter dataset [BTW+20], multilingual dataset with location information of the users [QIO20], Twitter dataset for Arabic tweets [AAA20], Twitter dataset for popular Arabic tweets [HHSE20], and dataset for identification of stance, replies, and quotes [VCKBC20]. Most of these datasets either have no annotations at all, employ automated annotations using transfer learning or semi-supervised methods, or are not specifically designed for misinformation.…”

Section: Covid-19 Datasets (mentioning)

confidence: 99%

Characterizing COVID-19 Misinformation Communities Using a Novel Twitter Dataset

Memon, Carley

2020

Preprint


From conspiracy theories to fake cures and fake treatments, COVID-19 has become a hotbed for the spread of misinformation online. It is more important than ever to identify methods to debunk and correct false information online. In this paper, we present a methodology and analyses to characterize the two competing COVID-19 misinformation communities online: (i) misinformed users or users who are actively posting misinformation, and (ii) informed users or users who are actively spreading true information, or calling out misinformation. The goals of this study are twofold: (i) collecting a diverse set of annotated COVID-19 Twitter dataset that can be used by the research community to conduct meaningful analysis; and (ii) characterizing the two target communities in terms of their network structure, linguistic patterns, and their membership in other communities. Our analyses show that COVID-19 misinformed communities are denser, and more organized than informed communities, with a possibility of a high volume of the misinformation being part of disinformation campaigns. Our analyses also suggest that a large majority of misinformed users may be anti-vaxxers. Finally, our sociolinguistic analyses suggest that COVID-19 informed users tend to use more narratives than misinformed users.


“…The code used for data processing is written in Python 3. The code required to hydrate tweets and to use the provided base release files is available on GitHub. Furthermore, we postulate that this large-scale, multilingual, geotagged social media data can empower multidisciplinary research communities to perform longitudinal studies, evaluate how societies are collectively coping with this unprecedented global crisis as well as to develop computational methods to address real-world challenges, including but not limited to the following:…”

Section: Usage Notes (mentioning)

confidence: 99%

TBCOV: Two Billion Multilingual COVID-19 Tweets with Sentiment, Entity, Geo, and Gender Labels

Imran, Qazi, Ofli

2021

Preprint


The widespread usage of social networks during mass convergence events, such as health emergencies and disease outbreaks, provides instant access to citizen-generated data that carry rich information about public opinions, sentiments, urgent needs, and situational reports. Such information can help authorities understand the emergent situation and react accordingly. Moreover, social media plays a vital role in tackling misinformation and disinformation. This work presents TBCOV, a large-scale Twitter dataset comprising more than two billion multilingual tweets related to the COVID-19 pandemic collected worldwide over a continuous period of more than one year. More importantly, several state-of-the-art deep learning models are used to enrich the data with important attributes, including sentiment labels, named-entities (e.g., mentions of persons, organizations, locations), user types, and gender information. Last but not least, a geotagging method is proposed to assign country, state, county, and city information to tweets, enabling a myriad of data analysis tasks to understand real-world issues. Our sentiment and trend analyses reveal interesting insights and confirm TBCOV's broad coverage of important topics.


“…Some of the datasets further apply language filters [1,10,20,33] or other requirements such as the availability of location information [19]. Instead of filtering from Twitter streaming data, authors of ArCOV-19 [14] collect tweets returned by the Twitter standard search API when using COVID-19 related keywords (e.g. Corona) as queries and written in Arabic.…”

Section: Related Work (mentioning)

confidence: 99%

“…Most of the found datasets are being updated regularly. The number of tweets contained in the 13 datasets range from 747,599 [14] to over 524 million [25] by the time of this study, i.e. 20 May, 2020.…”

Section: Related Work (mentioning)

confidence: 99%

TweetsCOV19 -- A Knowledge Base of Semantically Annotated Tweets about the COVID-19 Pandemic

Dimitrov, Baran, Fafalios et al.

2020

Preprint


Publicly available social media archives facilitate research in the social sciences and provide corpora for training and testing a wide range of machine learning, NLP and information retrieval methods. With respect to the recent outbreak of COVID-19, online discourse on Twitter reflects public opinion and perception related to the pandemic itself as well as mitigating measures and their societal impact. Understanding such discourse, its evolution and interdependencies with real-world events or (mis)information can foster valuable insights. On the other hand, such corpora are crucial facilitators for computational methods addressing tasks such as sentiment analysis, event detection or entity recognition. However, obtaining, archiving and semantically annotating large amounts of tweets is costly. In this paper, we describe TweetsCOV19, a publicly available knowledge base of currently more than 8 million tweets, spanning the period Oct'19-Apr'20. Metadata about the tweets as well as extracted entities, hashtags, user mentions, sentiments, and URLs are exposed using established RDF/S vocabularies, providing an unprecedented knowledge base for a range of knowledge discovery tasks. Next to a description of the dataset and its extraction and annotation process, we present an initial analysis, use cases and usage of the corpus.
