Proceedings of the 55th Annual Meeting of the Association For Computational Linguistics (Volume 2: Short Papers) 2017
DOI: 10.18653/v1/p17-2072

Methodical Evaluation of Arabic Word Embeddings

Abstract: Many unsupervised learning techniques have been proposed to obtain meaningful representations of words from text. In this study, we evaluate these various techniques when used to generate Arabic word embeddings. We first build a benchmark for the Arabic language that can be utilized to perform intrinsic evaluation of different word embeddings. We then perform additional extrinsic evaluations of the embeddings based on two NLP tasks.
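
To make the evaluation setup concrete, here is a minimal sketch of an analogy-based intrinsic evaluation using gensim; the vector file and benchmark file names are placeholders, not artifacts released with this paper:

```python
# Sketch of an analogy-based intrinsic evaluation with gensim.
# "arabic_vectors.bin" and "arabic_analogies.txt" are placeholder
# names, not artifacts released with this paper.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("arabic_vectors.bin", binary=True)

# The benchmark file uses the standard questions-words format:
# ": section-name" headers, then lines "a b c d" meaning a:b :: c:d.
score, sections = wv.evaluate_word_analogies("arabic_analogies.txt")
print(f"Overall analogy accuracy: {score:.3f}")
for sec in sections:
    total = len(sec["correct"]) + len(sec["incorrect"])
    if total:
        print(sec["section"], len(sec["correct"]) / total)
```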

Cited by 18 publications (13 citation statements) | References 5 publications

Citation statements, ordered by relevance:
“…These vectors capture semantic relations between words: words with similar meanings have vectors that lie close to each other. Building a word embedding model on a large-scale training dataset is important for obtaining meaningful embeddings [47]. We built a word vector model from our entire COVID-19 dataset, collected from January 2020 to April 2020.…”
Section: Misinformation Headline Tweet Examples
Citation type: mentioning (confidence: 99%)
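
A minimal sketch of the idea in this excerpt: train word2vec with gensim and check that related words receive nearby vectors. The tiny corpus below is a toy stand-in for the cited COVID-19 tweet dataset:

```python
# Toy sketch: train word2vec and check that related words get nearby
# vectors. The three "tweets" below are stand-ins for the cited
# COVID-19 corpus; real training needs a large dataset, as the
# excerpt notes.
from gensim.models import Word2Vec

tweets = [
    ["coronavirus", "vaccine", "trial", "results"],
    ["covid", "vaccine", "dose", "approved"],
    ["new", "covid", "cases", "reported", "coronavirus"],
]

model = Word2Vec(sentences=tweets, vector_size=100, window=5,
                 min_count=1, workers=1, epochs=20, seed=42)

# Cosine similarity between two word vectors.
print(model.wv.similarity("covid", "coronavirus"))
# Nearest neighbours by cosine similarity.
print(model.wv.most_similar("vaccine", topn=3))
```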
“…Word embedding tools, technologies, and pre-trained models are widely available for resource-rich languages such as English (Mikolov et al., 2013; Pennington et al., 2014) and Chinese (Li et al., 2018; Chen et al., 2015). Due to the wide use of word embeddings, pre-trained models are increasingly available for resource-poor languages as well, such as Portuguese (Hartmann et al., 2017), Arabic (Elrazzaz et al., 2017; Soliman et al., 2017), and Bengali (Ahmad and Amin, 2016).…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
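
Loading such a pre-trained model typically takes one call in gensim; the file name below is a placeholder (e.g. for an AraVec release), and the actual distribution format may differ:

```python
# Hypothetical loading of a pre-trained Arabic model such as AraVec
# (Soliman et al., 2017). The file name is a placeholder and the actual
# release format (full model vs. word2vec text/binary) may differ.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("aravec_twitter_cbow.bin",
                                       binary=True)
print(wv.most_similar("مصر", topn=5))  # nearest neighbours of "Egypt"
```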
“…Most of the proposed evaluation schemes are based on the word analogies introduced by Mikolov et al. (2013b) for English. For Arabic, Elrazzaz et al. (2017) created a benchmark that can be used to perform intrinsic evaluation of different word embeddings.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
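
The analogy scheme referenced here is usually implemented as 3CosAdd: for a:b :: c:?, return the vocabulary word whose vector is most cosine-similar to b - a + c, excluding the query words. A self-contained sketch with toy vectors:

```python
# Self-contained 3CosAdd sketch with toy 3-d vectors (values are
# illustrative only): answer a:b :: c:? by the word nearest to
# b - a + c, excluding the three query words.
import numpy as np

emb = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "man":   np.array([0.7, 0.1, 0.1]),
    "woman": np.array([0.1, 0.2, 0.8]),
    "queen": np.array([0.2, 0.7, 0.8]),
}

def cos(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

a, b, c = "man", "king", "woman"          # man:king :: woman:?
target = emb[b] - emb[a] + emb[c]
best = max((w for w in emb if w not in (a, b, c)),
           key=lambda w: cos(target, emb[w]))
print(best)  # expected: "queen"
```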