2022
DOI: 10.3390/app12115720

BERT Models for Arabic Text Classification: A Systematic Review

Abstract: Bidirectional Encoder Representations from Transformers (BERT) has gained increasing attention from researchers and practitioners as it has proven to be an invaluable technique in natural language processing. This is mainly due to its unique features, including its ability to predict words conditioned on both the left and the right context, and its ability to be pretrained on the plain-text corpora that are abundantly available on the web. As BERT gained more interest, more BERT models were introduced to sup…

Cited by 50 publications (24 citation statements)
References 67 publications (104 reference statements)
“…The least-performing transformer-based method is XLM-R. A plausible explanation is that XLM-R is pretrained on multilingual data and is usually outperformed by monolingual models pretrained with large language-specific datasets and rich vocabularies (Virtanen et al., 2019; Alammary, 2022). MARBERT achieved comparable performance to AraBERT in the micro-averaged F1-score but suffered a performance gap in the macro-averaged F-scores, although MARBERT is pretrained on more data.…”
Section: Results (mentioning); confidence: 99%
“…MARBERT achieved comparable performance to AraBERT in the micro-averaged F1-score but suffered a performance gap in the macro-averaged F-scores, although MARBERT is pretrained on more data. A systematic review of BERT models for various Arabic text classification problems (Alammary, 2022) shows that AraBERT outperformed MARBERT in several tasks and vice versa. It also shows that a large pretraining corpus does not necessarily guarantee better performance.…”
Section: Results (mentioning); confidence: 99%
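The gap between micro- and macro-averaged F1 reported in these excerpts is easier to see with a small numeric illustration. The sketch below uses scikit-learn on hypothetical, imbalanced toy labels (not data from the reviewed studies): the micro average is dominated by the majority class, while the macro average drops sharply when a rare class is missed.

```python
# Minimal sketch: why micro- and macro-averaged F1 can diverge on imbalanced labels.
# The labels below are hypothetical placeholders, not results from the cited papers.
from sklearn.metrics import f1_score

# Imbalanced toy labels: class 0 dominates, class 2 is rare.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]  # the rare class is never predicted

# Micro F1 pools all decisions, so the majority class dominates the score.
print("micro F1:", f1_score(y_true, y_pred, average="micro", zero_division=0))
# Macro F1 averages per-class scores, so the missed rare class pulls it down.
print("macro F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))
```

On this toy example the micro F1 stays at 0.8 while the macro F1 falls to roughly 0.62, which mirrors how a model can look comparable on micro-averaged scores yet lag on macro-averaged ones.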
“…Given the effectiveness of transformer-based models, various transformer models have been used in Arabic sentiment analysis. The most widely used models are Multilingual BERT, AraBERT, and MARBERT [9]. The author in [10] addressed sentiment analysis in Modern Standard Arabic (MSA) and other Arabic dialects such as Levantine, Egyptian, and Gulf.…”
Section: Related Work (mentioning); confidence: 99%
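For context on how such models are typically applied, the sketch below shows a common Hugging Face transformers setup for using an Arabic BERT checkpoint as a sequence classifier. The checkpoint name aubmindlab/bert-base-arabertv2 and the three-label configuration are assumptions made for illustration; they are not specified in the cited works.

```python
# Minimal sketch (assumed checkpoint name, untrained classification head) of loading
# an Arabic BERT model for text classification with Hugging Face transformers.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "aubmindlab/bert-base-arabertv2"  # assumed AraBERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Tokenize a short Arabic sentence and obtain class logits (illustration only;
# the classification head still needs fine-tuning on labeled data).
inputs = tokenizer("هذا المنتج رائع", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # (1, num_labels)
```

In practice the same pattern applies to MARBERT or Multilingual BERT by swapping the checkpoint name, followed by standard fine-tuning on the task's labeled data.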
“…Like others, the region was rife with rumours and fake news. Developing a classification mechanism for Arabic requires an understanding of the syntactic structure of words so that it can represent and manipulate them accurately enough for precise categorization [14]. Research into Arabic text classifiers remains limited compared with the volume of research on English text classifiers.…”
Section: Introduction (mentioning); confidence: 99%