NADA: New Arabic Dataset for Text Classification

Alalyani, Nada; Marie-Sainte, Souad Larabi

doi:10.14569/ijacsa.2018.090928

Cited by 17 publications

(10 citation statements)

References 17 publications

(24 reference statements)

“…For Nada, the accuracy results of our SVM and CNN models are shown in table 10. It is clear that the accuracy of our models with accuracy near 100% is superior to those obtained in [44].…”

Section: E Nada (mentioning)

confidence: 56%

A Superior Arabic Text Categorization Deep Model (SATCDM)

Alhawarat; Aseeri. 2020. IEEE Access.

Categorizing Arabic text documents is considered an important research topic in the field of Natural Language Processing (NLP) and Machine Learning (ML). The number of Arabic documents is increasing tremendously every day as new web pages, news articles, and social media content are added. Hence, classifying such documents into specific classes is of high importance to many people and applications. The Convolutional Neural Network (CNN) is a class of deep learning model that has been shown to be useful for many NLP tasks, including text translation and text categorization for the English language. Word embedding is a text representation currently used to represent text terms as real-valued vectors in a vector space that capture both syntactic and semantic traits of text. Current research studies in classifying Arabic text documents use traditional text representations such as bag-of-words and TF-IDF weighting, but few use word embedding. Traditional ML algorithms have already been used in Arabic text categorization, and good results have been achieved. In this study, we present a Multi-Kernel CNN model for classifying Arabic news documents enriched with n-gram word embedding, which we call A Superior Arabic Text Categorization Deep Model (SATCDM). The proposed solution achieves very high accuracy compared to current research in Arabic text categorization using 15 freely available datasets. The model achieves an accuracy ranging from 97.58% to 99.90%, which is superior to similar studies on the Arabic document classification task.

“…This subsection compares the results of this study on NADA dataset with the results obtained by [44]. The authors argue that the low accuracy they obtained for NADA dataset (93.88%) is due to Abuaiadah dataset, where its classification accuracy was around 80%.…”

Section: E Nada (mentioning)

confidence: 73%

“…Arabic text classification research and the goal to enrich the Arabic corpus are slowly becoming a priority in the research community. In [ 31 ], the authors believe that many of the available datasets are not appropriate for classification, either because the classes are not defined well, or there are not any defined classes like in the 1.5 billion words Arabic Corpus [ 11 ]. The authors also introduce 'NADA,' a new filtered and preprocessed corpus, that combine already existing corpora DAA and OSAC.…”

Section: Literature Review (mentioning)

confidence: 99%

Arabic text classification: the need for multi-labeling systems

Rifai; Qadi; Elnagar. 2021. Neural Comput & Applic.

The process of tagging a given text or document with suitable labels is known as text categorization or classification. The aim of this work is to automatically tag a news article based on its vocabulary features. To accomplish this objective, 2 large datasets have been constructed from various Arabic news portals. The first dataset consists of 90k single-labeled articles from 4 domains (Business, Middle East, Technology and Sports). The second dataset has over 290k multi-tagged articles. To examine the single-label dataset, we employed an array of ten shallow learning classifiers. Furthermore, we added an ensemble model that adopts the majority-voting technique over all studied classifiers. The performance of the classifiers on the first dataset ranged between 87.7% (AdaBoost) and 97.9% (SVM). Analyzing some of the misclassified articles confirmed the need for multi-label, as opposed to single-label, categorization for better classification results. For the second dataset, we tested both shallow learning and deep learning multi-labeling approaches. A custom accuracy metric, designed for the multi-labeling task, has been developed for performance evaluation along with the Hamming loss metric. Firstly, we used classifiers that were compatible with multi-labeling tasks, such as Logistic Regression and XGBoost, by wrapping each in a OneVsRest classifier. XGBoost gave the higher accuracy, scoring 84.7%, while Logistic Regression scored 81.3%. Secondly, ten neural networks were constructed (CNN, CLSTM, LSTM, BILSTM, GRU, CGRU, BIGRU, HANGRU, CRF-BILSTM and HANLSTM). CGRU proved to be the best multi-labeling classifier, scoring an accuracy of 94.85%, higher than the rest of the classifiers.

“…There are number of Arabic datasets like, DAA [11] is a dataset in which nine categories have been processed and standardized with 400 documents for each category, Akhbar-Alkhaleej [12] is a popular Arabic Dataset with 5690 Arabic news documents gathered regularly from the online newspaper "Akhbar-Alkhaleej". It consists of five categories: Alwatan [13] is an Arabic Dataset with 20,291 Arabic news documents collected regularly from its online newspaper, Al-Jazeera-News [14] Arabic Dataset (Alj-News) is an Arabic dataset with 1500 documents.…”

Section: Related Work II (mentioning)

confidence: 99%