Most Arabic Named Entity Recognition (NER) systems depend heavily on external resources and handcrafted feature engineering to achieve state-of-the-art results. To overcome these limitations, we proposed in this paper a deep learning approach to the Arabic NER task. We introduced a neural network architecture based on bidirectional Long Short-Term Memory (LSTM) and Conditional Random Fields (CRF), and experimented with various commonly used hyperparameters to assess their effect on the overall performance of our system. Our model takes two sources of information about words as input, pre-trained word embeddings and character-based representations, and eliminates the need for any task-specific knowledge or feature engineering. We obtained a state-of-the-art result on the standard ANERcorp corpus, with an F1 score of 90.6%.
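To illustrate the role of the CRF layer in such an architecture, the sketch below shows Viterbi decoding, the standard way a linear-chain CRF selects the best tag sequence from per-token scores. It is a minimal pure-Python illustration, not the authors' implementation: the function name `viterbi_decode`, the three-tag set, and all scores are assumptions made for the example. In a real system the emission scores would come from the bidirectional LSTM and the transition scores would be learned.

```python
from typing import List, Sequence


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
    tags: Sequence[str],
) -> List[str]:
    """Return the highest-scoring tag sequence.

    emissions[t][j]   : score of tag j at token position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    if not emissions:
        return []
    n_tags = len(tags)
    # Best score of any path ending in each tag at position 0.
    scores = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_scores, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for reaching tag j at position t.
            best_i = max(range(n_tags), key=lambda i: scores[i] + transitions[i][j])
            new_scores.append(scores[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    best = max(range(n_tags), key=lambda j: scores[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return [tags[i] for i in path]


# Hypothetical scores for a three-token sentence (illustrative values only).
example_tags = ["O", "B-PER", "I-PER"]
example_emissions = [
    [2.0, 0.5, 0.1],
    [0.2, 1.0, 1.2],
    [0.3, 0.1, 1.5],
]
# A large negative score forbids the invalid transition O -> I-PER.
example_transitions = [
    [0.5, 0.0, -1e4],  # from O
    [0.0, -1.0, 1.0],  # from B-PER
    [0.0, -1.0, 0.5],  # from I-PER
]
decoded = viterbi_decode(example_emissions, example_transitions, example_tags)
```

With these illustrative scores, choosing the highest-scoring tag for each token independently would place I-PER directly after O at the second token, which is not a valid BIO sequence. Decoding over the whole sequence instead yields O, B-PER, I-PER, which is the kind of label consistency the CRF layer is meant to add on top of the LSTM outputs.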
With the proliferation of social media and Internet accessibility, a massive amount of data has been produced. In most cases, the textual data available on the web comes mainly from people expressing their views in informal language. The Arabic language is one of the hardest Semitic languages to deal with because of its complex morphology. In this paper, a new contribution to Arabic resources is presented: a large Moroccan dataset retrieved from Twitter and carefully annotated by native speakers. To the best of our knowledge, this dataset is the largest Moroccan dataset for sentiment analysis. It is distinguished by its size, its quality, ensured by the commitment of the annotators, and its accessibility to the research community. Furthermore, the MSTD (Moroccan Sentiment Twitter Dataset) is benchmarked through experiments carried out for 4-way classification as well as polarity classification (positive, negative). Various machine-learning algorithms are combined with feature extraction techniques to reach optimal settings. This work also presents the effect of stemming and lemmatization on the obtained accuracies.
In the last few years, significant amounts of text data have emerged on the different social media platforms, and interest in extracting valuable information from these data has grown accordingly. Named Entity Recognition (NER), a subtask of Natural Language Processing (NLP), remains essential for extracting and classifying entity names from text, whether formal or informal. Nevertheless, the most recent solutions for NER have difficulty adapting to the informal texts used on social media platforms. This work provides a literature review of the papers published in the field of NER on social media from 2014 until now, identifying the particular characteristics of the Arabic dialect compared to the English language.
In this article, we introduce novel features for Arabic Named Entity Recognition (NER) based on Latent Dirichlet Allocation (LDA), a widely used topic modeling technique. We investigate and analyze three different approaches for utilizing LDA, including two newly proposed ones, namely the Topical Prototypes approach and the Topical Word Embeddings approach. Our experiments show that each of the presented approaches improves on the baseline features, with the Word-Class LDA approach performing best. Moreover, the combination of these topic modeling approaches provides additive improvements, outperforming traditional word representations such as Skip-gram word embeddings and Brown Clustering. The proposed LDA-based features, learned in an unsupervised way, are fully language-independent and have proven very effective at enriching and boosting NER models for Arabic, a morphologically rich language.