Arabic Corpus Linguistics: Major Progress, but Still a Long Way to Go

Zeroual, Imad; Lakhouaja, Abdelhak

doi:10.1007/978-3-319-67056-0_29

Cited by 12 publications

(7 citation statements)

References 51 publications


“…Researchers have worked over the past decade to create Arabic corpora (Alfaifi & Atwell, 2016; Zeroual & Lakhouaja, 2018). Several Arabic corpora were mainly derived from newspapers and designed primarily for researchers' projects but could not be accessed online (Al-Thubaity et al., 2013).…”

Section: Arabic Monolingual and Parallel Corpora (mentioning)

confidence: 99%

Building the Leeds Monolingual and Parallel Legal Corpora of Arabic and English Countries’ Constitutions: Methods, Challenges and Solutions

2023


Arabic corpora have existed since the last decade of the past century. Although they are constantly increasing, more advanced tools and morpho-syntactically annotated Arabic corpora are still needed for research and teaching. Likewise, parallel and specialised corpora are rare despite the growing need to use them in empirical linguistic investigations of authentic Arabic texts and for language and translation teaching. Therefore, building legal corpora will pave the way for more research in Arabic legal translation, an area which is under-researched worldwide. This paper aims to discuss the building of a collection of specialised parallel and monolingual legal corpora. In particular, it will discuss the building of diachronic corpora, which include all available constitutions of 22 Arabic countries. The aim of collecting all available versions of these constitutions is two-fold: (1) interdisciplinary corpus-based and socio-cultural investigations and (2) research-led and blended-learning pedagogical approaches to translation teaching and learning. Thus, these corpora are of great value to translation trainers and researchers, law academics and professionals, and governmental, non-governmental and international organisations. The paper will demonstrate the process of building these specialised complex corpora and the challenges encountered throughout this process. Among the challenges faced during the data collection and processing phases are (1) the difficulty of finding the original constitutions for each Arabic country, since some of them date back to 1922; and (2) file conversion and the difficulty of choosing one Optical Character Recognition (OCR) tool to rely on for the Arabic language, since many lack accuracy and efficiency and have encoding issues with Arabic.


“…Several scholars have discussed the difficulties associated with developing natural language processing methods and algorithms for Arabic. These challenges include the ambiguity and complexity of Arabic (Kanan & Fox, 2016; Salloum, Al-emran, & Shaalan, 2016), the prevalence of several commonly used dialects in Arabic (Samih et al., 2017; Zalmout, Erdmann, & Habash, 2018), and the limited number of freely available datasets that can be used in the research and development for Arabic computational solutions (Zeroual & Lakhouaja, 2018). This study further investigates the complexity of Arabic and the problems associated with computational solutions that do not incorporate Arabic dialects.…”

Section: Arabic Natural Language Processing (mentioning)

confidence: 99%

Image Classification in Arabic: Exploring Direct English to Arabic Translations

Alsudais

2019

IEEE Access


Image classification is an ongoing research challenge. Most of the available research focuses on image classification for the English language; however, there is very little research on image classification for the Arabic language. Expanding image classification to Arabic has several applications. The present study investigated a method for generating Arabic labels for images of objects. The method used in this study involved a direct English to Arabic translation of the labels that are currently available on ImageNet, a database commonly used in image classification research. The purpose of this study was to test the accuracy of this method. In this study, 2,887 labeled images were randomly selected from ImageNet. All of the labels were translated from English to Arabic using Google Translate. The accuracy of the translations was evaluated. Results indicated that 65.6% of the Arabic labels were accurate. This study makes three important contributions to the image classification literature: (1) it determined the baseline level of accuracy for algorithms that provide Arabic labels for images, (2) it provided 1,895 images that are tagged with accurate Arabic labels, and (3) it provided the accuracy of translations of image labels from English to Arabic.


“…Many efforts and studies have been devoted to Arabic natural language processing and its applications [1]. Recent years have witnessed remarkable progress in building Arabic corpora [2] and developing robust morphological analyzers [3], which paved the way for highly data-driven approaches like text classification, information retrieval, and machine translation.…”

Section: Introduction (mentioning)

confidence: 99%

The effects of Pre-Processing Techniques on Arabic Text Classification

2021

IJATCSE


In the last two decades, the amount of Arabic text data available on the World Wide Web has grown dramatically, making Arabic the fourth most used language on the web. Accordingly, the demand for efficient Arabic text classification is increasing, especially for web page content filtering, information retrieval, and e-mail spam detection. Several Machine Learning algorithms have been implemented to classify Arabic documents. However, the results achieved are not comparable with those obtained in other languages such as English, primarily when using preprocessing techniques that do not take into consideration the features of the Arabic language. This paper investigates the impact of carefully selected preprocessing techniques on the efficiency of different text classification algorithms. The effects of stop words removal, stemming, lemmatization, and all possible combinations are examined. The reported results (+10.75% to +28.73%) demonstrate the effectiveness of using these techniques either individually or in combination.
