Translated into Arabic, the ASES index was found to possess high metrological qualities. While the index has been satisfactorily validated with regard to a Tunisian population, additional studies are needed to verify its applicability to other Arab populations.
The evolution of information and communication technology has markedly influenced communication between correspondents. This evolution has facilitated the transmission of information and has engendered new forms of written communication (email, chat, SMS, comments, etc.). Most of these messages and comments are written in Latin script, also called
Arabizi
. Moreover, the language used in social media and SMS messaging is characterized by the use of informal and non-standard vocabulary, such as repeated letters for emphasis, typos, non-standard abbreviations, and nonlinguistic content like emoticons. Since the Tunisian dialect suffers from the unavailability of basic tools and linguistic resources compared to Modern Standard Arabic, we resort to the use of these written sources as a starting point to build large corpora automatically. In the context of natural language processing and to benefit from these networks’ data, transliterating from Arabizi to Arabic script is a necessary step because most recently available tools for processing the Tunisian dialect expect Arabic script input. Indeed, the transliteration task can help construct and enrich parallel corpora and dictionaries for the Tunisian dialect and can be useful for developing various natural language processing applications such as sentiment analysis, opinion mining, topic detection, and machine translation. In this article, we focus on converting the Tunisian dialect text that is written in Latin script to Arabic script following the Conventional Orthography for Dialectal Arabic. Then, we propose two models to transliterate Arabizi into Arabic script for the Tunisian dialect, namely a rule-based model and a discriminative model as a sequence classification task based on conditional random fields). In the first model, we use a set of transliteration rules to convert the Tunisian dialect Arabizi texts to Arabic script. In the second model, transliteration is performed both at word and character levels. In the end, our models got a character error rate of 10.47%.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.