This paper introduces a dual-mode stochastic system to automatically identify linguistic code-switch points in Arabic. The first mode determines the most likely word tag (i.e., dialect or Modern Standard Arabic) by choosing the sequence of Arabic word tags with maximum marginal probability via lattice search and 5-gram probability estimation. When a word is out of vocabulary (OOV), the system switches to the second mode, which uses a dialectal Arabic (DA) and a Modern Standard Arabic (MSA) morphological analyzer. If the OOV word is analyzable by the DA morphological analyzer only, it is tagged as "DA"; if it is analyzable by the MSA morphological analyzer only, it is tagged as "MSA"; if it is analyzable by both, it is tagged as "both". The system yields an Fβ=1 score of 76.9% on the development dataset and 76.5% on the held-out test dataset, both judged against human-annotated Egyptian forum data.
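The OOV fallback logic described above reduces to a simple case analysis over the two analyzers' outputs. A minimal sketch, assuming hypothetical `analyzable_da` and `analyzable_msa` predicates standing in for the paper's DA and MSA morphological analyzers (which are not named here):

```python
def tag_oov(word, analyzable_da, analyzable_msa):
    """Tag an out-of-vocabulary word by which morphological
    analyzer(s) can produce an analysis for it."""
    da = analyzable_da(word)
    msa = analyzable_msa(word)
    if da and not msa:
        return "DA"
    if msa and not da:
        return "MSA"
    if da and msa:
        return "both"
    return "unknown"  # neither analyzer accepts the word

# Toy usage with stand-in analyzers:
print(tag_oov("x", lambda w: True, lambda w: False))  # prints "DA"
```

The "unknown" branch is an assumption; the abstract does not say how words rejected by both analyzers are handled.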
In this paper, we address the problem of converting Dialectal Arabic (DA) text written in the Latin script (called Arabizi) into Arabic script following the CODA convention for DA orthography. The presented system uses a finite state transducer trained at the character level to generate all possible transliterations of the input Arabizi words. We then filter the generated list using a DA morphological analyzer, and finally pick the best choice for each input word using a language model. We achieve an accuracy of 69.4% on an unseen test set, compared to 63.1% for a system representing a previously proposed approach.
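The three-stage pipeline above (generate candidates, filter morphologically, rescore with a language model) can be sketched as follows. This is an illustrative skeleton only: `generate_candidates`, `is_valid_da`, and `lm_score` are hypothetical stand-ins for the paper's character-level FST, DA morphological analyzer, and language model:

```python
def transliterate(word, generate_candidates, is_valid_da, lm_score):
    """Pick the best Arabic-script transliteration of an Arabizi word."""
    candidates = generate_candidates(word)   # FST-style expansion
    # Keep only morphologically valid DA forms; if the filter rejects
    # everything, fall back to the unfiltered list (an assumption).
    filtered = [c for c in candidates if is_valid_da(c)] or candidates
    return max(filtered, key=lm_score)       # highest-scoring per the LM

# Toy usage with stand-in components:
best = transliterate(
    "7abibi",
    lambda w: ["حبيبي", "هبيبي"],   # two toy candidate spellings
    lambda c: c == "حبيبي",          # toy validity check
    lambda c: len(c),                 # toy LM score
)
```

In a real system the language model would score each candidate in sentence context rather than in isolation.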
In this paper, we present the latest version of our system for identifying linguistic code switching in Arabic text. The system relies on language models and a tool for Arabic morphological analysis and disambiguation to identify the class of each word in a given sentence. We evaluate the performance of our system on the test datasets of the shared task at the EMNLP workshop on Computational Approaches to Code Switching (Solorio et al., 2014). The system yields average token-level Fβ=1 scores of 93.6%, 77.7%, and 80.1% on the first, second, and surprise-genre test sets, respectively, and tweet-level Fβ=1 scores of 4.4%, 36%, and 27.7% on the same test sets.
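For reference, the Fβ=1 score reported above is the standard harmonic mean of precision and recall; the only difference between the token-level and tweet-level figures is whether counts are taken over individual words or whole tweets. A minimal computation:

```python
def f_beta1(tp, fp, fn):
    """F score with beta = 1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# E.g. 80 true positives, 20 false positives, 20 false negatives:
print(round(f_beta1(80, 20, 20) * 100, 1))  # prints 80.0
```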