Automatic diacritization of Arabic text using recurrent neural networks

Abandah, Gheith A.; Graves, Alex; Al-Shagoor, Balkees; Arabiyat, Alaa; Jamour, Fuad; Al-Taee, Majid A.

doi:10.1007/s10032-015-0242-2

Cited by 84 publications

(49 citation statements)

References 25 publications

“…They include hybridization of rules and dictionary retrievals with morphological analysis, N-grams, Hidden Markov Models, Dynamic Programming and Machine Learning methods [5,15,17,20,23,31,35,[37][38][39]42]. Some Deep Learning models improved by rules [2,3] have been developed as well.…”

Section: Rule-based Approaches the Used Methods Include Cascading We (mentioning)

confidence: 99%

Multi-components System for Automatic Arabic Diacritization

Abbad

Xiong

2020

Lecture Notes in Computer Science

In this paper, we propose an approach to the automatic restoration of Arabic diacritics that stacks three components in a pipeline: a deep learning model, which is a multi-layer recurrent neural network with LSTM and Dense layers; a character-level rule-based corrector, which applies deterministic operations to prevent some errors; and a word-level statistical corrector, which uses context and distance information to fix some diacritization issues. The approach is novel in that it combines methods of different types and adds edit-distance-based corrections. We used a large public dataset of raw diacritized Arabic text (Tashkeela) for training and testing our system after cleaning and normalizing it. On a newly released benchmark test set, our system outperformed all the tested systems, achieving a DER of 3.39% and a WER of 9.94% when taking all Arabic letters into account, and a DER of 2.61% and a WER of 5.83% when ignoring the diacritization of the last letter of every word.

“…And the most current work in the area relies on hybrid approaches that combine rule-based and statistical modules [14]. Also, several systems and tools have been developed for the resolution of the ambiguity for different levels of the analysis related to automatic diacritization for works such as [15][16][17][18][19][20][21][22]. Gal [23] used an HMM based on learning done on totally diacritized texts in his work, which achieved 85% good diacritization with some texts belonging to the training corpus.…”

Section: Related Work (mentioning)

confidence: 99%

A Hybrid Approach for the Morpho-Lexical Disambiguation of Arabic

2016

J Inf Process Syst

In order to considerably reduce the ambiguity rate, we propose in this article a disambiguation approach that is based on the selection of the right diacritics at different analysis levels. This hybrid approach combines a linguistic approach with a multi-criteria decision one and could be considered as an alternative choice to solve the morpho-lexical ambiguity problem regardless of the diacritics rate of the processed text. As to its evaluation, we tried the disambiguation on the online Alkhalil morphological analyzer (the proposed approach can be used on any morphological analyzer of the Arabic language) and obtained encouraging results with an F-measure of more than 80%.

“…The Arabic alphabet is the base alphabet used in multiple languages including: Arabic, Persian and Kurdish. The Arabic language has 36 variants (see Figure 1) of the basic 28 letters and eight basic diacritics (see Figure 2) [4].…”

Section: Introduction (mentioning)

confidence: 99%

“…Moreover, manually adding diacritization to clarify the content is time consuming and can only be reliable through linguistics experts specializing in the Arabic language. Thus, the need for an automated diacritization system is eminent [4], [5].…”

Section: Introduction (mentioning)

confidence: 99%

Arabic Text Diacritization Using Deep Neural Networks

Fadel

Tuffaha

Al-Jawarneh

et al. 2019

2019 2nd International Conference on Computer Applications & Information Security (ICCAIS)

Diacritization of Arabic text is both an interesting and a challenging problem, with various applications ranging from speech synthesis to helping students learn the Arabic language. Like many other tasks or problems in Arabic language processing, the weak efforts invested into this problem and the lack of available (open-source) resources hinder the progress towards solving this problem. This work provides a critical review of the currently existing systems, measures and resources for Arabic text diacritization. Moreover, it introduces a much-needed free-for-all cleaned dataset that can be easily used to benchmark any work on Arabic diacritization. Extracted from the Tashkeela Corpus, the dataset consists of 55K lines containing about 2.3M words. After constructing the dataset, existing tools and systems are tested on it. The results of the experiments show that the neural Shakkala system significantly outperforms traditional rule-based approaches and other closed-source tools with a Diacritic Error Rate (DER) of 2.88% compared with 13.78%, which is the best DER for the non-neural approach (obtained by the Mishkal tool).
