Tashkeela: Novel corpus of Arabic vocalized texts, data for auto-diacritization systems

Zerrouki, Taha; Balla, Amar

doi:10.1016/j.dib.2017.01.011

Cited by 59 publications

(29 citation statements)

References 10 publications

(9 reference statements)

“…Mishkal [44] is an application which diacritize a text by generating the possible diacritized word forms through the detection of affixes and the use of a dictionary, then limiting them using semantic relations, and finally choosing the most likely diacritization. Tashkeela-Model [11] uses a basic N-gram language model on character level trained on the Tashkeela corpus [45]. Shakkala [13] is a character-level deep learning system made of an embedding, three bidirectional LSTM, and dense layers.…”

Section: Rule-based Approaches the Used Methods Include Cascading We; citation type: mentioning

confidence: 99%

Multi-components System for Automatic Arabic Diacritization

Abbad

Xiong

2020

Lecture Notes in Computer Science

In this paper, we propose an approach to tackle the problem of the automatic restoration of Arabic diacritics that includes three components stacked in a pipeline: a deep learning model which is a multi-layer recurrent neural network with LSTM and Dense layers, a character-level rule-based corrector which applies deterministic operations to prevent some errors, and a word-level statistical corrector which uses the context and the distance information to fix some diacritization issues. This approach is novel in a way that combines methods of different types and adds edit distance based corrections. We used a large public dataset containing raw diacritized Arabic text (Tashkeela) for training and testing our system after cleaning and normalizing it. On a newly-released benchmark test set, our system outperformed all the tested systems by achieving DER of 3.39% and WER of 9.94% when taking all Arabic letters into account, DER of 2.61% and WER of 5.83% when ignoring the diacritization of the last letter of every word.
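The DER and WER figures quoted in the abstract above are diacritic and word error rates. The following is a minimal sketch of how such rates could be computed, not the authors' evaluation script. It assumes that the reference and the prediction share the same base characters, that only the marks U+064B to U+0652 count as diacritics, and that every non-diacritic character in a word is scored; the function names `split_diacritics` and `der_wer` are hypothetical.

```python
# Arabic diacritic marks (tashkeel), U+064B (fathatan) through U+0652 (sukun).
# Assumption: only these eight marks are treated as diacritics.
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def split_diacritics(word):
    """Split a word into [base_character, attached_diacritics] pairs."""
    pairs = []
    for ch in word:
        if ch not in DIACRITICS:
            pairs.append([ch, ""])
        elif pairs:
            pairs[-1][1] += ch  # attach the mark to the preceding character
    return pairs


def der_wer(reference, prediction, ignore_last=False):
    """Return (DER, WER) as fractions for two whitespace-tokenized texts."""
    ref_words, pred_words = reference.split(), prediction.split()
    if len(ref_words) != len(pred_words):
        raise ValueError("texts must contain the same number of words")
    chars = char_errors = word_errors = 0
    for ref_word, pred_word in zip(ref_words, pred_words):
        ref_pairs = split_diacritics(ref_word)
        pred_pairs = split_diacritics(pred_word)
        if [b for b, _ in ref_pairs] != [b for b, _ in pred_pairs]:
            raise ValueError("base characters differ between the two texts")
        if ignore_last:
            # Skip the last character of every word (case-ending diacritic).
            ref_pairs, pred_pairs = ref_pairs[:-1], pred_pairs[:-1]
        # Compare sorted marks so the order of shadda and vowel does not matter.
        errors = sum(sorted(r[1]) != sorted(p[1])
                     for r, p in zip(ref_pairs, pred_pairs))
        chars += len(ref_pairs)
        char_errors += errors
        word_errors += errors > 0
    der = char_errors / chars if chars else 0.0
    wer = word_errors / len(ref_words) if ref_words else 0.0
    return der, wer


# Two words (8 characters); only the final diacritic of the second word differs.
reference = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064F"
prediction = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064E"
print(der_wer(reference, prediction))                    # (0.125, 0.5)
print(der_wer(reference, prediction, ignore_last=True))  # (0.0, 0.0)
```

In this toy pair the only error sits on the last letter of a word, so both rates drop to zero once final letters are ignored. This mirrors why the abstract reports lower figures under that setting.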

“…In this work, the Tashkeela corpus [45] was mainly used for training and testing our model. This dataset is made of 97 religious books written in the Classical Arabic style, with a small part of web crawled text written in the Modern Standard Arabic style.…”

Section: Dataset; citation type: mentioning

confidence: 99%

Multi-components System for Automatic Arabic Diacritization

Abbad

Xiong

2020

Lecture Notes in Computer Science

“…This corpus contains 6,000 texts (1 billion words). Zerrouki and Balla (2017) propose a large freely available vocalized corpus, containing 75 million words, collected from freely published texts in old books. Asda et al (2016), propose the development of Quran reciter recognition and identification system, based on Mel-Frequency Cepstral Coefficient (MFCC) feature extraction and Artificial Neural Networks.…”

Section: Building Resources (Br); citation type: mentioning

confidence: 99%

Arabic natural language processing: An overview

Guellil

Saadane

Azouaou

et al. 2021

Journal of King Saud University - Computer and Information Scie

Arabic is recognised as the 4th most used language of the Internet. Arabic has three main varieties: (1) classical Arabic (CA), (2) Modern Standard Arabic (MSA), (3) Arabic Dialect (AD). MSA and AD could be written either in Arabic or in Roman script (Arabizi), which corresponds to Arabic written with Latin letters, numerals and punctuation. Due to the complexity of this language and the number of corresponding challenges for NLP, many surveys have been conducted, in order to synthesise the work done on Arabic. However these surveys principally focus on two varieties of Arabic (MSA and AD, written in Arabic letters only), they are slightly old (no such survey since 2015) and therefore do not cover recent resources and tools. To bridge the gap, we propose a survey focusing on 90 recent research papers (74% of which were published after 2015). Our study presents and classifies the work done on the three varieties of Arabic, by concentrating on both Arabic and Arabizi, and associates each work to its publicly available resources whenever available.

“…Corpora [29], the King Saud University corpus of Classical Arabic [30], Alwatan [31], Tashkeela [32] and the Al Khaleej Corpus [33]. The monolingual corpora consist of a raw text written in a single language.…”

Section: Raw Text Corpora Can Be Divided Into; citation type: mentioning

confidence: 99%

BAAC: Bangor Arabic Annotated Corpus

Alkhazi

William

2018

IJACSA

This paper describes the creation of the new Bangor Arabic Annotated Corpus (BAAC), a Modern Standard Arabic (MSA) corpus that comprises 50K words manually annotated by parts-of-speech. For evaluating the quality of the corpus, the Kappa coefficient and a direct percent agreement for each tag were calculated for the new corpus and a Kappa value of 0.956 was obtained, with an average observed agreement of 94.25%. The corpus was used to evaluate the widely used Madamira Arabic part-of-speech tagger and to further investigate compression models for text compressed using part-of-speech tags. Also, a new annotation tool was developed and employed for the annotation process of BAAC.

Keywords: Arabic language; corpus; annotated corpora; analysis results

I. BACKGROUND AND MOTIVATION

The Arabic language "العربية" is acknowledged to be one of the most widely used languages, with 330 million people using the language as their first language, as shown in Table 1, plus 1.4 billion more using it as a secondary language [1]. The majority of the speakers are located across twenty-two nations, primarily in the Middle East, North Africa and Asia, and the United Nations considers the Arabic language as one of its five official languages. The Arabic language is part of the Semitic languages that includes Tigrinya, Amharic, Hebrew, etc., and shares almost the same structure as those languages. It has 28 letters, two genders (feminine and masculine), as well as singular, dual and plural forms. The Arabic language has a right-to-left writing system with the basic grammatical structure that consists of verb-subject-object and other structures, such as VOS, VO and SVO [2]-[4].

TABLE I. THE MOST UNIVERSALLY USED LANGUAGES (columns: Rank, Language, Users (millions))
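The BAAC abstract above reports a Kappa coefficient alongside an observed percent agreement. As a rough illustration of how the two quantities relate, here is a minimal sketch of Cohen's kappa for two annotators' tag sequences. The abstract does not say which Kappa variant was used, so Cohen's two-annotator form is an assumption, and the function name `cohen_kappa` and the toy tags are hypothetical.

```python
from collections import Counter


def cohen_kappa(tags_a, tags_b):
    """Return (observed_agreement, kappa) for two equal-length tag sequences."""
    if len(tags_a) != len(tags_b) or not tags_a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(tags_a)
    # Observed agreement: share of tokens on which both annotators agree.
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # Chance agreement: product of each annotator's marginal tag frequencies.
    count_a, count_b = Counter(tags_a), Counter(tags_b)
    expected = sum(count_a[tag] * count_b[tag] for tag in count_a) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both annotators used one identical tag throughout
    return observed, (observed - expected) / (1 - expected)


# Toy example: four tokens, the annotators disagree on the last one.
annotator_1 = ["NOUN", "NOUN", "VERB", "VERB"]
annotator_2 = ["NOUN", "NOUN", "VERB", "NOUN"]
print(cohen_kappa(annotator_1, annotator_2))  # (0.75, 0.5)
```

Kappa discounts the agreement expected by chance, which is why it can sit well below the raw agreement on small or skewed tag sets, as in this toy case.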
