2017
DOI: 10.1016/j.dib.2017.01.011

Tashkeela: Novel corpus of Arabic vocalized texts, data for auto-diacritization systems

Abstract: Arabic diacritics are often omitted in Arabic script. Their absence is a handicap for new learners reading Arabic, for text-to-speech conversion systems, and for the reading and semantic analysis of Arabic texts. Automatic diacritization systems are the best solution to handle this issue, but such automation needs resources, such as diacritized texts, to train and evaluate these systems. In this paper, we describe our corpus of Arabic diacritized texts. This corpus is called Tashkeela. It can be used as a linguistic resource tool f…
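As a rough illustration of how a vocalized corpus can serve as training and evaluation data, the sketch below derives an (undiacritized input, diacritized target) pair from a vocalized string. It is a minimal sketch under assumptions, not the authors' tooling: the function names `strip_diacritics` and `make_pair` are hypothetical, and the diacritic set is assumed to be the Unicode range U+064B to U+0652 (fathatan through sukun).

```python
import re

# Assumption: the diacritics of interest are the Arabic tashkeel marks
# U+064B (fathatan) through U+0652 (sukun). Other marks are left untouched.
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text):
    """Return the text with the tashkeel marks above removed."""
    return DIACRITICS.sub("", text)


def make_pair(vocalized):
    """Build a hypothetical (input, target) pair for a diacritization system.

    The input is the bare text; the target is the original vocalized text.
    """
    return strip_diacritics(vocalized), vocalized


if __name__ == "__main__":
    # "kataba" written with three fatha marks.
    bare, target = make_pair("\u0643\u064E\u062A\u064E\u0628\u064E")
    print(bare, target)
```

A system trained on such pairs would be evaluated by comparing its output on the bare text against the vocalized target.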

Cited by 59 publications (29 citation statements)
References 10 publications (9 reference statements)
“…Mishkal [44] is an application which diacritize a text by generating the possible diacritized word forms through the detection of affixes and the use of a dictionary, then limiting them using semantic relations, and finally choosing the most likely diacritization. Tashkeela-Model [11] uses a basic N-gram language model on character level trained on the Tashkeela corpus [45]. Shakkala [13] is a character-level deep learning system made of an embedding, three bidirectional LSTM, and dense layers.…”
Section: Rule-based Approaches the Used Methods Include Cascading We (mentioning)
confidence: 99%
“…In this work, the Tashkeela corpus [45] was mainly used for training and testing our model. This dataset is made of 97 religious books written in the Classical Arabic style, with a small part of web crawled text written in the Modern Standard Arabic style.…”
Section: Dataset (mentioning)
confidence: 99%
“…This corpus contains 6,000 texts (1 billion words). Zerrouki and Balla (2017) propose a large freely available vocalized corpus, containing 75 million words, collected from freely published texts in old books. Asda et al (2016), propose the development of Quran reciter recognition and identification system, based on Mel-Frequency Cepstral Coefficient (MFCC) feature extraction and Artificial Neural Networks.…”
Section: Building Resources (Br) (mentioning)
confidence: 99%
“…Corpora [29], the King Saud University corpus of Classical Arabic [30], Alwatan [31], Tashkeela [32] and the Al Khaleej Corpus [33]. The monolingual corpora consist of a raw text written in a single language.…”
Section: ) Raw Text Corpora Can Be Divided Into (mentioning)
confidence: 99%