2010
DOI: 10.1007/s10579-010-9132-x

Lessons from building a Persian written corpus: Peykare

Abstract: This paper addresses some of the lessons learned while building 'Peykare', a written language resource for contemporary Persian. After defining five linguistic varieties and 24 registers based on these varieties, we collected texts for Peykare to support linguistic analysis, including the study of cross-register differences. For tokenization of Persian, we propose a descriptive generalization to normalize the orthographic variations found in texts. To annotate Peykare, we use EAGL…

Cited by 60 publications (25 citation statements)
References 17 publications (15 reference statements)
“…In this research several experiments were conducted to convert grapheme sequence to the phoneme sequence for the Persian language using CURRENNT. To train the model two corpuses are used: FarsDat [24] with more than 500,000 words; and labeled section of the Persian written corpus [25] which has over 10 million labeled words. The goal is to map a sequence of graphemes to a sequence of phonemes.…”
Section: Letter Pronunciation Candidates (mentioning)
confidence: 99%
“…UPC is another Farsi corpus developed by Seraji et al [2012] and is a modified version of Bijankhan with additional sentences including 2.7M words annotated with 32 POS tags. Another corpus that includes 10M tagged words is the Peykareh corpus [Bijankhan et al 2011], which unfortunately is not publicly available. In Farsi NLP tasks, the Bijankhan corpus is frequently used, and to make our work comparable, we also used this dataset.…”
Section: Training Corpus (mentioning)
confidence: 99%
“…On the other hand, the lack of the tagging especially in half-space rule leaves more unedited multi-part words in evaluation step. Moreover, available Persian tagged corpus such as Peykare [5] does not comply with half-space character. In this paper, we propose a different statistical approach which uses a fertility-based IBM Model [6] as word alignment by employing a parallel corpus which is created for the special purpose of Persian multi-part word edition.…”
Section: حاصل ضرب (mentioning)
confidence: 99%