2009
DOI: 10.1016/j.knosys.2009.05.002

Hamshahri: A standard Persian text collection

Cited by 108 publications (44 citation statements)
References 11 publications
“…For Arabic, we use the BACC [16]. For Persian, the Hamshahri corpus is used [17]. The LCMC corpus is used for Chinese [18] and the CEG corpus is used for Welsh [19].…”
Section: Results (mentioning)
confidence: 99%
“…A collection including 3000 images borrowed from Hamshahri dataset was used in our experiments [16]. The main reason for using this dataset was the fact that this dataset is extracted from an online news website so the images have come with very precise details about the category or categories the image belong to and each image is supported with a list of keywords which describe its contents.…”
Section: Results (mentioning)
confidence: 99%
“…The PPM algorithm itself is used unchanged (as a black-box component), and only parameters such as the escape mechanism and the order of the model have been adjusted. The impact of the text preprocessing algorithms are examined using different file sizes and text genres from the Bangor Arabic Compression Corpus (BACC) [9] of Arabic text and other corpora such as the Hamshahri corpus of Persian text [10], the HC corpus of Armenian text [11], the HC corpus of Russian text [11], the LCMC corpus of Chinese text [12], the CEG corpus of Welsh text [13], and the Brown [14] and LOB [15] corpora of American and British English text respectively. …”
Section: Utf-8 Encoding (mentioning)
confidence: 99%
“…Usually, each language has common bigraphs that represent a significant percentage of the text. For example, examining the most frequent 20 bigraphs over five different languages using 500,000 words from various corpora (CCA [16], HC-Vietnamese [11], HCArmenian, Brown and LOB corpora) produces some interesting results as showed in Table 1. The top 20 bigraphs take up almost 10% of the Vietnamese and English texts.…”
Section: Bigraphs and Languages (mentioning)
confidence: 99%
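
The bigraph statistic quoted in the last statement (the share of a text taken up by its 20 most frequent bigraphs) can be approximated with a short Python sketch. The excerpt does not say how the citing paper counts bigraphs, so two choices here are assumptions: bigraphs are taken as overlapping character pairs, and pairs containing whitespace are skipped. The function name `top_bigraph_share` is hypothetical and not from any cited work.

```python
from collections import Counter


def top_bigraph_share(text: str, k: int = 20) -> float:
    """Fraction of all counted bigraphs covered by the k most frequent ones.

    Assumptions (not taken from the cited paper): bigraphs are overlapping
    character pairs, and any pair containing whitespace is ignored.
    """
    # Build every overlapping two-character window in the text.
    pairs = (text[i:i + 2] for i in range(len(text) - 1))
    # Drop pairs that span a word boundary or other whitespace.
    counts = Counter(p for p in pairs if not any(c.isspace() for c in p))

    total = sum(counts.values())
    if total == 0:
        return 0.0  # Empty or whitespace-only input has no bigraphs.

    # Sum the occurrences of the k most common bigraphs.
    top = sum(n for _, n in counts.most_common(k))
    return top / total
```

Calling `top_bigraph_share(corpus_text)` on a sample from a corpus such as Hamshahri would give a figure comparable in spirit to the quoted percentages, though the exact values depend on the counting convention and on any text normalization applied beforehand.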