Hamshahri: A standard Persian text collection

AleAhmad, Abolfazl; Amiri, Hadi; Darrudi, Ehsan; Rahgozar, Maseud; Oroumchian, Farhad

doi:10.1016/j.knosys.2009.05.002

Cited by 108 publications

(44 citation statements)

References 11 publications

“…For Arabic, we use the BACC [16]. For Persian, the Hamshahri corpus is used [17]. The LCMC corpus is used for Chinese [18] and the CEG corpus is used for Welsh [19].…”

Section: Results (mentioning)

confidence: 99%

Grammar Based Pre-Processing for PPM

Teahan¹,

Aljehane²

2017

IJCSIT

“…A collection including 3000 images borrowed from Hamshahri dataset was used in our experiments [16]. The main reason for using this dataset was the fact that this dataset is extracted from an online news website so the images have come with very precise details about the category or categories the image belong to and each image is supported with a list of keywords which describe its contents.…”

Section: Results (mentioning)

confidence: 99%

A Method for Image Retrieval using Combination of Color and Frequency Layers

Azodinia¹,

Hajdú²

2015

IJCA

In this paper a fast and effective noise-resistant method for image retrieval has been proposed. In this method, first, the image is decomposed into different frequency layers using complex wavelet transform so as to make it possible to extract the texture features of the image. Thereafter, in the HSV color space, each layer is quantized into 166 different colors and the color histogram is calculated for each layer. Furthermore, a number of statistical features are extracted from each subimage using complex wavelet transform, which are used along with other features for image retrieval. In order to verify the effectiveness of the proposed method, it has been evaluated using a dataset containing 3000 images and compared to a competent method in this field. The results prove the superiority of the proposed method.

General Terms: Image retrieval system, image processing, color histograms, texture and statistical features.

Keywords: Color feature, complex wavelet transform, content-based image retrieval, feature extraction, histogram, image processing, texture feature.

“…The PPM algorithm itself is used unchanged (as a black-box component), and only parameters such as the escape mechanism and the order of the model have been adjusted. The impact of the text preprocessing algorithms are examined using different file sizes and text genres from the Bangor Arabic Compression Corpus (BACC) [9] of Arabic text and other corpora such as the Hamshahri corpus of Persian text [10], the HC corpus of Armenian text [11], the HC corpus of Russian text [11], the LCMC corpus of Chinese text [12], the CEG corpus of Welsh text [13], and the Brown [14] and LOB [15] corpora of American and British English text respectively. …”

Section: Utf-8 Encoding (mentioning)

confidence: 99%

“…Usually, each language has common bigraphs that represent a significant percentage of the text. For example, examining the most frequent 20 bigraphs over five different languages using 500,000 words from various corpora (CCA [16], HC-Vietnamese [11], HCArmenian, Brown and LOB corpora) produces some interesting results as showed in Table 1. The top 20 bigraphs take up almost 10% of the Vietnamese and English texts.…”

Section: Bigraphs and Languages (mentioning)

confidence: 99%
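
The bigraph statistic quoted above (the share of a text taken up by its 20 most frequent bigraphs) can be illustrated with a minimal sketch. It assumes a bigraph is any pair of adjacent characters, including spaces and punctuation; the cited paper may define and count bigraphs differently, and the function name `top_bigraph_share` is hypothetical.

```python
from collections import Counter


def top_bigraph_share(text: str, k: int = 20) -> float:
    """Return the fraction of all adjacent-character pairs in `text`
    that belong to the k most frequent pairs.

    Assumption: a bigraph is any two adjacent characters, with no
    filtering of whitespace or punctuation.
    """
    # Every overlapping pair of adjacent characters.
    bigraphs = [text[i:i + 2] for i in range(len(text) - 1)]
    if not bigraphs:
        return 0.0
    counts = Counter(bigraphs)
    # Total occurrences of the k most frequent bigraphs.
    covered = sum(n for _, n in counts.most_common(k))
    return covered / len(bigraphs)
```

Calling `top_bigraph_share(sample, k=20)` on a sample read from a corpus file gives a figure comparable in spirit to the percentages the excerpt mentions, although the exact values depend on how the text is tokenized and normalized.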

Preprocessing for PPM: Compressing Utf-8 Encoded Natural Language Text

J.Teahan¹,

M.Alhawiti²

2015

IJCSIT

Keywords: Preprocessing, PPM, UTF-8, Encoding.

Background: Prediction by Partial Matching (PPM). One of the most powerful text compression techniques is Prediction by Partial Match (PPM), which was first introduced by Cleary and Witten [1]. A series of improvements have been applied to the original PPM algorithm, such as the PPMC version by Moffat [2] and PPM* by Cleary & Teahan [3]. The PPM text compression algorithm applies a statistical approach; it simply uses the set of previous symbols to predict the upcoming symbol in the stream. Variants of the PPM algorithm (such as PPMC and PPMD) are distinguished by the escape mechanism used to back off to lower order models when new symbols are encountered in the context. PPM has also been applied successfully to many natural language processing (NLP) applications such as cryptology, language identification, and text correction [4], [5]. Abel and Teahan [6] presented several universal text preprocessing techniques that they applied prior to the application of various standard text compression algorithms. They found that in many cases the compression performance was significantly improved by applying the text processing techniques. In order to recover the original file during decoding, the decompression algorithm was applied first, and then postprocessing was performed that reversed the effect of the preprocessing stage.
