2013
DOI: 10.4236/jilsa.2013.52009

The Role of Rare Terms in Enhancing the Performance of Polynomial Networks Based Text Categorization

Abstract: In this paper, the role of rare or infrequent terms in enhancing the accuracy of English Text Categorization using Polynomial Networks (PNs) is investigated. To study the impact of rare terms on the accuracy of PNs-based text categorization, different term reduction criteria as well as different term weighting schemes were experimented with on the Reuters Corpus using PNs. Each term weighting scheme on each reduced term set was tested once with the rare terms kept and once with them removed. All t…
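The comparison the abstract describes can be sketched roughly as follows. This is not the authors' code: Polynomial Networks are not available in standard libraries, so a linear SVM stands in for the classifier, the toy corpus is only a placeholder for the Reuters collection, and the min_df cut-off is a simple stand-in for the paper's term-reduction criteria.

```python
# Illustrative sketch only: train the same pipeline twice, once keeping all
# terms (min_df=1) and once dropping terms that occur in a single training
# document (min_df=2), then compare accuracy on held-out documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_docs = ["grain wheat exports rose", "wheat harvest grain prices",
              "oil crude barrel prices", "crude oil output barrel cut"]
train_labels = ["grain", "grain", "oil", "oil"]
test_docs = ["wheat grain shipment", "barrel of crude oil"]
test_labels = ["grain", "oil"]

for setting, min_df in [("rare terms kept", 1), ("rare terms removed", 2)]:
    model = make_pipeline(TfidfVectorizer(min_df=min_df), LinearSVC())
    model.fit(train_docs, train_labels)
    print(f"{setting}: accuracy = {model.score(test_docs, test_labels):.2f}")
```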

Cited by 4 publications (9 citation statements); References 13 publications
“…The significance of low-frequency terms in TC performance has always been debatable. A recent study has shown that keeping low-frequency terms can enhance polynomial network (PN)-based TC of the Reuters Data Set to a great extent, regardless of the term-weighting scheme adopted or the term-reduction method used. The improvement in accuracy recorded when the low-frequency terms were kept was substantial, reaching 17% in some experiments.…”
Section: Introduction
confidence: 61%
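For context on why this debate matters, the quick sketch below (an illustration under my own assumptions, not taken from either paper) counts document frequencies in a toy corpus; in realistic collections the bulk of the vocabulary consists of such rare terms, so keeping or dropping them changes the feature space considerably.

```python
# Count document frequency (number of documents containing each term) and
# report how much of the vocabulary consists of terms seen in only one
# document, i.e. the "rare" or low-frequency terms under discussion.
from collections import Counter

docs = [
    "wheat grain exports rose sharply",
    "crude oil prices fell on weak demand",
    "grain prices rose as wheat supply tightened",
]

df = Counter(term for doc in docs for term in set(doc.split()))
rare = sorted(t for t, n in df.items() if n == 1)

print(f"vocabulary size: {len(df)}, terms in only one document: {len(rare)}")
print("rare terms:", rare)
```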
“…The research conducted in Ref. is extended here to investigate the significance of low-frequency terms in TC using other state-of-the-art TC algorithms. Furthermore, additional performance measures are used here to investigate the significance of low-frequency terms in TC.…”
Section: Introduction
confidence: 99%
“…Chi Square (CHI) is used in the experiments of this research as an FS metric for selecting the most discriminating features in the dataset. CHI has been shown to record high accuracy in classifying both English [7,6,16,61-66] and Arabic [5,6,16,55-58] texts. The CHI FS metric measures the lack of independence between a term and a class.…”
Section: A. Feature Selection (FS)
confidence: 99%
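The lack-of-independence measure mentioned in this statement is usually computed from a 2×2 contingency table of document counts. The minimal sketch below shows that standard formulation; the exact variant used in the cited work is an assumption on my part.

```python
# chi^2(t, c) from document counts:
#   A: docs of class c containing term t   B: docs of other classes containing t
#   C: docs of class c without t           D: docs of other classes without t
def chi_square(A: int, B: int, C: int, D: int) -> float:
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    if denom == 0:
        return 0.0  # term or class never occurs; no evidence either way
    return N * (A * D - C * B) ** 2 / denom

# A term concentrated in class c scores high (strong term/class dependence);
# a term spread proportionally across classes scores zero (independence).
print(chi_square(A=40, B=5, C=10, D=145))   # ~126.4: highly discriminative
print(chi_square(A=10, B=30, C=40, D=120))  # 0.0: independent of the class
```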
“…After deciding on the terms to be selected for building the classifier, the terms are represented in the categorization system using one of the various representations or weights used in the TC literature [3,5,9,14,56,59]: Term Frequency (TF) [14,15,55,57,58], Document Frequency (DF) [55], Weighted IDF [14], Normalized Frequency [7,16,60-64], Boolean [6,55,61,62,64], and other measures like the Cosine coefficient, Dice coefficient and Jaccard coefficient [68]. In this research, Normalized Frequency is used as the weighting scheme for term representation in the Vector Space Model.…”
Section: A. Feature Selection (FS)
confidence: 99%
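"Normalized Frequency" commonly means term counts scaled by the count of the document's most frequent term; the sketch below assumes that variant (the cited paper may define the normalization differently).

```python
# Weight each term of a document by its count divided by the count of the
# document's most frequent term, giving weights in (0, 1] for the Vector
# Space Model representation.
from collections import Counter

def normalized_frequency(doc: str) -> dict[str, float]:
    counts = Counter(doc.split())
    max_count = max(counts.values())
    return {term: n / max_count for term, n in counts.items()}

print(normalized_frequency("wheat grain wheat exports grain wheat"))
# -> {'wheat': 1.0, 'grain': 0.666..., 'exports': 0.333...}
```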