2021
DOI: 10.1007/s10115-021-01581-5
On entropy-based term weighting schemes for text categorization

Abstract: Term weighting schemes often dominate the performance of many classifiers, such as kNN, centroid-based classifiers and SVMs. The term weighting scheme most widely used in text categorization, tf.idf, originated in the information retrieval (IR) field. The intuition behind idf seems less well founded for text categorization than for IR. In this paper, we introduce inverse category frequency (icf) into term weighting and propose two novel approaches, i.e., tf.icf and icf-based supervised term weighting schemes…
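The tf.icf scheme the abstract describes replaces the document-based idf factor with a class-based one. A minimal sketch of the idea on a toy labeled corpus (the corpus, labels, and function names below are illustrative, not the authors' implementation):

```python
import math
from collections import defaultdict

# Toy labeled corpus: (tokens, class label).
corpus = [
    (["goal", "match", "league"], "sports"),
    (["election", "vote", "league"], "politics"),
    (["chip", "goal", "startup"], "tech"),
]

# c_i: number of distinct classes in which each term occurs.
classes_per_term = defaultdict(set)
for tokens, label in corpus:
    for term in tokens:
        classes_per_term[term].add(label)

num_classes = len({label for _, label in corpus})  # C

def tf_icf(tokens, term):
    """tf.icf weight of `term` in the document `tokens`:
    raw term frequency times log(1 + C / c_i)."""
    tf = tokens.count(term)
    icf = math.log(1 + num_classes / len(classes_per_term[term]))
    return tf * icf

doc = corpus[0][0]
# "match" occurs in only one class, so it is weighted higher than
# "goal" or "league", each of which is spread across two classes.
```

Unlike idf, which only counts how many documents contain a term, icf rewards terms concentrated in few classes, which is the discriminative signal a classifier actually needs.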

Cited by 15 publications (17 citation statements)
References 73 publications (65 reference statements)
“…In the supervised text mining task of document classification, different approaches utilizing class information have been proposed to estimate collection-based term weighting factors (Wang and Zhang, 2013; Debole and Sebastiani, 2003; Lan et al., 2009). Inverse category frequency (icf) (Wang and Zhang, 2013) has been shown to produce better classification results than the traditional idf factor with the cosine similarity measure. It considers the distribution of a term among classes rather than among documents in the given collection.…”
Section: Discussion
confidence: 99%
“…It considers the distribution of a term among classes rather than among documents in the given collection. The intuition behind icf is that the fewer classes a term t_i occurs in, the more discriminating power t_i contributes to classification (Wang and Zhang, 2013). If C is the total number of classes and c_i is the number of classes in which t_i occurs in at least one document, then the icf factor is estimated as icf(t_i) = log(1 + C / c_i).…”
Section: Discussion
confidence: 99%
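The icf factor quoted above is a one-line computation. A small sketch following the formula exactly (the function name and example values are illustrative):

```python
import math

def icf(num_classes: int, classes_containing_term: int) -> float:
    """Inverse category frequency: icf(t_i) = log(1 + C / c_i),
    where C is the total number of classes and c_i is the number
    of classes in which term t_i occurs in at least one document."""
    return math.log(1 + num_classes / classes_containing_term)

# A term confined to 1 of 10 classes gets a much larger weight
# than a term that appears in all 10 classes.
rare = icf(10, 1)     # log(11) ≈ 2.398
common = icf(10, 10)  # log(2)  ≈ 0.693
```

The `1 +` inside the logarithm keeps the factor strictly positive even for a term that occurs in every class (c_i = C), where a plain log(C/c_i) would collapse to zero.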