2019
DOI: 10.1007/s13369-019-03920-9
On Term Frequency Factor in Supervised Term Weighting Schemes for Text Classification

Cited by 26 publications (18 citation statements)
References 40 publications
“…Some variants of the classical TF scheme are inverse term frequency (ITF) [19], which normalizes the values to the interval [0,1] based on Zipf's law, and other transformations of the term-frequency values under which extremely frequent terms do not increase at the same rate as in raw TF [5]. In [10], global factors are designed to improve precision, although this may come at the expense of a drop in recall. The rationale behind these factors is that common terms are poor discriminators.…”
Section: Background and Related Work
confidence: 99%
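The damping behaviour described in this excerpt is easy to see numerically. Below is a minimal Python sketch comparing raw TF with a sublinear (logarithmic) transformation; the `itf`-style [0,1) normalization shown is only illustrative and is not necessarily the exact formula of [19]:

```python
import math

def tf_variants(tf: int) -> dict:
    """Compare raw TF with damped transformations of the count."""
    return {
        "raw": tf,                                 # grows linearly with the count
        "log": 1 + math.log(tf) if tf > 0 else 0,  # sublinear: frequent terms grow slowly
        "itf": tf / (1 + tf),                      # illustrative [0,1) normalization
    }

for count in (1, 2, 10, 100):
    print(count, tf_variants(count))
```

For a term seen 100 times, the logarithmic value is only about 5.6, which illustrates why extremely frequent terms "do not increase at the same rate as in TF".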
“…Here, an alternative approach is adopted in this study: the raw TF is square-rooted (named RTF), namely replacing the raw frequency $tf(t_j, d_k)$ with $\sqrt{tf(t_j, d_k)}$. In general, term weighting schemes using a square-root-based TF factor are superior to those using a logarithmic TF factor. Besides, the inverse exponential frequency ($e^{-df(t_j)/N}$) in Equation (5) can be regarded as an adjustment coefficient that reduces TF appropriately.…”
Section: Improved Weighting Scheme and Its Various Variants
confidence: 99%
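Read literally, the excerpt combines two ingredients: a square-rooted TF factor and a multiplicative IEF adjustment. A minimal sketch under that reading (the function name and the plain product form are assumptions of this sketch; the citing paper's full scheme may include further factors):

```python
import math

def rtf_ief(tf: int, df: int, n_docs: int) -> float:
    """Square-root TF (RTF) scaled by inverse exponential frequency (IEF).

    rtf = sqrt(tf(t_j, d_k)), ief = exp(-df(t_j) / N).
    Combining them as a plain product is an assumption of this sketch.
    """
    rtf = math.sqrt(tf)           # damped local factor
    ief = math.exp(-df / n_docs)  # shrinks weights of widespread terms
    return rtf * ief

# A term occurring 9 times in a document and in 40 of 100 documents:
print(rtf_ief(tf=9, df=40, n_docs=100))  # 3 * exp(-0.4) ≈ 2.011
```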
“…In general, term weighting schemes using a square-root-based TF factor are superior to those using a logarithmic TF factor [49]. Besides, the inverse exponential frequency (5) can be regarded as an adjustment coefficient that reduces TF appropriately. Meanwhile, note that the ratio $df(t_j)/N$ never exceeds 1, because the numerator $df(t_j)$ is always less than or equal to the denominator $N$. For this reason, we take the square root of $df(t_j)/N$ inside the exponent (named RIEF) to further reduce the value of $e^{-df(t_j)/N}$, namely replacing $e^{-df(t_j)/N}$ with $e^{-\sqrt{df(t_j)/N}}$.…”
Section: Modified Variations of the TF-IEF Scheme
confidence: 99%
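The RIEF modification can be checked directly: since $df(t_j)/N \le 1$, taking its square root enlarges the exponent, so the exponential shrinks. A short sketch of that comparison (the function names are this sketch's own):

```python
import math

def ief(df: int, n_docs: int) -> float:
    """Inverse exponential frequency: exp(-df/N)."""
    return math.exp(-df / n_docs)

def rief(df: int, n_docs: int) -> float:
    """RIEF: square-root df/N inside the exponent.

    Because df/N <= 1, sqrt(df/N) >= df/N, hence
    exp(-sqrt(df/N)) <= exp(-df/N) for every term.
    """
    return math.exp(-math.sqrt(df / n_docs))

for df in (1, 25, 100):
    print(df, round(ief(df, 100), 4), round(rief(df, 100), 4))
# 1   0.99    0.9048
# 25  0.7788  0.6065
# 100 0.3679  0.3679
```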
“…Feature selection is a nontrivial preprocessing technique that alleviates the problem of high dimensionality. It reduces the number of features by counting the overall frequencies [13], by considering class overlap [14], or by using denoising autoencoders [15], and so forth. A more accurate technique is presented in this article to reduce the original text features to a most relevant subset of significant terms.…”
Section: Introduction
confidence: 99%
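As a concrete illustration of the first option mentioned above (selection by overall frequency counting), here is a rough document-frequency sketch; the `min_df`/`top_k` parameters and thresholds are this sketch's assumptions, not the exact criterion of [13]:

```python
from collections import Counter

def select_by_frequency(docs, min_df=2, top_k=1000):
    """Keep the top_k terms occurring in at least min_df documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document
    kept = [(t, c) for t, c in df.items() if c >= min_df]
    kept.sort(key=lambda tc: tc[1], reverse=True)
    return [t for t, _ in kept[:top_k]]

docs = [["term", "weighting", "scheme"],
        ["term", "frequency", "factor"],
        ["feature", "selection", "term"]]
print(select_by_frequency(docs))  # ['term']
```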