2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS) 2016
DOI: 10.1109/icis.2016.7550920
A bi-directional sampling based on K-means method for imbalance text classification

Cited by 38 publications (18 citation statements). References 12 publications.
“…Cluster sampling methods were also used by [27], which introduced cluster density and boundary density thresholds to determine the clusters and the sampling boundary. The literature [28] used a method called bidirectional sampling based on K-means clustering, which performed well even on very noisy data with few samples. Each of these sampling techniques has its benefits and drawbacks, which are subjective and depend on the context of the application and usage [29].…”
Section: A Sampling-Based Techniques
confidence: 99%
“…This method identifies the regions for oversampling by using the clusters, to avoid over-generalization between the samples. Another clustering-based approach, bi-directional sampling based on the k-means method, is proposed in [41]; it uses a hybrid of both resampling techniques, oversampling and undersampling, with k-means for the imbalanced text classification problem. This method eliminates both the between-class and within-class imbalance problems, while avoiding the generation of noise in the data.…”
Section: Background Study
confidence: 99%
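The hybrid resampling idea described in the excerpt above can be sketched as follows. This is a toy illustration, not the cited paper's exact algorithm: it undersamples the majority class by clustering it with k-means and keeping only the points nearest each centroid (which tends to drop noisy boundary samples), then oversamples the minority class by random duplication. All function names, the number of clusters, and the 2:1 undersampling target are my assumptions.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # Plain Lloyd's k-means: init centers from random data points,
    # then alternate assignment and centroid update.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def bidirectional_sample(X_maj, X_min, k=3, seed=0):
    # Undersample the majority class: cluster it, then keep from each
    # cluster only the points closest to the centroid.
    rng = np.random.default_rng(seed)
    centers, labels = kmeans(X_maj, k, seed=seed)
    target = 2 * len(X_min)            # assumed reduced majority size
    per_cluster = max(1, target // k)
    keep = []
    for j in range(k):
        members = np.where(labels == j)[0]
        if len(members) == 0:
            continue
        dist = np.linalg.norm(X_maj[members] - centers[j], axis=1)
        keep.extend(members[np.argsort(dist)[:per_cluster]])
    X_maj_down = X_maj[np.array(keep)]
    # Oversample the minority class by random duplication up to the
    # reduced majority size (duplication avoids synthesizing new,
    # possibly noisy, points).
    idx = rng.choice(len(X_min), size=len(X_maj_down), replace=True)
    X_min_up = X_min[idx]
    return X_maj_down, X_min_up
```

Keeping centroid-near majority points while only duplicating (never synthesizing) minority points is one way to read the excerpt's claim that the method avoids generating noise.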
“…Although approaches for overcoming the challenges posed by data imbalance have been proposed in many previous studies, such as [35][36][37][38][39][40][41][42][43][44][45][46], the issue of imbalanced data in machine learning remains unresolved. In some of the primary studies selected in this SLR, such as [47][48][49], resampling techniques have been applied to address this problem. In addition, reweighting has been applied in previous studies, such as [50][51][52], to address the imbalance problem.…”
Section: Class Imbalance
confidence: 99%
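The reweighting alternative mentioned in the excerpt above is commonly implemented as inverse-frequency class weights, so that rare classes contribute as much total loss as common ones. A minimal sketch of that common formulation (the cited studies' exact schemes may differ):

```python
import numpy as np

def inverse_frequency_weights(y):
    # w_c = n / (k * n_c): n samples total, k classes, n_c in class c.
    # Rare classes get proportionally larger weights.
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

y = np.array([0] * 90 + [1] * 10)   # 9:1 imbalanced labels
w = inverse_frequency_weights(y)
# class 0: 100 / (2 * 90) ≈ 0.556; class 1: 100 / (2 * 10) = 5.0
```

These weights can then be passed to a per-sample loss (each sample weighted by its class's weight) instead of resampling the data itself.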
“…However, among these primary studies, there is sufficient evidence that SLR studies on data preprocessing are lacking, as indicated by the fact that only 2% of the primary studies considered in this study followed SLR guidelines. This finding …”
[Flattened quality-assessment table omitted: each primary study, cited by reference number, is scored N/P/Y on four appraisal criteria with totals ranging from 1.0 to 4.0; only [73] scores Y on all four criteria, for 4.0.]
Section: What Are the Limitations of Current Research?
confidence: 99%