Multi-Cluster Based Approach for skewed Data in Data Mining

Dongre, Snehlata; Malik, Latesh

doi:10.9790/0661-1266673

Cited by 15 publications

(11 citation statements)

References 17 publications

“…Finally, it combines the selected negative instances from the K clusters with all the instances in the minority class. A similar under-sampling method corresponds to the algorithm proposed by Longadge et al. [36], which firstly clusters the majority class instances into K groups using the K-means algorithm and then selects |C+| × IR_i majority class instances from each cluster i, where IR_i denotes the imbalance ratio in the cluster i. Note that the aim of this method is not to obtain a perfectly balanced class distribution, but to reduce the disproportion between the size of the majority and minority classes.…”

Section: Clustering-based Algorithms (mentioning)

confidence: 99%
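
The under-sampling scheme described in the statement above can be sketched roughly as follows. This is an illustrative sketch only, not the cited authors' implementation. The statement does not say how the imbalance ratio of cluster i is computed, so the hypothetical `ratios` argument leaves that value to the caller. The function names `kmeans` and `cluster_undersample`, the fixed iteration count, and the random selection within each cluster are likewise assumptions made for illustration.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns a cluster index for every point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct data points as initial centers
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def cluster_undersample(majority, minority, k, ratios, seed=0):
    """Keep about |C+| * ratios[i] majority points from cluster i.

    ASSUMPTION: ratios[i] stands in for the per-cluster imbalance ratio,
    whose exact definition is not given in the quoted text.
    """
    rng = random.Random(seed)
    labels = kmeans(majority, k, seed=seed)
    kept = []
    for c in range(k):
        members = [p for p, lab in zip(majority, labels) if lab == c]
        # Never request more points than the cluster actually holds.
        quota = min(len(members), round(len(minority) * ratios[c]))
        kept.extend(rng.sample(members, quota))
    # All minority instances are retained unchanged.
    return kept, list(minority)


# Toy data: two well-separated majority blobs and a small minority class.
majority = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.1, 0.1),
            (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (5.1, 5.1)]
minority = [(2.5, 2.5), (2.6, 2.4)]
kept, positives = cluster_undersample(majority, minority, k=2, ratios=[1.0, 1.0])
```

With ratios above 1 the reduced set stays imbalanced, which matches the stated aim of lowering the disproportion between classes rather than removing it.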

“…Regarding the differences between DBMIST-US and the existing clustering-based methods, it is worth pointing out that most under-sampling techniques rely upon the K-means and the fuzzy K-means algorithms [16,26,36–38,40,41,43,51]. However, it is well-known that K-means may not be sufficiently effective when applied to imbalanced data because it always generates clusters with similar sizes [52].…”

Section: Differences Between DBMIST-US and Related Work (mentioning)

confidence: 99%

A New Under-Sampling Method to Face Class Overlap and Imbalance

et al. 2020

Class overlap and class imbalance are two data complexities that challenge the design of effective classifiers in Pattern Recognition and Data Mining as they may cause a significant loss in performance. Several solutions have been proposed to face both data difficulties, but most of these approaches tackle each problem separately. In this paper, we propose a two-stage under-sampling technique that combines the DBSCAN clustering algorithm to remove noisy samples and clean the decision boundary with a minimum spanning tree algorithm to face the class imbalance, thus handling class overlap and imbalance simultaneously with the aim of improving the performance of classifiers. An extensive experimental study shows a significantly better behavior of the new algorithm as compared to 12 state-of-the-art under-sampling methods using three standard classification models (nearest neighbor rule, J48 decision tree, and support vector machine with a linear kernel) on both real-life and synthetic databases.

“…Several studies reported the difficulty of clustering or classification of the Yeast data set. As Longadge et al. (2013) reported, classification of the Yeast data set was done by several classification methods such as K-NN. The K-NN (K = 3) was able to classify the Yeast data set with 0.11% accuracy by F-measure after several epochs and times running the method.…”

Section: Data Sets From UCI Repository (mentioning)

confidence: 99%

A dynamic semisupervised feedforward neural network clustering

Asadi, Kareem, Asadi et al. 2016

AIEDAM

An efficient single-layer dynamic semisupervised feedforward neural network clustering method with one epoch training, data dimensionality reduction, and controlling noise data abilities is discussed to overcome the problems of high training time, low accuracy, and high memory complexity of clustering. Dynamically after the entrance of each new online input datum, the code book of nonrandom weights and other important information about online data as essentially important information are updated and stored in the memory. Consequently, the exclusive threshold of the data is calculated based on the essentially important information, and the data is clustered. Then, the network of clusters is updated. After learning, the model assigns a class label to the unlabeled data by considering a linear activation function and the exclusive threshold. Finally, the number of clusters and density of each cluster are updated. The accuracy of the proposed model is measured through the number of clusters, the quantity of correctly classified nodes, and F-measure. Briefly, in order to predict the survival time, the F-measure is 100% of the Iris, Musk2, Arcene, and Yeast data sets and 99.96% of the Spambase data set from the University of California at Irvine Machine Learning Repository; and the superior F-measure results in between 98.14% and 100% accuracies for the breast cancer data set from the University of Malaya Medical Center. We show that the proposed method is applicable in different areas, such as the prediction of the hydrate formation temperature with high accuracy.

“…The K-NN (K=3) was able to classify the Yeast dataset with 0.11% accuracy by F-measure after several epochs and times running the method. Also, Ahirwar [69] reported the K-means was able to classify the Yeast dataset with 65.00% accuracy by F-measure after several epochs.…”

Section: Yeast Dataset (mentioning)

confidence: 99%

A Single-Layer Semi-Supervised Feed Forward Neural Network Clustering Method

Asadi, Kareem, Asadi et al. 2015

MJCS

The aim of this research is to develop and propose a single-layer semi-supervised feed forward neural network clustering method with one epoch training in order to solve the problems of low training speed, accuracy and high time and memory complexities of clustering. A code book of non-random weights is learned through the input data directly. Then, the best match weight (BMW) vector is mined from the code book, and consequently an exclusive total threshold of each input data is calculated based on the BMW vector. The input data are clustered based on their exclusive total thresholds. Finally, the method assigns a class label to each input data by using a K-step activation function for comparing the total thresholds of the training set and the test set. The class label of other unlabeled and unknown input test data are predicted based on their clusters or trial and
