RETRACTED ARTICLE: Research on semi supervised K-means clustering algorithm in data mining

Mai, Xiaodong; Cheng, Jing; Wang, Shengnan

doi:10.1007/s10586-018-2199-7

Cited by 21 publications

(9 citation statements)

References 9 publications


“…The K-means clustering algorithm will be used in this research, not only because it's one of the most commonly used clustering techniques but also because it has been applied in many scientific and technological fields [6,19,27]. The K-means method has not only suffered from a major problem of which the algorithm produces empty clusters [3] added to that the problem produced by the random nature of cluster's initial centres selection that causes the algorithm to tend to sub optimal solutions [17]. Kmeans clustering algorithm will be used to group transcribed textual documents obtained from audio sources into topics by applying a similarity measure based on the Chi-square method, which is designed to eliminate non informative words that will more likely be erroneous words when applied on transcribed documents [5].…”

Section: K-means Clustering Algorithm (mentioning)

confidence: 99%

Self-Organizing Map vs Initial Centroid Selection Optimization to Enhance K-Means with Genetic Algorithm to Cluster Transcribed Broadcast News Documents

Maghawry, Omar, Badr (2019), IAJIT


This research employs a combination of artificial intelligence techniques to improve the clustering of transcribed text documents obtained from audio sources. Many clustering techniques suffer from drawbacks that can drive the algorithm toward suboptimal solutions, so handling these drawbacks is essential for better clustering results. The main goal of our research is to enhance automatic topic clustering of transcribed speech documents. We compare two approaches on the same data set in terms of accuracy: K-means using our Initial Centroid Selection Optimization (ICSO) [16] with genetic algorithm optimization and a Chi-square similarity measure, and a self-organizing map. The evaluation showed that K-means with ICSO and the genetic algorithm achieved the highest average accuracy.
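The two K-means drawbacks named in the citation statement above (empty clusters and sensitivity to randomly chosen initial centres) can be illustrated with a minimal sketch. This is plain Lloyd's algorithm with two common mitigations, not the ICSO or genetic-algorithm method of the cited paper; the function names `kmeans` and `best_of_restarts`, the re-seeding rule for empty clusters, and the restart count are all assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centres):
    """Index of the centre closest to `point`."""
    return min(range(len(centres)), key=lambda j: sq_dist(point, centres[j]))


def kmeans(points, k, seed, max_iter=100):
    """Lloyd's algorithm with random initial centres.

    Returns (centres, labels, inertia). The initial centres depend on `seed`,
    which is why different runs can end in different local optima.
    """
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        labels = [nearest(p, centres) for p in points]
        new_centres = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append([sum(c) / len(members) for c in zip(*members)])
            else:
                # Empty cluster (assumed fix): re-seed it with the point that is
                # currently farthest from its own centre.
                far = max(range(len(points)),
                          key=lambda i: sq_dist(points[i], centres[labels[i]]))
                new_centres.append(list(points[far]))
        if new_centres == centres:
            break
        centres = new_centres
    # Recompute labels so they are consistent with the final centres.
    labels = [nearest(p, centres) for p in points]
    inertia = sum(sq_dist(p, centres[lab]) for p, lab in zip(points, labels))
    return centres, labels, inertia


def best_of_restarts(points, k, n_restarts=10):
    """Run K-means from several seeds and keep the lowest-inertia result."""
    runs = [kmeans(points, k, seed) for seed in range(n_restarts)]
    return min(runs, key=lambda run: run[2])
```

Keeping the best of several restarts is the simplest guard against a poor random initialisation; optimised seeding schemes such as the one the cited paper proposes aim to reach a good solution without paying for repeated runs.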



“…The parameters involved should be used cautiously as incompatible use of parameters of clustering like, Number of Clusters (k-means) and Density Limit, may lead to situations like improper density shape of clusters, ambiguity in finding centroid and the noise [5][6][7]. Mainly The improved semi supervised K mean clustering is used for the greedy iteration to find the K mean clustering is presented in [8]. In this work, modification of iterative objective function for semi supervised K clustering in dealing with multi-objective optimization problems of insufficient is illustrated.…”

Section: Introduction (mentioning)

confidence: 99%

Prognostic Kalman Filter Based Bayesian Learning Model for Data Accuracy Prediction

Karthik, Bhadoria, Lee et al. (2022), Computers, Materials & Continua


Data is always a crucial concern, especially its prediction and computation in the digital revolution. This paper provides an efficient learning mechanism for accurate prediction and for reducing redundant data communication. It also discusses the Bayesian analysis that finds the conditional probability of at least two parametric based predictions for the data. The paper presents a method for improving the performance of Bayesian classification using a combination of the Kalman filter and K-means. The method is applied to a small dataset only to establish that the proposed algorithm can reduce the time needed to compute clusters from data. The proposed Bayesian learning probabilistic model is used to check for statistical noise and other inaccuracies using unknown variables. This scenario is implemented using an efficient machine learning algorithm to perpetuate the Bayesian probabilistic approach. The paper also demonstrates the generative function for the Kalman-filter based prediction model and its observations. The algorithm is implemented on the open source Python platform, and the different modules are integrated into a single piece of code via Common Platform Enumeration (CPE) for Python.


“…In the common methods of clustering, there is no previous information, and as such, it is called the unsupervised learning method [2,3]; however, in the real world, some information [4] is normally available, or we can obtain from Oracle. This information can be in different forms and can be used in the process of clustering [5][6][7][8][9][10][11][12][13].…”

Section: Introduction (mentioning)

confidence: 99%

“…If the information is presented as pairwise constraints (where a document pair must be in the same cluster (ML), while a document pair should not be located in the same cluster (CL)), and these pairwise constraints are used in the process of clustering, this method will be called pairwise constrained clustering [6,14,15]. Pairwise constraints can be useful in the clustering process in two ways: when enough informative pairwise constraints exist, where the accuracy and efficiency of the clustering can be improved, and when we want to change the process of clustering and personalize it [10,12,16].…”

Section: Introduction (mentioning)

confidence: 99%


Active Learning for Constrained Document Clustering with Uncertainty Region

2020


Constrained clustering is intended to improve accuracy and personalization based on the constraints expressed by an Oracle. In this paper, a new constrained clustering algorithm is proposed in which some of the informative data pairs are selected during an iterative process. They are then presented to the Oracle, which answers with their relation: “Must-link (ML)” or “Cannot-link (CL).” In each iteration, a support vector machine (SVM) is first trained on the labels produced by the current clustering. A distance matrix is created from the distance of each document to the hyperplane, and a similarity matrix is created from the cosine similarity of the word2vector representation of each document. Two types of probability (similarity and degree of similarity) are calculated and smoothed for belonging to neighborhoods. Neighborhoods are formed from the samples that the Oracle has labeled as belonging to the same cluster. Finally, at the end of each iteration, the data with the greatest level of uncertainty (in terms of probability) is selected for questioning the Oracle. For evaluation, the proposed method is compared with well-known state-of-the-art methods on two criteria over a standard dataset. The results demonstrate increased accuracy and stability with fewer questions.
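The must-link (ML) and cannot-link (CL) constraints described in the citation statements above can be made concrete with a small sketch. It shows only the generic feasibility check used by constrained variants of K-means (in the style of COP-KMeans), not the SVM-based active learning procedure of the cited paper; the function name `violates` and the dictionary representation of labels are assumptions made for illustration.

```python
def partner(i, pair):
    """Return the other member of `pair` if item i is in it, else None."""
    a, b = pair
    if a == i:
        return b
    if b == i:
        return a
    return None


def violates(i, cluster, labels, must_link, cannot_link):
    """Check whether assigning item i to `cluster` breaks a pairwise constraint.

    labels: dict mapping already-assigned item indices to cluster ids.
    must_link / cannot_link: lists of (item, item) index pairs from the Oracle.
    """
    for pair in must_link:
        other = partner(i, pair)
        # ML is broken if the partner already sits in a different cluster.
        if other is not None and other in labels and labels[other] != cluster:
            return True
    for pair in cannot_link:
        other = partner(i, pair)
        # CL is broken if the partner already sits in the same cluster.
        if other is not None and labels.get(other) == cluster:
            return True
    return False


# Usage: documents 0 and 1 are already assigned; 2 must join 0, 3 must avoid 1.
labels = {0: 0, 1: 1}
must_link = [(0, 2)]
cannot_link = [(1, 3)]
print(violates(2, 1, labels, must_link, cannot_link))  # True
print(violates(3, 0, labels, must_link, cannot_link))  # False
```

A constrained assignment step would try clusters in order of increasing distance and accept the first one for which this check returns False.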
