DIC-DOC-K-means: Dissimilarity-based Initial Centroid selection for DOCument clustering using K-means for improving the effectiveness of text document clustering

Lakshmi, R. Deepa; Baskar, S.

doi:10.1177/0165551518816302

Cited by 19 publications

(14 citation statements)

References 32 publications

(54 reference statements)


“…In this paper, two essential changes have been applied to PCKmeans: (i) in the section of initialization and (ii) in the section of calculating centers of clusters. Furthermore, the penalty of violation from constraints is created automatically [12,47].…”

Section: Methods (mentioning)

confidence: 99%

Active Learning for Constrained Document Clustering with Uncertainty Region

2020


Constrained clustering is intended to improve accuracy and personalization based on the constraints expressed by an Oracle. In this paper, a new constrained clustering algorithm is proposed in which some of the informative data pairs are selected during an iterative process. These pairs are then presented to the Oracle, which answers whether their relation is “Must-link (ML) or Cannot-link (CL).” In each iteration, a support vector machine (SVM) is first trained on the labels produced by the current clustering. A distance matrix is created from the distance of each document to the hyperplane, and a similarity matrix is created from the cosine similarity of the word2vector representation of each document. Two types of probability (similarity and degree of similarity) are calculated and smoothed for belonging to neighborhoods. Neighborhoods are formed from the samples that the Oracle has labeled as belonging to the same cluster. Finally, at the end of each iteration, the data with the greatest level of uncertainty (in terms of probability) are selected for questioning the Oracle. For evaluation, the proposed method is compared with well-known state-of-the-art methods based on two criteria and over a standard dataset. The results demonstrate increased accuracy and stability with fewer questions.



“…The Euclidean distance is used to calculate the distance between other samples and the cluster center, and the sample points are grouped into the class with the closest distance to the cluster center. Then, the mean value of each class is used as the new clustering center, and the samples are re-classified into k classes [48,49]. Thus, iterative calculations are performed until the cluster centroids no longer change.…”

Section: Evaluation Of Uniformity Based On Cluster Analysis 231 Clustering Analysis Algorithm (mentioning)

confidence: 99%

Evaluation of the Uniformity of Protective Coatings on Concrete Structure Surfaces Based on Cluster Analysis

Liu, Zhang, Tang, et al. 2021

Sensors


With the continuous development of urbanization and industrialization in the world, concrete is widely used in various engineering constructions as an engineering material. However, the consequent problem of durability of concrete structures is also becoming increasingly prominent. As an important additional measure, a protective coating can effectively improve the durability of concrete performance. Moreover, the uniformity of the concrete surface coating will directly affect its protective effect. Therefore, we propose a nondestructive inspection and evaluation method of coating uniformity based on infrared imaging and cluster analysis for concrete surface coating uniformity detection and evaluation. Based on the obtained infrared images, a series of processing and analysis of the images were carried out using MATLAB software to obtain the characteristics of the infrared images of the concrete surface. Finally, by extracting the temperature distribution data of the pixel points on the concrete surface, an evaluation method of concrete surface coating uniformity based on a combination of cluster analysis and hierarchical analysis was established. The evaluation results show that the determination results obtained by this method are consistent with the actual situation. This study has a positive contribution to the testing of concrete surface coating uniformity and its evaluation.
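The citation statement for this entry summarizes the standard K-means loop: assign each sample to the nearest centroid by Euclidean distance, recompute each centroid as the mean of its cluster, and repeat until the centroids no longer change. A minimal sketch of that loop is given below. The function name `kmeans`, the `max_iter` cap, and the handling of empty clusters are illustrative assumptions, not details taken from the cited paper.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain K-means on equal-length numeric tuples (illustrative sketch).

    `points` and the initial `centroids` are supplied by the caller;
    `max_iter` is an assumed safety cap, not part of the quoted description.
    """
    centroids = [tuple(float(x) for x in c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each sample joins the cluster whose centroid
        # is closest in Euclidean distance.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)

        # Update step: the mean of each cluster becomes its new centroid.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centroids.append(mean)
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])

        # Stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids

    # Final labels: index of the nearest centroid for each sample.
    labels = [min(range(len(centroids)),
                  key=lambda j: math.dist(p, centroids[j]))
              for p in points]
    return centroids, labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(kmeans(data, [(0, 0), (10, 10)]))
```

The result depends on the initial centroids passed in, which is the sensitivity that initial centroid selection methods such as the one in the cited title aim to reduce.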


“…It is compared with the existing U-K mean method. Lakshmi and Baskar [22] proposed a new initial centroid selection method of K-means document clustering algorithm, namely, DIC doc-K-means initial centroid selection based on dissimilarity, to improve the performance of text document clustering.…”

Section: Introduction (mentioning)

confidence: 99%

Optimization of Human Resource Performance Management System Based on Improved R‐Means Clustering Algorithm

Wang 2022

Journal of Mathematics


With the rapid development of network technology and database technology, computers have been able to store large-scale and massive data. On the other hand, traditional data analysis and processing tools such as management information system can only process these data on the surface, but the deeper data analysis ability is not satisfactory. The contradiction between data supply ability and data analysis ability is becoming more and more prominent, so there is an urgent need for an automation technology that can deeply process data. Data mining technology came into being. Cluster analysis, as an important topic in data mining, is a data mining method that divides data into natural groups and gives the description of the characteristics of each group. It is a basic method of data mining and knowledge discovery. Cluster analysis is a data mining technology for unsupervised classification of data without prior knowledge and guidance. Through the appropriate use of advanced algorithms, it can explore the hidden valuable information, improve the quality of data analysis and interpretation, and provide a scientific judgment basis for the reprocessing or understanding of data by other data analysis and sorting tools. First, this paper briefly introduces the principle, development, and methods of cluster analysis and expounds the application of cluster analysis. Then it expounds the principle of R-means clustering algorithm, analyzes the advantages and disadvantages of basic R-means clustering algorithm, and expounds several existing improvement methods. An improved R-means clustering algorithm and a clustering analysis model based on R-means clustering algorithm are proposed, and the corresponding algorithm flow and implementation are given.
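The citation statement for this entry refers to initial centroid selection based on dissimilarity, but this page does not spell out the procedure. Purely as an illustration of the general idea, the sketch below uses a generic farthest-first heuristic over cosine dissimilarity of document vectors. It is not the DIC-DOC-K-means procedure of Lakshmi and Baskar, and the function names, the choice of the first document as the starting seed, and the toy vectors are all assumptions.

```python
import math


def cosine_dissimilarity(a, b):
    """Return 1 - cosine similarity; treated as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def farthest_first_seeds(vectors, k):
    """Pick k mutually dissimilar documents as initial centroids.

    Generic farthest-first heuristic (an assumption for illustration):
    start from document 0, then repeatedly add the document whose closest
    already-chosen seed is the most dissimilar. Requires k <= len(vectors).
    """
    seeds = [0]  # assumed starting point: the first document
    while len(seeds) < k:
        candidates = (i for i in range(len(vectors)) if i not in seeds)
        best = max(
            candidates,
            key=lambda i: min(cosine_dissimilarity(vectors[i], vectors[s])
                              for s in seeds),
        )
        seeds.append(best)
    return seeds


if __name__ == "__main__":
    # Hypothetical term-weight vectors for four short documents.
    docs = [(1, 0, 0), (0.9, 0.1, 0), (0, 1, 0), (0, 0, 1)]
    print(farthest_first_seeds(docs, 3))
```

Seeds chosen this way are spread across the collection, so a near-duplicate of an already selected document (the second vector in the toy data) is unlikely to be picked as another starting centroid.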
