Initial Seed Selection for Mixed Data Using Modified K-means Clustering Algorithm

Sajidha, S. A.; Desikan, Kalyani

doi:10.1007/s13369-019-04121-0

Cited by 9 publications

(9 citation statements)

References 10 publications


“…In this section, a discussion of various existing techniques is presented, which are used to cluster the categorical data. The advantages and its limitations of these existing techniques [14][15][16][17][18] are also illustrated.…”

Section: Literature Review (mentioning)

confidence: 99%


A Similarity based K-Means Clustering Technique for Categorical Data in Data Mining Application

Kumar

Kanavalli

2021

IJIES


Clustering plays a major role in data mining applications because it divides and groups data effectively. In pattern analysis, two major challenges occur in real-life applications: handling categorical data and the availability of correctly labeled data. Clustering techniques are designed to group unlabeled data according to the characteristics of homogeneity. Various existing algorithms for numerical data suffer from important issues such as high memory utilization, time consumption, overhead, computational complexity and less effective results. Therefore, the research study implemented clustering techniques based on the similarity of categorical data. Simultaneously, the inter- and intra-cluster similarities of the attributes are identified, and the performance of the proposed method is improved by integrating those similarities. Noise is also removed by pre-processing, so the similarity between noise-free elements is estimated. Once these similarities are identified, the insignificant attributes are removed and the relevant attributes are chosen from the preprocessed elements. The overhead is reduced by developing the Similarity-based K-means Clustering (SKC) approach, which clusters the attributes based on divergence distance. The efficiency of SKC is tested experimentally by means of precision, F-measure, accuracy, clustering error rate and recall. The results state that the developed study achieved 98.45% accuracy on the publicly available dataset when compared with the existing techniques: variations of Particle Swarm Optimization (PSO) and a semi-supervised clustering system.



“…Sajidha [17] proposed the modified K-means clustering algorithm by considering every attribute of datasets for selecting the initial seed. The datasets were clustered with the mixed attributes easily because the developed study was independent of user-defined parameters.…”

Section: Literature Review (mentioning)

confidence: 99%

A Similarity based K-Means Clustering Technique for Categorical Data in Data Mining Application

Kumar

Kanavalli

2021

IJIES


“…Different initial seeds may lead to distinct results. It is also difficult to determine the number of clusters due to its nature of a supervised algorithm (Sajidha et al, 2020). By applying the dummy variable for qualitative traits, k-means can be modified as weighted k-means clustering (Huang et al, 2005; Foss and Markatou, 2018), so that it is able to deal with qualitative and quantitative traits at the same time.…”

Section: Introduction (mentioning)

confidence: 99%

A Modified Roger’s Distance Algorithm for Mixed Quantitative–Qualitative Phenotypes to Establish a Core Collection for Taiwanese Vegetable Soybeans

Kao

Wang

et al. 2021

Front. Plant Sci.


Vegetable soybeans [Glycine max (L.) Merr.] have characteristics of larger seeds, less beany flavor, tender texture, and green-colored pods and seeds. Rich in nutrients, vegetable soybeans are conducive to preventing neurological disease. Due to the change of dietary habits and increasing health awareness, the demand for vegetable soybeans has increased. To conserve vegetable soybean germplasms in Taiwan, we built a core collection of vegetable soybeans, with minimum accessions, minimum redundancy, and maximum representation. Initially, a total of 213 vegetable soybean germplasms and 29 morphological traits were used to construct the core collection. After redundant accessions were removed, 200 accessions were retained as the entire collection, which was grouped into nine clusters. Here, we developed a modified Roger’s distance for mixed quantitative–qualitative phenotypes to select 30 accessions (denoted as the core collection) that had a maximum pairwise genetic distance. No significant differences were observed in all phenotypic traits (p-values > 0.05) between the entire and the core collections, except plant height. Compared to the entire collection, we found that most traits retained diversities, but seven traits were slightly lost (ranged from 2 to 9%) in the core collection. The core collection demonstrated a small percentage of significant mean difference (3.45%) and a large coincidence rate (97.70%), indicating representativeness of the entire collection. Furthermore, large values in variable rate (149.80%) and coverage (92.5%) were in line with high diversity retained in the core collection. The results suggested that phenotype-based core collection can retain diversity and genetic variability of vegetable soybeans, providing a basis for further research and breeding programs.


“…Hence, methods to detect outliers and to moderate their effects are needed. The importance of density and distance of data points while identifying the initial seed points for k-means for numerical data, k-modes for categorical data and mixed datasets using modified k-means algorithm, is elucidated in the work [2]-[4] in which the initial seed points were effectively identified. One of the major drawbacks of the partition based clustering algorithm is that they cannot detect the presence of outliers.…”

Section: Introduction (mentioning)

confidence: 99%

A simple, effective distance and density based outlier detection algorithm

Sajidha

Agarwal

Pruthviraj

et al. 2021

IJEECS


Outliers are eccentric data points with anomalous nature. Clustering with outliers has received a lot of attention in the data processing community. However, outliers inordinately affect the quality of the results that popular clustering algorithms obtain while searching for an optimal solution. In this work, we propose a novel method to classify the data points with grouping characteristics as either an outlier or not. We use both distance and density of a particular data point with respect to the rest of the data points for this process. Distances are used to find the points at the extremities while the densities are used to identify the data points at the sparsest spaces. Further, every data model has to take into account the aspect of generalization in order to work robustly even in out of the box situations. Hence, our approach provides a generalization aspect to the model. The accuracy of the proposed work, measured using area under curve (AUC), was found to be highest for the cardioto data set (AUC value 0.90), and the second-highest AUC value was obtained for the Spambase data set (0.52); several other datasets are used to demonstrate the usage of the proposed model.

