Clustering mixed numerical and categorical data with missing values

Dinh, Duy-Tai; Huynh, Van-Nam; Sriboonchitta, Songsak

doi:10.1016/j.ins.2021.04.076

Cited by 72 publications

(29 citation statements)

References 28 publications


“…Any missing value of the variables representing monthly average temperatures, total monthly rainfalls, days with hail events in a month, days with fog occurrence in a month and days with storm activity in a month were calculated as the average of all the values registered for the corresponding month for all the years included in the dataset. In this context, notice that there are novel proposals for data-mining methods that inherently deal with missing values [21].…”

Section: Data Preprocessing On the Cloud (mentioning)

confidence: 99%

Internet of Things-Driven Data Mining for Smart Crop Production Prediction in the Peasant Farming Domain

et al. 2022


Internet of Things (IoT) technologies can greatly benefit from machine-learning techniques and artificial neural networks for data mining and vice versa. In the agricultural field, this convergence could result in the development of smart farming systems suitable for use as decision support systems by peasant farmers. This work presents the design of a smart farming system for crop production, which is based on low-cost IoT sensors and popular data storage services and data analytics services on the cloud. Moreover, a new data-mining method exploiting climate data along with crop-production data is proposed for the prediction of production volume from heterogeneous data sources. This method was initially validated using traditional machine-learning techniques and open historical data of the northeast region of the state of Puebla, Mexico, which were collected from data sources from the National Water Commission and the Agri-food Information Service of the Mexican Government.



“…In the following, the effective application of HAC on inhomogeneous data of PUH-environments is presented, leading to the obvious suggestion that HAC could be far more intensely used for building classification tasks in urban planning. For a detailed description of a novel clustering algorithm (k-CMM), especially suited to mixed numerical and categorical data, see e.g., Dinh et al [24].…”

Section: Data/buildings Classification and Applied Clustering Algorithms (mentioning)

confidence: 99%

Application of Hierarchical Agglomerative Clustering (HAC) for Systemic Classification of Pop-Up Housing (PUH) Environments

2021


This paper is the result of the first-phase, inter-disciplinary work of a multi-disciplinary research project (“Urban pop-up housing environments and their potential as local innovation systems”) consisting of energy engineers and waste managers, landscape architects and spatial planners, innovation researchers and technology assessors. The project is aiming at globally analyzing and describing existing pop-up housings (PUH), developing modeling and assessment tools for sustainable, energy-efficient and socially innovative temporary housing solutions (THS), especially for sustainable and resilient urban structures. The present paper presents an effective application of hierarchical agglomerative clustering (HAC) for analyses of large datasets typically derived from field studies. As can be shown, the method, although well-known and successfully established in (soft) computing science, can also be used very constructively as a potential urban planning tool. The main aim of the underlying multi-disciplinary research project was to deeply analyze and structure THS and PUE. Multiple aspects are to be considered when it comes to the characterization and classification of such environments. A thorough (global) web survey of PUH and analysis of scientific literature concerning descriptive work of PUH and THS has been performed. Moreover, out of several tested different approaches and methods for classifying PUH, hierarchical clustering algorithms functioned well when properly selected metrics and cut-off criteria were applied. To be specific, the ‘Minkowski’-metric and the ‘Calinski-Harabasz’-criteria, as clustering indices, have shown the best overall results in clustering the inhomogeneous data concerning PUH. Several additional algorithms/functions derived from the field of hierarchical clustering have also been tested to exploit their potential in interpreting and graphically analyzing particular structures and dependencies in the resulting clusters. Hereby, (math.) the significance ‘S’ and (math.) proportion ‘P’ have been concluded to yield the best interpretable and comprehensible results when it comes to analyzing the given set (objects n = 85) of researched PUH-objects together with their properties (n > 190). The resulting easily readable graphs clearly demonstrate the applicability and usability of hierarchical clustering- and their derivative algorithms for scientifically profound building classification tasks in Urban Planning by effectively managing huge inhomogeneous building datasets.


“…(1) In OCS approach, the missing values are viewed as additional attributes to be optimized and then impute missing values at each iteration till it reaches the best estimates, (2) NPS is a OCS modification, which computes the partial distances, and missing values are estimated by their nearest prototype counterparts during each iteration. In the hybrid clustering-based imputation method, Dinh et al [32] proposed a framework of clustering mixed numerical and categorical data with missing values, it used the decision-tree-based method to find the set of correlated data instance and used the mean and kernel-based methods to obtain cluster centers at numerical and categorical attributes, and they applied the dissimilarity measure to calculate the distances between instance and cluster centers.…”

Section: Computational Intelligence Imputation (mentioning)

confidence: 99%

A novel clustering-based purity and distance imputation for handling medical data with missing values

Cheng, Huang 2021, Soft Comput


Nowadays, people pay increasing attention to health, and the integrity of medical records has been put into focus. Recently, medical data imputation has become a very active field because medical data usually have missing values. Many imputation methods have been proposed, but many model-based imputation methods such as expectation-maximization and regression-based imputation based on the variables data have a multivariate normal distribution, which assumption can lead to biased results. Sometimes this becomes a bottleneck, such as computationally more complex than model-free methods. Furthermore, directly remove instances with missing values, this approach has several problems, and it is possible to lose the important data, produce ineffective research samples, and cause research deviations, and so on. Therefore, this study proposes a novel clustering-based purity and distance imputation method to improve the handling of missing values. In the experiment, we collected eight different medical datasets to compare the proposed imputation methods with the listed imputation methods with regard to the results of different situations. In imputation measures, the area under the curve (AUC) is used to evaluate the performance of the imbalanced class datasets in MAR and MCAR experiments, and accuracy is applied to measure its performance of the balanced class in MNAR experiment. Finally, the root-mean-square error (RMSE) is also used to compare the proposed and the listing imputation methods. In addition, this study utilized the elbow method and the average silhouette method to find the optimal number of clusters for all datasets. Results showed that the proposed imputation method could improve imputation performance in the accuracy, AUC, and RMSE of different missing degrees and missing types.

