kamila: Clustering Mixed-Type Data in R and Hadoop

Foss, Alexander H.; Markatou, Marianthi

doi:10.18637/jss.v083.i13

Cited by 52 publications

(72 citation statements)

References 40 publications

“…For most data sets with multiple nominal variables, this inevitably leads to small sample sizes within each categorical cell. Consider two typical mixed‐type data sets analysed in Foss & Markatou (): the first, a biomedical data set contains five nominal variables measured on 475 patients, while the second contains five nominal variables measured on about 80 million domestic airline flights in the USA. The distribution of counts within the combinatorial cells for each data set is shown in Figure ; in the biomedical data set, the median number of observations per cell is two, and even in the much larger airline data set, 25% of the cells have count less than 16.…”

Section: Statistical Mixture Models (mentioning)

confidence: 99%

Distance Metrics and Clustering Methods for Mixed‐type Data

Foss, Markatou, Ray (2018). Int Statistical Rev. Self Cite.

In spite of the abundance of clustering techniques and algorithms, clustering mixed interval (continuous) and categorical (nominal and/or ordinal) scale data remains a challenging problem. In order to identify the most effective approaches for clustering mixed-type data, we use both theoretical and empirical analyses to present a critical review of the strengths and weaknesses of the methods identified in the literature. Guidelines on approaches to use under different scenarios are provided, along with potential directions for future research.

“…If an inadequate sample size is suspected, KAMILA incorporates a categorical smoother that can ameliorate these issues in most circumstances. The KAMILA method has been implemented in the R package kamila, as well as in Hadoop, with usage recommendations described in Foss & Markatou ().…”

Section: Statistical Mixture Models (mentioning)

confidence: 99%

Distance Metrics and Clustering Methods for Mixed‐type Data

Foss, Markatou, Ray (2018). Int Statistical Rev. Self Cite.

“…Although Modha–Spangler clustering accounts for variable significance within the algorithm, it is vulnerable to individual noninformative variables, due to the fact that the single weight does not allow individual variables to be up‐ or downweighted (Foss, Markatou, Ray, & Heching, ). The Modha–Spangler algorithm is implemented in R package kamila (Foss & Markatou, ).…”

Section: Defining Dissimilarity Measures For Mixed Data (mentioning)

confidence: 99%

Distance‐based clustering of mixed data

Velden, D’Enza, Markos (2018). WIREs Computational Stats.

Cluster analysis comprises several unsupervised techniques aiming to identify a subgroup (cluster) structure underlying the observations of a data set. The desired cluster allocation is such that it assigns similar observations to the same subgroup. Depending on the field of application and on domain‐specific requirements, different approaches exist that tackle the clustering problem. In distance‐based clustering, a distance metric is used to determine the similarity between data objects. The distance metric can be used to cluster observations by considering the distances between objects directly or by considering distances between objects and cluster centroids (or some other cluster representative points). Most distance metrics, and hence the distance‐based clustering methods, work either with continuous‐only or categorical‐only data. In applications, however, observations are often described by a combination of both continuous and categorical variables. Such data sets can be referred to as mixed or mixed‐type data. In this review, we consider different methods for distance‐based cluster analysis of mixed data. In particular, we distinguish three different streams that range from basic data preprocessing (where all variables are converted to the same scale), to the use of specific distance measures for mixed data, and finally to so‐called joint data reduction (a combination of dimension reduction and clustering) methods specifically designed for mixed data. This article is categorized under: Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification; Statistical Learning and Exploratory Methods of the Data Sciences > Exploratory Data Analysis; Statistical and Graphical Methods of Data Analysis > Dimension Reduction.

“…Clustering heterogenous dataset is a challenging process. The outcome of the analysis gives a significant impact on the interpretation of clusters [1,2,3,4]. Moreover, it demanded excessive computational skills and memory storage due to incorporation of broad categories [5].…”

Section: Introduction (mentioning)

confidence: 99%

“…The most common approached in treating heterogeneous data is through converting the variables into a single scale of measurement. However, this method may result in information loss [6,7,4]. Meanwhile, conducting a separate cluster analysis can abandon the connection between the variables which can be inappropriate.…”

Section: Introduction (mentioning)

confidence: 99%

Investigation on the Clusterability of Heterogeneous Dataset by Retaining the Scale of Variables

Shamsuddin, Mahat (2019).

Clustering a dataset with heterogeneous variables is a challenging process owing to the different scales in the data. The paper introduces the SimMultiCorrData package in R to generate artificial datasets for clustering. Constructing artificial datasets with various distributions helps to mimic the nature of real datasets. Our experiments show that the clusterability of a dataset is influenced by various factors such as overlapping clusters, noise, sub-clusters, and unbalanced objects within the clusters.
