We present the R package clustrd which implements a class of methods that combine dimension reduction and clustering of continuous or categorical data. In particular, for continuous data, the package contains implementations of factorial K-means and reduced K-means; both methods combine principal component analysis with K-means clustering. For categorical data, the package provides MCA K-means, i-FCB and cluster correspondence analysis, which combine multiple correspondence analysis with K-means. Two examples on real data sets are provided to illustrate the usage of the main functions.
Cluster analysis comprises several unsupervised techniques aiming to identify a subgroup (cluster) structure underlying the observations of a data set. The desired cluster allocation is such that it assigns similar observations to the same subgroup. Depending on the field of application and on domain‐specific requirements, different approaches exist that tackle the clustering problem. In distance‐based clustering, a distance metric is used to determine the similarity between data objects. The distance metric can be used to cluster observations by considering the distances between objects directly or by considering distances between objects and cluster centroids (or some other cluster representative points). Most distance metrics, and hence the distance‐based clustering methods, work either with continuous‐only or categorical‐only data. In applications, however, observations are often described by a combination of both continuous and categorical variables. Such data sets can be referred to as mixed or mixed‐type data. In this review, we consider different methods for distance‐based cluster analysis of mixed data. In particular, we distinguish three different streams that range from basic data preprocessing (where all variables are converted to the same scale), to the use of specific distance measures for mixed data, and finally to so‐called joint data reduction (a combination of dimension reduction and clustering) methods specifically designed for mixed data. This article is categorized under: Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification; Statistical Learning and Exploratory Methods of the Data Sciences > Exploratory Data Analysis; Statistical and Graphical Methods of Data Analysis > Dimension Reduction.
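As a rough illustration of the second stream (a distance measure defined directly on mixed data), the sketch below computes Gower's dissimilarity, a widely used measure of this kind. The abstract does not name a specific measure, so the choice of Gower's coefficient is an assumption, and the helper names (`variable_ranges`, `gower_distance`) and the toy records are hypothetical.

```python
def variable_ranges(rows, is_numeric):
    """Return the observed range of each numeric column, or None for categorical ones."""
    ranges = []
    for k, numeric in enumerate(is_numeric):
        if numeric:
            column = [row[k] for row in rows]
            ranges.append(max(column) - min(column))
        else:
            ranges.append(None)
    return ranges


def gower_distance(a, b, ranges):
    """Gower dissimilarity between two records with equal variable weights.

    Numeric variables contribute |x - y| / range; categorical variables
    contribute 0 when the categories match and 1 otherwise.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:
            total += 0.0 if x == y else 1.0
        elif r > 0:
            total += abs(x - y) / r
        # A constant numeric column (range 0) contributes nothing.
    return total / len(ranges)


# Hypothetical toy data: one continuous and one categorical variable.
rows = [(20.0, "red"), (30.0, "red"), (40.0, "blue")]
is_numeric = [True, False]
ranges = variable_ranges(rows, is_numeric)

d01 = gower_distance(rows[0], rows[1], ranges)  # same colour, half the numeric range apart
d02 = gower_distance(rows[0], rows[2], ranges)  # different colour, full numeric range apart
print(d01, d02)
```

The resulting pairwise dissimilarities could then be passed to any clustering method that accepts a precomputed distance matrix.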
A method is proposed that combines dimension reduction and cluster analysis for categorical data by simultaneously assigning individuals to clusters and optimal scaling values to categories in such a way that a single between-variance maximization objective is achieved. In a unified framework, a brief review of alternative methods is provided and we show that the proposed method is equivalent to GROUPALS applied to categorical data. Performance of the methods is appraised by means of a simulation study. The results of the joint dimension reduction and clustering methods are compared with the so-called tandem approach, a sequential analysis in which dimension reduction is followed by cluster analysis. The tandem approach is conjectured to perform worse when variables are added that are unrelated to the cluster structure. Our simulation study confirms this conjecture. Moreover, the results of the simulation study indicate that the proposed method also consistently outperforms alternative joint dimension reduction and clustering methods.
In a relatively short period of time, social media have acquired a prominent role in media and daily life. Although this development brought about several academic endeavors, the literature concerning the analysis of social media data to investigate one's customer base appears to be limited. In this paper, we show how data from the social network site Facebook can be operationalized to gain insight into the individuals connected to a company's Facebook site. In particular, we propose a data collection framework to obtain individual specific data and propose methodology to explore user profiles and identify segments based on these profiles. The proposed data collection framework can be used as an identification step in an analytical customer relationship management implementation that specifically focuses on potential customers. We illustrate our methodology by applying it to the Facebook page of an internationally well-known professional football (soccer) club. In our analysis, we identify four clusters of users that differ with respect to their indicated "liking" profiles.
Multidimensional scaling is a statistical technique to visualize dissimilarity data. In multidimensional scaling, objects are represented as points in a low-dimensional (usually two-dimensional) space, such that the distances between the points match the observed dissimilarities as closely as possible. Here, we discuss what kind of data can be used for multidimensional scaling, what the essence of the technique is, how to choose the dimensionality, how to transform the dissimilarities, and some pitfalls to watch out for when using multidimensional scaling.
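The idea of distances matching dissimilarities "as closely as possible" can be made concrete with a badness-of-fit measure. The sketch below computes raw stress, the sum of squared differences between the given dissimilarities and the Euclidean distances of a candidate configuration. The abstract does not state which loss function is used, so raw stress is an assumption here, and the function names and the three-object example are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def raw_stress(points, dissim):
    """Sum over all pairs i < j of (dissimilarity - configuration distance)^2."""
    total = 0.0
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            total += (dissim[i][j] - euclidean(points[i], points[j])) ** 2
    return total


# Hypothetical dissimilarities among three objects (a 3-4-5 triangle).
dissim = [
    [0, 3, 4],
    [3, 0, 5],
    [4, 5, 0],
]

# A two-dimensional configuration that reproduces the dissimilarities exactly.
exact = [(0, 0), (3, 0), (0, 4)]
# A shrunken configuration that fits poorly.
shrunken = [(0, 0), (1, 0), (0, 1)]

stress_exact = raw_stress(exact, dissim)
stress_shrunken = raw_stress(shrunken, dissim)
print(stress_exact, stress_shrunken)
```

A multidimensional scaling algorithm searches for the configuration that minimizes such a loss; the sketch only evaluates the loss for two fixed configurations and does not perform the minimization.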