2019
DOI: 10.21307/stattrans-2019-013

The Effect of Binary Data Transformation in Categorical Data Clustering

Abstract: This paper focuses on hierarchical clustering of categorical data and compares two approaches which can be used for this task. The first one, an extremely common approach, is to perform a binary transformation of the categorical variables into sets of dummy variables and then use the similarity measures suited for binary data. These similarity measures are well examined, and they occur in both commercial and non-commercial software. However, a binary transformation can possibly cause a loss of information in t…
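To make the two approaches in the abstract concrete, the following is a minimal sketch using only the Python standard library. It dummy-codes a toy categorical data set and computes a binary dissimilarity on the dummies, then computes a dissimilarity directly on the original categories. The Jaccard and simple matching measures, the toy data, and all function names are assumptions chosen for illustration; the visible part of the abstract does not say which measures the paper actually compares.

```python
from itertools import combinations


def dummy_encode(rows):
    """One-hot (dummy) encode rows of categorical values into 0/1 vectors."""
    n_vars = len(rows[0])
    # A sorted category list per variable keeps the column order deterministic.
    categories = [sorted({row[j] for row in rows}) for j in range(n_vars)]
    encoded = []
    for row in rows:
        bits = []
        for j, value in enumerate(row):
            bits.extend(1 if value == c else 0 for c in categories[j])
        encoded.append(bits)
    return encoded


def jaccard_dissimilarity(x, y):
    """Binary measure: 1 - (shared ones) / (positions where either is one)."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return 1.0 - both / either if either else 0.0


def simple_matching_dissimilarity(x, y):
    """Categorical measure: share of variables whose categories differ."""
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / len(x)


# Hypothetical toy data: three objects described by three nominal variables.
rows = [
    ("red", "small", "round"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
]
binary_rows = dummy_encode(rows)

for i, j in combinations(range(len(rows)), 2):
    d_bin = jaccard_dissimilarity(binary_rows[i], binary_rows[j])
    d_cat = simple_matching_dissimilarity(rows[i], rows[j])
    print(f"objects {i},{j}: binary={d_bin:.3f} categorical={d_cat:.3f}")
```

In this sketch the two routes give different dissimilarities for the same pair of objects, because one mismatching variable produces two mismatching dummy columns. Either dissimilarity matrix could then be passed to a hierarchical linkage procedure.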

Citations: cited by 5 publications (3 citation statements)
References: 14 publications
“…In a clustering method, by default, the distance measures like Euclidean distance and Hamming distance are used in clustering methods such as hierarchical clustering. They perform well in most of the homogenous categorical data [18]. In heterogeneous data, the capability of entropy distances is offered.…”
Section: Entropy Distance Measure (citation type: mentioning)
confidence: 99%
“…Transformation: After preprocessing, the data is adjusted to an appropriate form that allows the implementation of the selected data mining technique, for this, different strategies are applied, such as the binarization of states in a variable or the methods of reducing dimensions that allow optimizing the data extraction algorithms that will be used on later stage, thus reducing the number of variables under consideration (Cibulková, Šulc, Sirota, & Řezanková, 2019).…”
Section: KDD Process (citation type: mentioning)
confidence: 99%
“…The general structure of the similarity function S(X, Y) is described as (Boriah et al., 2008): For the purposes of this procedure, the set of similarity measures designed specifically for categorical data, proposed by Boriah et al. (2008), was considered. This type of measure is highly advantageous because it leads to better clusterings than other similarity measures, such as the binary clustering measures (Cibulková et al., 2019). After applying the selection process, two similarity measures were chosen, one for each object of interest or UTO collected through CAQDAS.…”
Section: Selecting a Similarity Measure (citation type: unclassified)