Top-Down Parameter-Free Clustering of High-Dimensional Categorical Data

Cesario, Eugenio; Manco, Giuseppe; Ortale, Riccardo

doi:10.1109/tkde.2007.190649

Cited by 63 publications

(51 citation statements)

References 34 publications


“…In [7], OLIN, an online classification system, dynamically adjusts The recent research literature has proposed more tractable techniques for anomaly detection and classification [8,9,10,11]. These proposals rely on a common approach to data analysis: they apply dimensionality reduction techniques such as sketches [12,3] or principal components [13,14] to the aggregate network traffic.…”

Section: State of the Art (mentioning)

confidence: 99%

Anomaly Clustering Based on Correspondence Analysis

Islam, Ahmed. 2018

2018 IEEE 32nd International Conference on Advanced Information Networking and Applications (AINA)

“…The algorithm is particularly suitable for large high-dimensional databases, but it is sensitive to a user defined parameter (the repulsion factor), which weights the importance of the compactness/sparseness of a cluster. Other approaches [7], [8], [9], [10] extend the computation of frequencies to frequent patterns in the underlying data set. In particular, each transaction is seen as a relation over some sets of items, and a hyper-graph model is used for representing these relations.…”

Section: Related Work (mentioning)

confidence: 99%

Two Phase Iterative Clustering for Educational Data

Karad, Halgaonkar, Wadhai et al. 2012

IJAIS

In the field of data mining, clustering of educational data has not been given much importance. Considering the growth of the educational field as a business, clustering of educational data deserves attention, as it can give effective results, for example when mining enrolled students on the basis of the education they undertake. We propose and implement a new algorithm for clustering educational data. The algorithm is based on a continuous looping procedure. The raw dataset is initially assigned to the clustering algorithm, and a new cluster whose cluster high degree is lower is identified for partitioning. The degree of that cluster is then improved. In this algorithm, cluster high degree is defined on the basis of homogeneity. Experiments carried out on educational data yield clusters with a good high degree.

“…In recent years there has been an increasing interest to analyze categorical data in a data warehouse context where data sets are rather large and may have a high number of categorical dimensions [4,6,8,15]. However, many traditional techniques associated to the exploration of data sets assume the attributes have continuous data (covariance, density functions, PCA, etc.).…”

Section: The Need to Encode (mentioning)

confidence: 99%

Clustering of Heterogeneously Typed Data with Soft Computing - A Case Study

Kuri-Morales, Trejo-Baños, Cortés-Berrueco. 2011

Advances in Soft Computing

Abstract. The problem of finding clusters in arbitrary sets of data has been attempted using different approaches. In most cases, the use of metrics to determine the adequacy of the clusters is assumed. That is, the criterion yielding a measure of quality of the clusters depends on the distance between the elements of each cluster. Typically, one considers a cluster to be adequately characterized if the elements within a cluster are close to one another while, simultaneously, they appear to be far from those of different clusters. This intuitive approach fails if the variables of the elements of a cluster are not amenable to distance measurements, i.e., if the vectors of such elements cannot be quantified. This case arises frequently in real-world applications where several variables (if not most of them) correspond to categories. The usual tendency is to assign arbitrary numbers to every category: to encode the categories. This, however, may result in spurious patterns: relationships between the variables which are not really there at the outset. It is evident that there is no truly valid assignment which may ensure a universally valid numerical value for this kind of variable. But there is a strategy which guarantees that the encoding will, in general, not bias the results. In this paper we explore such a strategy. We discuss the theoretical foundations of our approach and prove that this is the best strategy in terms of the statistical behavior of the sampled data. We also show that, when applied to a complex real-world problem, it allows us to generalize soft computing methods to find the number and characteristics of a set of clusters. We contrast the characteristics of the clusters obtained from the automated method with those of the experts.
