LIMBO: Scalable Clustering of Categorical Data

Andritsos, Periklis; Tsaparas, Panayiotis; Miller, Renée J.; Sevcik, Kenneth C.

doi:10.1007/978-3-540-24741-8_9

Cited by 197 publications

(180 citation statements)

References 12 publications

Supporting

Mentioning

169

Contrasting

Unclassified

“…A text document can be represented either in the form of binary data, when we use the presence or absence of a word in the document in order to create a binary vector. In such cases, it is possible to directly use a variety of categorical data clustering algorithms [10,41,43] on the binary representation. A more enhanced representation would include refined weighting methods based on the frequencies of the individual words in the document as well as frequencies of words in an entire collection (e.g., TF-IDF weighting [82]).…”

Section: Document Classification (mentioning)

confidence: 99%
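
The two representations contrasted in the citation statement above can be sketched as follows. This is a minimal illustration, not code from the cited survey: the function names `binary_vectors` and `tf_idf`, the whitespace tokenizer, and the plain `tf * log(N / df)` weighting are all assumptions chosen for brevity (many TF-IDF variants exist).

```python
import math
from collections import Counter


def binary_vectors(docs):
    """Presence/absence vector per document (hypothetical helper)."""
    tokenized = [set(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*tokenized))
    vectors = [[1 if w in toks else 0 for w in vocab] for toks in tokenized]
    return vocab, vectors


def tf_idf(docs):
    """Term frequency times log inverse document frequency (one common variant)."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(w for toks in tokenized for w in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append([tf[w] * math.log(n_docs / df[w]) for w in vocab])
    return vocab, weights


docs = ["a b b", "b c"]
vocab, vectors = binary_vectors(docs)
_, weights = tf_idf(docs)
```

The binary vectors can be fed directly to a categorical clustering algorithm, since each word behaves like a two-valued attribute. In the TF-IDF variant, a word that occurs in every document receives weight zero, which is the refinement the statement refers to.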

“…Traditional methods for clustering have generally focussed on the case of quantitative data [44,71,50,54,108], in which the attributes of the data are numeric. The problem has also been studied for the case of categorical data [10,41,43], in which the attributes may take on nominal values. A broad overview of clustering (as it relates to generic numerical and categorical data) may be found in [50,54].…”

Section: Introduction (mentioning)

confidence: 99%

A Survey of Text Clustering Algorithms

2012

Clustering is a widely studied data mining problem in the text domains. The problem finds numerous applications in customer segmentation, classification, collaborative filtering, visualization, document organization, and indexing. In this chapter, we will provide a detailed survey of the problem of text clustering. We will study the key challenges of the clustering problem, as it applies to the text domain. We will discuss the key methods used for text clustering, and their relative advantages. We will also discuss a number of recent advances in the area in the context of social network and linked data.

“…It composed of the attribute values with high co-occurrence. In the statistical categorical clustering algorithms [12], [13] such as COOLCAT and LIMBO, data points are grouped based on the statistics. In algorithm COOLCAT, data points are separated in such a way that the expected entropy of the whole arrangements is minimized.…”

Section: Related Work (mentioning)

confidence: 99%
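
The expected-entropy criterion mentioned in the statement above can be illustrated with a short sketch. It assumes that the entropy of a cluster is the sum of its per-attribute entropies (an attribute-independence simplification) and that the expected entropy weights each cluster by its share of the data. The function names are hypothetical, and the sketch only evaluates the criterion for a given arrangement; it does not reproduce the COOLCAT search procedure.

```python
import math
from collections import Counter


def cluster_entropy(rows):
    """Sum of per-attribute entropies (in bits) of one cluster of tuples."""
    if not rows:
        return 0.0
    n = len(rows)
    total = 0.0
    for column in zip(*rows):  # iterate attribute by attribute
        for count in Counter(column).values():
            p = count / n
            total -= p * math.log2(p)
    return total


def expected_entropy(clusters):
    """Size-weighted average entropy over all clusters."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)


# Two arrangements of the same four records.
pure = [[("a", "x"), ("a", "x")], [("b", "y"), ("b", "y")]]
mixed = [[("a", "x"), ("b", "y")], [("a", "x"), ("b", "y")]]
```

An arrangement that groups identical records together (`pure`) has a lower expected entropy than one that mixes them (`mixed`), so an entropy-minimizing method would prefer the former.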

Clustering of Concept Drift Categorical Data using Our-NIR Method

Raju, Reddy, Sunitha et al. 2011

IJCEE

Abstract: The NIR clustering method of Ming-Syan Chen has a deficiency in how it handles the importance of data labeling and outlier detection. The Our-NIR method was introduced to improve on Ming-Syan Chen's method, and in this paper the newly introduced method is used for comparison to improve cluster efficiency. Sampling techniques improve the efficiency of clustering. However, with sampling applied, the points that are not sampled have no labels after the normal process. A straightforward labeling method exists for the numerical domain, but for categorical data the problem remains of how to allocate those unlabeled data points to appropriate clusters efficiently. In this paper the concept-drift phenomenon is studied. We first propose an adaptive threshold for outlier detection, which plays a vital role in cluster detection. Second, probabilistic approaches for cluster detection are proposed using the Our-NIR method.

“…The K-Modes [3] algorithm is an extension of the K-means algorithm for categorical data. General description: The K-Modes algorithm was designed to group large sets of categorical data and its purpose is to obtain K-modes representing the data set and minimizing the criterion function.…”

Section: K-modes Algorithm (mentioning)

confidence: 99%
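
A minimal sketch of the K-Modes idea described in the statement above: K-means with the mean replaced by the per-attribute mode, and Euclidean distance replaced by a count of mismatching attributes. The function names, the caller-supplied `initial_modes`, and the stopping rule are assumptions made for illustration; published variants differ in initialization and tie-breaking.

```python
from collections import Counter


def mismatch(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))


def k_modes(rows, initial_modes, max_iter=100):
    """Alternate between assigning rows to the nearest mode and updating modes."""
    modes = [tuple(m) for m in initial_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row goes to the mode with the fewest mismatches.
        new_labels = [
            min(range(len(modes)), key=lambda j: mismatch(row, modes[j]))
            for row in rows
        ]
        if new_labels == labels:  # assignments stable, so stop
            break
        labels = new_labels
        # Update step: the new mode takes the most frequent value per attribute.
        for j in range(len(modes)):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old mode if a cluster becomes empty
                modes[j] = tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*members)
                )
    return labels, modes


rows = [("red", "small"), ("red", "medium"), ("blue", "large"), ("blue", "large")]
labels, modes = k_modes(rows, initial_modes=[rows[0], rows[2]])
```

The criterion function being reduced is the total number of mismatches between each record and the mode of its cluster. Passing the initial modes explicitly keeps the example deterministic.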

“…This constitutes a frequent problem in data mining applications, which work with high volumes of data. The presence of categorical data is also frequent. There are clustering algorithms [3] [4] [5] that work with large databases and categorical data, like ROCK [6] clustering algorithm, which deals with the size of databases by working with a database random sample. However, the algorithm is highly impacted by size of the sample and randomness.…”

Section: Introduction (mentioning)

confidence: 99%

Data Reduction Method for Categorical Data Clustering

Rendón, Sánchez, Garcia et al.

Advances in Artificial Intelligence – IBERAMIA 2008

Abstract. Categorical data clustering constitutes an important part of data mining; its relevance has recently drawn attention from several researchers. As a step in data mining, however, clustering encounters the problem of the large amount of data to be processed. This article offers a solution for categorical clustering algorithms when working with high volumes of data by means of a method that summarizes the database. This is done using a structure called CM-tree. In order to test our method, the KModes and Click clustering algorithms were used with several databases. Experiments demonstrate that the proposed summarization method improves execution time without losing clustering quality.
