Relationship-Based Clustering and Visualization for High-Dimensional Data Mining

Strehl, Alexander L.; Ghosh, Joydeep

doi:10.1287/ijoc.15.2.208.14448

Cited by 167 publications (191 citation statements)

References 45 publications

Supporting

Mentioning

188

Contrasting

Unclassified


“…To evaluate the results of the test cases we used external quality measures described in [Str02], purity, F-measure, entropy and mutual information. These measures are defined as follows.…”

Section: Discussion (mentioning)

confidence: 99%

A Method for Similarity-Based Grouping of Biological Data

Jakoniené, Rundqvist, Lambrix. 2006. Lecture Notes in Computer Science


Abstract. Similarity-based grouping of data entries in one or more data sources is a task underlying many different data management tasks, such as, structuring search results, removal of redundancy in databases and data integration. Similarity-based grouping of data entries is not a trivial task in the context of life science data sources as the stored data is complex, highly correlated and represented at different levels of granularity. The contribution of this paper is two-fold. 1) We propose a method for similarity-based grouping and 2) we show results from test cases. As the main steps the method contains specification of grouping rules, pairwise grouping between entries, actual grouping of similar entries, and evaluation and analysis of the results. Often, different strategies can be used in the different steps. The method enables exploration of the influence of the choices and supports evaluation of the results with respect to given classifications. The grouping method is illustrated by test cases based on different strategies and classifications. The results show the complexity of the similarity-based grouping tasks and give deeper insights in the selected grouping tasks, the analyzed data source, and the influence of different strategies on the results.


“…[Strehl, 2002] compares several metrics according to their different biases and scaling properties: purity and entropy are extreme cases where the bias is towards small clusters, because they reach a maximal value when all clusters are of size one. Combining precision and recall via a balanced F measure, on the other hand, favors coarser clusterings, and random clusterings do not receive zero values (which is a scaling problem).…”

Section: Motivation (mentioning)

confidence: 99%

A comparison of extrinsic clustering evaluation metrics based on formal constraints

et al. 2008


There is a wide set of evaluation metrics available to compare the quality of text clustering algorithms. In this article, we define a few intuitive formal constraints on such metrics which shed light on which aspects of the quality of a clustering are captured by different metric families. These formal constraints are validated in an experiment involving human assessments, and compared with other constraints proposed in the literature. Our analysis of a wide range of metrics shows that only BCubed satisfies all formal constraints. We also extend the analysis to the problem of overlapping clustering, where items can simultaneously belong to more than one cluster. As BCubed cannot be directly applied to this task, we propose a modified version of BCubed that avoids the problems found with other metrics.
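The small-cluster bias described in the citation statement for this entry can be checked with a short sketch. It assumes one common formulation of cluster entropy (the size-weighted average of the class-label entropy inside each cluster, where lower is better); the cited works may normalize differently, and the function name `cluster_entropy` is hypothetical.

```python
import math
from collections import Counter, defaultdict


def cluster_entropy(cluster_ids, class_labels):
    """Size-weighted average class entropy over clusters, in bits (lower is better).

    Assumed formulation, not necessarily the one used in the cited papers.
    """
    if len(cluster_ids) != len(class_labels) or not cluster_ids:
        raise ValueError("inputs must be non-empty and of equal length")

    # Count how often each class label occurs inside each cluster.
    per_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        per_cluster[cluster][label] += 1

    n = len(cluster_ids)
    total = 0.0
    for counts in per_cluster.values():
        size = sum(counts.values())
        # Entropy of the class distribution within this cluster.
        h = -sum((k / size) * math.log2(k / size) for k in counts.values())
        total += (size / n) * h
    return total


labels = ["a", "a", "b", "b"]
print(cluster_entropy([0, 0, 0, 0], labels))  # everything in one cluster
print(cluster_entropy([0, 0, 1, 1], labels))  # clusters match the classes
print(cluster_entropy([0, 1, 2, 3], labels))  # every item in its own cluster
```

Under this formulation the all-singletons clustering scores exactly as well as the clustering that matches the classes, which is the bias the statement points out.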


“…To compute purity (Strehl 2002), each induced phone ω_l is assigned to the labeled phone c_i whose frames are most frequent in ω_l, and then the accuracy of this assignment is measured by counting the number of correctly assigned frames and dividing by the total number of frames N:…”

Section: Evaluation Measures For Segmentation (mentioning)

confidence: 99%

A Computational Model of Unsupervised Speech Segmentation for Correspondence Learning

Duran, Schütze, Möbius, et al. 2010. Res on Lang and Comput


In this paper, we develop a new conceptual framework for an important problem in language acquisition, the correspondence problem: the fact that a given utterance has different manifestations in the speech and articulation of different speakers and that the correspondence of these manifestations is difficult to learn. We put forward the Correspondence-by-Segmentation Hypothesis, which states that correspondence is primarily learned by first segmenting speech in an unsupervised manner and then mapping the acoustics of different speakers onto each other. We show that a rudimentary segmentation of speech can be learned in an unsupervised fashion. We then demonstrate that, using the previously learned segmentation, different instances of a word can be mapped onto each other with high accuracy when trained on utterance-label pairs for a small set of words.
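The purity computation quoted in the citation statement for this entry (assign each induced cluster to its most frequent label, count the correctly assigned items, divide by N) can be sketched as follows. This is a minimal illustration of that verbal description, not the authors' code; the function name `purity` and the toy data are assumptions.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, class_labels):
    """Fraction of items whose label is the majority label of their cluster."""
    if len(cluster_ids) != len(class_labels) or not cluster_ids:
        raise ValueError("inputs must be non-empty and of equal length")

    # Count label frequencies within each induced cluster.
    per_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        per_cluster[cluster][label] += 1

    # Each cluster is assigned its most frequent label; items carrying
    # that label count as correctly assigned.
    correct = sum(max(counts.values()) for counts in per_cluster.values())
    return correct / len(cluster_ids)


# Toy data (hypothetical): six frames, two induced clusters, three labels.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["a", "a", "b", "b", "b", "c"]
print(purity(clusters, labels))
```

In the toy data each cluster has two frames carrying its majority label, so four of the six frames count as correct.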
