Abstract:I n several real-life data-mining applications, data reside in very high (1000 or more) dimensional space, where both clustering techniques developed for low-dimensional spaces (k-means, BIRCH, CLARANS, CURE, DBScan, etc.) as well as visualization methods such as parallel coordinates or projective visualizations, are rendered ineffective. This paper proposes a relationship-based approach that alleviates both problems, side-stepping the "curseof-dimensionality" issue by working in a suitable similarity space in… Show more
“…To evaluate the results of the test cases we used external quality measures described in [Str02], purity, F-measure, entropy and mutual information. These measures are defined as follows.…”
Abstract. Similarity-based grouping of data entries in one or more data sources is a task underlying many different data management tasks, such as, structuring search results, removal of redundancy in databases and data integration. Similarity-based grouping of data entries is not a trivial task in the context of life science data sources as the stored data is complex, highly correlated and represented at different levels of granularity. The contribution of this paper is two-fold. 1) We propose a method for similarity-based grouping and 2) we show results from test cases. As the main steps the method contains specification of grouping rules, pairwise grouping between entries, actual grouping of similar entries, and evaluation and analysis of the results. Often, different strategies can be used in the different steps. The method enables exploration of the influence of the choices and supports evaluation of the results with respect to given classifications. The grouping method is illustrated by test cases based on different strategies and classifications. The results show the complexity of the similarity-based grouping tasks and give deeper insights in the selected grouping tasks, the analyzed data source, and the influence of different strategies on the results.
“…To evaluate the results of the test cases we used external quality measures described in [Str02], purity, F-measure, entropy and mutual information. These measures are defined as follows.…”
Abstract. Similarity-based grouping of data entries in one or more data sources is a task underlying many different data management tasks, such as, structuring search results, removal of redundancy in databases and data integration. Similarity-based grouping of data entries is not a trivial task in the context of life science data sources as the stored data is complex, highly correlated and represented at different levels of granularity. The contribution of this paper is two-fold. 1) We propose a method for similarity-based grouping and 2) we show results from test cases. As the main steps the method contains specification of grouping rules, pairwise grouping between entries, actual grouping of similar entries, and evaluation and analysis of the results. Often, different strategies can be used in the different steps. The method enables exploration of the influence of the choices and supports evaluation of the results with respect to given classifications. The grouping method is illustrated by test cases based on different strategies and classifications. The results show the complexity of the similarity-based grouping tasks and give deeper insights in the selected grouping tasks, the analyzed data source, and the influence of different strategies on the results.
“…[Strehl, 2002] compares several metrics according to their different biases and scaling properties: purity and entropy are extreme cases where the bias is towards small clusters, because they reach a maximal value when all clusters are of size one. Combining precision and recall via a balanced F measure, on the other hand, favors coarser clusterings, and random clusterings do not receive zero values (which is a scaling problem).…”
There is a wide set of evaluation metrics available to compare the quality of text clustering algorithms. In this article, we define a few intuitive formal constraints on such metrics which shed light on which aspects of the quality of a clustering are captured by different metric families. These formal constraints are validated in an experiment involving human assessments, and compared with other constraints proposed in the literature. Our analysis of a wide range of metrics shows that only BCubed satisfies all formal constraints.We also extend the analysis to the problem of overlapping clustering, where items can simultaneously belong to more than one cluster. As Bcubed cannot be directly applied to this task, we propose a modified version of Bcubed that avoids the problems found with other metrics.
“…To compute purity (Strehl 2002), each induced phone ω l is assigned to the labeled phone c i whose frames are most frequent in ω l , and then the accuracy of this assignment is measured by counting the number of correctly assigned frames and dividing by the total number of frames N :…”
Section: Evaluation Measures For Segmentationmentioning
In this paper, we develop a new conceptual framework for an important problem in language acquisition, the correspondence problem: the fact that a given utterance has different manifestations in the speech and articulation of different speakers and that the correspondence of these manifestations is difficult to learn. We put forward the Correspondence-by-Segmentation Hypothesis, which states that correspondence is primarily learned by first segmenting speech in an unsupervised manner and then mapping the acoustics of different speakers onto each other. We show that a rudimentary segmentation of speech can be learned in an unsupervised fashion. We then demonstrate that, using the previously learned segmentation, different instances of a word can be mapped onto each other with high accuracy when trained on utterance-label pairs for a small set of words.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.