A New Distance Metric for Unsupervised Learning of Categorical Data

Jia, Hong; Cheung, Yiu-ming; Liu, Jiming

doi:10.1109/tnnls.2015.2436432

Cited by 86 publications

(24 citation statements)

References 34 publications

“…The result illustrated the presented approach offered better generalization. A new distance metric for processing the categorical data was utilized in this work by using an unsupervised learning technique [29]. Also, various distance metrics have been investigated in this work, which included hamming distance, modified value difference metric, Ahmad's distance metric, association based distance metric, and content-based distance metric.…”

Section: Clustering (mentioning)

confidence: 99%

An Efficient Technique for Disease Prediction by Using Enhanced Machine Learning Algorithms for Categorical Medical Dataset

Anusuya

Gomathi

2021

ITC

In the 20th century, it is evident that there is a massive evolution of chronic diseases. Data mining approaches are beneficial in making some medicinal decisions for curing diseases. But medical data may consist of a large volume of data, which makes the prediction process very difficult. Also, in the medical field, datasets may include both small and extensive databases. This makes the study of disease prediction mechanisms complex. Hence, in this paper, we intend to use a practical machine learning approach for disease prediction of both large and small datasets. Among the various machine learning procedures, classification and clustering methods play a significant role. Therefore, we introduced the enhanced classification and clustering approach in this work for obtaining better accuracy results for disease prediction. In this proposed method, a process of preprocessing is involved, followed by Eigen vector extraction, feature selection, and classification. Further, the most suitable features are selected with the use of Multi-Objective based Ant Colony Optimization (MO-ACO) from the extracted features for improving the classification and clustering. Here we have shown the novelty in every stage of the implementation, such as feature selection, feature extraction, and the final prediction stage. The proposed method will be compared with the existing technique on the measures of precision, NMI, execution time, recall, and accuracy. Here we conclude with the solution having more accuracy for both small and large datasets.
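
The citation statement for this entry names Hamming distance among the metrics investigated for categorical data. As a minimal sketch of that baseline only (not the metric proposed by Jia, Cheung and Liu, and not code from the citing paper), the function below counts the attributes on which two objects disagree. The function name and the sample records are hypothetical.

```python
def hamming_distance(a, b):
    """Number of attributes on which two categorical objects differ."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    # Each mismatching attribute contributes 1, regardless of which values differ.
    return sum(x != y for x, y in zip(a, b))


# Hypothetical records: (colour, size, shape)
first = ("red", "small", "round")
second = ("red", "large", "round")
mismatches = hamming_distance(first, second)
```

Because every mismatch counts equally, this baseline ignores how often values occur and how attributes relate to one another, which is the limitation the statements on this page attribute to simple measures.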

“…The main drawback of metrics based on co-occurrence is the assumption of an intrinsic dependency between attributes without considering their relevance. The work presented by Ienco, Pensa & Meo (2012) and Jia, Cheung & Liu (2015) use the notion of contexts to evaluate pairs of categories. A context is an additional dimension used to determine the similarity between pairs.…”

Section: Patient Similarity and Distance Measures For Categorical Events (mentioning)

confidence: 99%

A visual analytic approach for the identification of ICU patient subpopulations using ICD diagnostic codes

Alcaide

Aerts

2021

PeerJ Computer Science

A large number of clinical concepts are categorized under standardized formats that ease the manipulation, understanding, analysis, and exchange of information. One of the most extended codifications is the International Classification of Diseases (ICD) used for characterizing diagnoses and clinical procedures. With formatted ICD concepts, a patient profile can be described through a set of standardized and sorted attributes according to the relevance or chronology of events. This structured data is fundamental to quantify the similarity between patients and detect relevant clinical characteristics. Data visualization tools allow the representation and comprehension of data patterns, usually of a high dimensional nature, where only a partial picture can be projected. In this paper, we provide a visual analytics approach for the identification of homogeneous patient cohorts by combining custom distance metrics with a flexible dimensionality reduction technique. First, we define a new metric to measure the similarity between diagnosis profiles through the concordance and relevance of events. Second, we describe a variation of the Simplified Topological Abstraction of Data (STAD) dimensionality reduction technique to enhance the projection of signals preserving the global structure of data. The MIMIC-III clinical database is used for implementing the analysis into an interactive dashboard, providing a highly expressive environment for the exploration and comparison of patient groups with at least one identical diagnostic ICD code. The combination of the distance metric and STAD not only allows the identification of patterns but also provides a new layer of information to establish additional relationships between patient cohorts. The method and tool presented here add a valuable new approach for exploring heterogeneous patient populations. In addition, the distance metric described can be applied in other domains that employ ordered lists of categorical data.

“…This approach yields three main kinds of distance relation. One is based on probability, which includes similarity relations that are information-theoretic centered, for example [2][3][4][5][6]; the next is based on the attribute space, for example [7][8][9]; and the other amounts to a specialization of a standard measure, such as Euclidean or Manhattan distance. All these measures overlook attribute interdependence, which, as noted in [10], may provide valuable information when capturing per-attribute object similarity.…”

Section: Introduction (mentioning)

confidence: 99%

Learning-Based Dissimilarity for Clustering Categorical Data

et al. 2021

Comparing data objects is at the heart of machine learning. For continuous data, object dissimilarity is usually taken to be object distance; however, for categorical data, there is no universal agreement, for categories can be ordered in several different ways. Most existing category dissimilarity measures characterize the distance among the values an attribute may take using precisely the number of different values the attribute takes (the attribute space) and the frequency at which they occur. These kinds of measures overlook attribute interdependence, which may provide valuable information when capturing per-attribute object dissimilarity. In this paper, we introduce a novel object dissimilarity measure that we call Learning-Based Dissimilarity, for comparing categorical data. Our measure characterizes the distance between two categorical values of a given attribute in terms of how likely it is that such values are confused or not when all the dataset objects with the remaining attributes are used to predict them. To that end, we provide an algorithm that, given a target attribute, first learns a classification model in order to compute a confusion matrix for the attribute. Then, our method transforms the confusion matrix into a per-attribute dissimilarity measure. We have successfully tested our measure against 55 datasets gathered from the University of California, Irvine (UCI) Machine Learning Repository. Our results show that it surpasses, in terms of various performance indicators for data clustering, the most prominent distance relations put forward in the literature.
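
The abstract above outlines a procedure: predict a target attribute from the remaining attributes, build a confusion matrix, and turn it into a per-attribute dissimilarity. The sketch below illustrates that general idea only. The abstract does not specify the classifier or the transformation, so both are assumptions here: a leave-one-out nearest-neighbour vote stands in for the learned model, and one minus the mean confusion rate stands in for the transformation. The function name and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def confusion_dissimilarity(rows, target):
    """Dissimilarity between values of attribute `target`, derived from how
    often a simple classifier confuses them (illustrative assumption only)."""
    values = sorted({row[target] for row in rows})
    confusion = {a: Counter() for a in values}  # confusion[true][predicted]

    for i, row in enumerate(rows):
        # Mismatch count to every other object, ignoring the target attribute.
        scored = []
        for j, other in enumerate(rows):
            if i == j:
                continue
            d = sum(x != y
                    for k, (x, y) in enumerate(zip(row, other))
                    if k != target)
            scored.append((d, other[target]))
        # Assumed classifier: majority vote among the closest objects.
        nearest = min(d for d, _ in scored)
        votes = Counter(label for d, label in scored if d == nearest)
        predicted = votes.most_common(1)[0][0]
        confusion[row[target]][predicted] += 1

    totals = {a: sum(confusion[a].values()) for a in values}
    dissim = {(a, a): 0.0 for a in values}
    for a, b in combinations(values, 2):
        # Assumed transformation: values that are often confused are close.
        rate = (confusion[a][b] / totals[a] + confusion[b][a] / totals[b]) / 2
        dissim[(a, b)] = dissim[(b, a)] = 1.0 - rate
    return dissim


# Toy data: attribute 0 is the target; "x" and "y" share the same context,
# while "z" always appears with different values in the other attributes.
rows = [("x", "a", "p"), ("y", "a", "p"), ("x", "a", "p"),
        ("y", "a", "p"), ("z", "b", "q"), ("z", "b", "q")]
d = confusion_dissimilarity(rows, target=0)
```

In this toy data the remaining attributes cannot separate "x" from "y", so the two values come out as close, whereas "z" is never confused with either and ends up maximally distant.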
