Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence 2017
DOI: 10.24963/ijcai.2017/302
Affinity Learning for Mixed Data Clustering

Abstract: In this paper, we propose a novel affinity learning based framework for mixed data clustering, which includes: how to process data with mixed-type attributes, how to learn affinities between data points, and how to exploit the learned affinities for clustering. In the proposed framework, each original data attribute is represented with several abstract objects defined according to the specific data type and values. Each attribute value is transformed into the initial affinities between the data point and the a…

Cited by 8 publications (3 citation statements)
References 4 publications
“…The two latter packages provide both a large number of clustering and cluster stability assessment methods and functions to compute dissimilarity matrices and describe the results). Another known alternative consists of one-hot-encoding categorical data into binary variables and treating the latter as continuous (e.g., Li and Latecki, 2017 ). It is, however, necessary to down-weight the variables obtained, so that no more weight is given to the original variables with more modalities.…”
Section: Statistical Rationale and Literature Review on Data Clustering
confidence: 99%
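The one-hot-encoding-with-down-weighting alternative described in the statement above can be sketched as follows. This is a minimal illustration, not code from the cited paper; the 1/sqrt(k) scaling is one common choice for keeping a k-modality categorical variable from outweighing a single continuous variable in Euclidean distances:

```python
import numpy as np

def one_hot_downweighted(values):
    """One-hot encode a 1-D categorical sequence, scaling each binary
    column by 1/sqrt(k), where k is the number of modalities, so the
    variable's total contribution to squared distances stays bounded
    regardless of how many modalities it has."""
    categories = sorted(set(values))
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    onehot = np.zeros((len(values), k))
    for row, v in enumerate(values):
        onehot[row, index[v]] = 1.0
    return onehot / np.sqrt(k)

X = one_hot_downweighted(["red", "blue", "red", "green"])
# Two points with different categories now contribute a squared
# Euclidean distance of 2/k instead of 2 on this variable.
```

Without the scaling, a variable with many modalities contributes a constant squared distance of 2 whenever two points differ, which systematically over-weights it relative to standardized continuous variables.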
“…Another known alternative consists of one-hot-encoding categorical data into binary variables and treating the latter as continuous (e.g. in Li et al 2017). It is however necessary to downweigh the obtained variables so that no more weight is given to the original variables with more modalities.…”
Section: Choosing an Appropriate Clustering Approach
confidence: 99%
“…Positive unlabelled (PU) inference is based on data sets containing labelled observations (S = 1) which are all positive (Y = 1), and unlabelled ones (S = 0) which may either belong to a positive or a negative class (Y is either 1 or 0). Examples of such experimental setup abound in medicine [36,22,6,38], text and image analysis [9,27,26,15], ecology [37,29] and survey data [33]. For example, medical databases may contain only information about diagnosed patients who have a certain disease (S = 1) whereas un-diagnosed patients (S = 0) may have it or not.…”
Section: Introduction
confidence: 99%
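The positive-unlabelled (PU) labelling scheme described in the statement above can be illustrated with a small synthetic sketch. The data and the 30% labelling rate are hypothetical, chosen only to show the structural constraint that labelled examples (S = 1) are always positive (Y = 1), while unlabelled examples (S = 0) mix both classes:

```python
import numpy as np

rng = np.random.default_rng(0)

# True (hidden) class labels: Y = 1 is positive, Y = 0 is negative.
Y = rng.integers(0, 2, size=1000)

# Labelling mechanism: only positives can be labelled, and only a
# fraction (here 30%, an assumed rate) of them actually are.
S = np.where((Y == 1) & (rng.random(1000) < 0.3), 1, 0)

# Structural properties of the PU setting:
#   - every labelled point is positive
#   - the unlabelled pool contains both positives and negatives
labelled_all_positive = bool(np.all(Y[S == 1] == 1))
unlabelled_classes = set(Y[S == 0].tolist())
```

This mirrors the medical example in the quote: diagnosed patients correspond to S = 1 (and necessarily Y = 1), while undiagnosed patients (S = 0) may or may not have the disease.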