A supervised learning approach for detecting erroneous samples in embeddings

Saygılı, Görkem

doi:10.3906/elk-1909-162

Cited by 2 publications

(3 citation statements)

References 20 publications

(31 reference statements)


“…However, when we examine an embedding perceptually, what we consider as an erroneous sample is the one that does not belong to the class of the majority of its neighbors. In a previous study [29], an error detection algorithm based on classification was presented for dimensionality reduction. We advocate that a binary classifier would be inferior than a regressor, since there is no threshold value suitable for every dataset.…”

Section: Discussion (mentioning)

confidence: 99%
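
The label-based notion of an erroneous sample in the statement above (a sample that does not belong to the class of the majority of its neighbors) can be illustrated with a minimal sketch. The function name `flag_erroneous`, the choice of `k`, the Euclidean distance, and the toy data are all assumptions made for illustration; they are not taken from the cited papers.

```python
import math
from collections import Counter

def flag_erroneous(points, labels, k=3):
    """Return one boolean per sample: True where the sample's label differs
    from the majority label of its k nearest neighbours in the embedding.
    (Hypothetical helper; k and the Euclidean metric are assumptions.)"""
    flags = []
    for i, p in enumerate(points):
        # Indices of all other samples, sorted by distance to sample i.
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        neighbour_labels = [labels[j] for j in others[:k]]
        majority_label, _ = Counter(neighbour_labels).most_common(1)[0]
        flags.append(labels[i] != majority_label)
    return flags

# Toy 2-D embedding: two tight clusters, plus one "b" sample (index 4)
# that has been placed in the middle of the "a" samples.
embedding = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
             (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
classes = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]
flags = flag_erroneous(embedding, classes, k=3)
```

A hard flag of this kind is a binary decision; the citing authors argue for a regressor instead, which would replace the boolean with a continuous score.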

“…In the literature, there are many studies aiming to generate confidence and detect errors for various domains such as medical image registration [22][23][24][25] and stereo matching [26][27][28]. Recently, DR becomes also the focus of error estimation research [29]. Evaluating and comparing the embeddings are typically done qualitatively, by placing projections side by side and letting human judgment to determine which projection is the best.…”

Section: Introduction (mentioning)

confidence: 99%

Confidence estimation for t-SNE embeddings using random forest

Yigin

Saygılı

2022

Int. J. Mach. Learn. & Cyber.

Dimensionality reduction algorithms are commonly used for reducing the dimension of multi-dimensional data to visualize them on a standard display. Although many dimensionality reduction algorithms such as the t-distributed Stochastic Neighbor Embedding (t-SNE) aim to preserve close neighborhoods in low-dimensional space, they might not accomplish that for every sample of the data and eventually produce erroneous representations. In this study, we developed a supervised confidence estimation algorithm for detecting erroneous samples in embeddings. Our algorithm generates a confidence score for each sample in an embedding based on a distance-oriented score and a random forest regressor. We evaluate its performance on both intra- and inter-domain data and compare it with the neighborhood preservation ratio as our baseline. Our results showed that the resulting confidence score provides distinctive information about the correctness of any sample in an embedding compared to the baseline. The source code is available at https://github.com/gsaygili/dimred.
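
The abstract names the neighborhood preservation ratio as its baseline but does not give a formula here. The sketch below assumes one common definition: for each sample, the fraction of its k nearest neighbors in the original space that are also among its k nearest neighbors in the embedding. The function names, the value of `k`, and the toy data are illustrative assumptions, not the authors' implementation.

```python
import math

def knn_indices(points, i, k):
    """Set of indices of the k nearest neighbours of sample i (Euclidean)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:k])

def neighborhood_preservation(high, low, k):
    """Per-sample fraction of high-dimensional k-NN kept in the embedding.
    (Assumed definition; the paper's exact baseline may differ.)"""
    return [len(knn_indices(high, i, k) & knn_indices(low, i, k)) / k
            for i in range(len(high))]

# Toy data: four 3-D points and two 1-D "embeddings" of them.
high = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (7.0, 0.0, 0.0)]
low_good = [(0.0,), (1.0,), (3.0,), (7.0,)]   # keeps every neighbourhood
low_bad = [(0.0,), (1.0,), (3.0,), (0.2,)]    # last sample is misplaced

good_scores = neighborhood_preservation(high, low_good, k=2)
bad_scores = neighborhood_preservation(high, low_bad, k=2)
```

In this toy case the misplaced sample lowers the score of every sample to 0.5, not only its own, which hints at why a purely neighborhood-based score may be a blunt indicator of which individual sample is wrong.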


“…Ranking-based metrics [5, 6] focus on retaining local neighborhood rankings in high and low dimensions instead of considering the preservation of ground truth target labels (label-based). There are also some label-based error detection and confidence estimation methods that have been developed specifically for t-SNE embeddings [7, 8], in a similar way to those in many other domains such as medical image registration [9, 10] and stereo matching [11]. What makes the label-based confidence estimation algorithm [8] unique is that it generates confidence scores for each and every sample in a t-SNE embedding with a supervised Random Forest (RF) regression algorithm based on target class labels.…”

Section: Introduction (mentioning)

confidence: 99%

Effect of distance measures on confidences of t-SNE embeddings and its implications on clustering for scRNA-seq data

Yigin

Saygılı

2023

Sci Rep

Arguably one of the most famous dimensionality reduction algorithms of today is t-distributed stochastic neighbor embedding (t-SNE). Although widely used for the visualization of scRNA-seq data, it is prone to errors, as any algorithm is, and may lead to inaccurate interpretations of the visualized data. A reasonable way to avoid misinterpretations is to quantify the reliability of the visualizations. The focus of this work is first to find the best possible way to predict sample-based confidence scores for t-SNE embeddings and next, to use these confidence scores to improve the clustering algorithms. We adopt an RF regression algorithm using seven distance measures as features to obtain sample-based confidence scores under a variety of distance measures. The best configuration is used to assess the clustering improvement using K-means and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) based on Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and accuracy (ACC) scores. The experimental results show that distance measures have a considerable effect on the precision of confidence scores and that clustering performance can be improved substantially if these confidence scores are incorporated before the clustering algorithm. Our findings reveal the usefulness of these confidence scores in downstream analyses of scRNA-seq data.
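
For readers unfamiliar with the ARI score mentioned in the abstract, the sketch below computes the adjusted Rand index from its standard pair-counting formula. It is written from the textbook definition, not from the paper's code; the function name `adjusted_rand_index` and the toy labelings are illustrative assumptions.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same samples,
    using the standard contingency-table (pair-counting) formula."""
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: treat as trivially identical
    # Pairs that fall in the same cell of the contingency table.
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    # Pairs that share a cluster within each labeling separately.
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one cluster
    return (sum_cells - expected) / (max_index - expected)

# Same partition with renamed clusters, and a partition that cuts across it.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The score depends only on which samples are grouped together, so renaming the clusters leaves it unchanged, while a labeling that splits every true cluster scores below zero.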
