2016
DOI: 10.1186/s40064-016-2941-7

The distance function effect on k-nearest neighbor classification for medical datasets

Abstract: Introduction: K-nearest neighbor (k-NN) classification is a conventional non-parametric classifier that has been used as the baseline classifier in many pattern classification problems. It is based on measuring the distances between the test data and each of the training data to decide the final classification output. Case description: Since the Euclidean distance function is the most widely used distance metric in k-NN, no study examines the classification performance of k-NN by different distance functions, espec…
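The decision rule the abstract describes (measure the distance to every training point, then vote among the nearest) is small enough to sketch. The following is an illustrative Python reconstruction, not code from the paper; `knn_predict` and its parameters are names chosen here, and the pluggable `distance` argument mirrors the paper's focus on swapping distance functions:

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3, distance=None):
    """Classify x_test by majority vote among its k nearest training points."""
    if distance is None:
        # Default to Euclidean distance, the metric the paper calls most widely used.
        distance = lambda a, b: np.sqrt(np.sum((a - b) ** 2))
    dists = [distance(x, x_test) for x in X_train]   # distance to every training point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)     # tally class labels among them
    return votes.most_common(1)[0][0]

# Toy usage: two classes in 2-D, query point near class 0.
X = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 9.0]])
y = np.array([0, 0, 1])
print(knn_predict(X, y, np.array([1.1, 0.9]), k=3))  # -> 0
```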

Cited by 326 publications (178 citation statements). References 6 publications.
“…The basis of kNN is the assumption that biologically similar samples will have similar measured values across most of their metabolites (Troyanskaya et al., 2001). To impute a missing value for one target sample, the k most similar samples are found based on a defined distance metric calculated using the values of metabolites that are present in both the target sample and a candidate neighbor sample (Hu et al., 2016; Kim et al., 2005). Here we test kNN with Euclidean distance in depth for all methods of missing value generation and also examine kNN with Pearson correlation when comparing NS-kNN to KNN-TN using the more realistic MM approach.…”
Section: Methods
confidence: 99%
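As a concrete reading of the imputation scheme this excerpt describes, here is a minimal Python sketch. It is not the NS-kNN or KNN-TN implementation; the conventions are assumptions made here: NaN marks a missing value, distances are root-mean-square over the shared metabolites so overlaps of different sizes stay comparable, and the imputed value is the plain mean over the k neighbors:

```python
import numpy as np

def knn_impute_value(data, target, feature, k=5):
    """Impute data[target, feature] from the k most similar samples.

    Similarity uses only metabolites observed (non-NaN) in BOTH the
    target sample and each candidate neighbor, as the excerpt describes.
    """
    t = data[target]
    candidates = []
    for c in range(data.shape[0]):
        if c == target or np.isnan(data[c, feature]):
            continue  # a neighbor must itself have the value we want to impute
        shared = ~np.isnan(t) & ~np.isnan(data[c])  # metabolites present in both
        if not shared.any():
            continue
        d = np.sqrt(np.mean((t[shared] - data[c, shared]) ** 2))
        candidates.append((d, data[c, feature]))
    candidates.sort(key=lambda pair: pair[0])        # closest samples first
    return float(np.mean([v for _, v in candidates[:k]]))
```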
“…Finally, for many applications, it is important to define a similarity or distance measure between two data points in the feature space. The simplest distance measure would be the Euclidean distance

$$d(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$

between the numerical feature vectors of two data points A and B, for features i = 1 … n, but depending on the type of data we are dealing with there can be many other and sometimes much more complex distance or similarity measures, such as cosine similarity or similarity scores of two biological sequences.…”
Section: Data and Features
confidence: 99%
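The two measures this excerpt names are one-liners in practice; a sketch (the function names are ours):

```python
import numpy as np

def euclidean(a, b):
    # d(A, B) = sqrt(sum_i (a_i - b_i)^2), the formula quoted above
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_similarity(a, b):
    # Angle-based similarity in [-1, 1]; it ignores vector magnitude,
    # which is why it can behave very differently from Euclidean distance.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```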
“…There are different ways in which the distance (degree of similarity) can be computed, and this often depends on the nature of the data. The Euclidean distance measure is the most popular, but others such as the Chi-square distance, the Minkowski distance, and the cosine similarity measure also exist [7].…”
Section: Introduction
confidence: 99%
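For reference, sketches of the two less common measures the excerpt lists. The Chi-square form below is one common variant intended for non-negative features such as histograms; the function names and the eps guard are choices made here:

```python
import numpy as np

def minkowski(a, b, p=2):
    # Generalizes several metrics: p = 1 is Manhattan, p = 2 is Euclidean.
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def chi_square(a, b, eps=1e-12):
    # 0.5 * sum_i (a_i - b_i)^2 / (a_i + b_i); eps avoids division by zero.
    return 0.5 * np.sum((a - b) ** 2 / (a + b + eps))
```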
“…The kNN algorithm is often said to be a lazy machine learning classifier, as there is no training per se [7]; that is, no learning takes place and no model is actually built, since it is an example-based classifier.…”
Section: Introduction
confidence: 99%
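Because the classifier is lazy, "fitting" amounts to storing (or at most indexing) the training examples. scikit-learn's KNeighborsClassifier, used here only to illustrate the excerpt's point, makes this visible: all distance computation is deferred to predict() time:

```python
from sklearn.neighbors import KNeighborsClassifier

# fit() just stores/indexes the training data; no model parameters are learned.
clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
clf.fit([[0, 0], [1, 1], [5, 5]], [0, 0, 1])
print(clf.predict([[0.5, 0.5]]))  # -> [0]
```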