2015
DOI: 10.1007/s10994-015-5495-y

On some transformations of high dimension, low sample size data for nearest neighbor classification

Abstract: For data with more variables than the sample size, phenomena like concentration of pairwise distances, violation of cluster assumptions and presence of hubness often have adverse effects on the performance of the classic nearest neighbor classifier. To cope with such problems, some dimension reduction techniques like those based on random linear projections and principal component directions have been proposed in the literature. In this article, we construct nonlinear transformations of the data based on inter…
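To make the "concentration of pairwise distances" phenomenon concrete, here is a minimal illustrative sketch (not taken from the paper; the synthetic data and names are ours): with the sample size held fixed, the relative contrast between the largest and smallest pairwise distances shrinks toward zero as the dimension grows.

```python
# Illustrative sketch only: distance concentration in high dimensions.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n = 50  # small sample size, as in the HDLSS setting

for d in (2, 15, 100, 1000, 10000):
    X = rng.standard_normal((n, d))
    dists = pdist(X)  # all pairwise Euclidean distances
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d = {d:>5}: relative contrast = {contrast:.3f}")
# The contrast decreases toward 0 as d grows: all points become nearly
# equidistant, which is what hurts the classic nearest neighbor rule.
```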

Cited by 17 publications (14 citation statements)
References 27 publications (33 reference statements)
“…Beyer et al (1999) show that distance concentration can occur with as few as 15 dimensions. See Dutta & Ghosh (2016) and Hall et al (2005).…”
Section: Comparison of Methods (mentioning)
confidence: 99%
“…In Table , we can see that increasing the number of trials increases the discriminant power in all classifiers. Dutta & Ghosh (2016) show the adverse effects of high dimensions on the performance of the classic NN classifier. This is also seen in Table .…”
Section: Simulation (mentioning)
confidence: 99%
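As a toy illustration of that adverse effect (a synthetic setup of our own, not the simulation from the citing paper), 1-NN accuracy on two Gaussian classes whose means differ only in one coordinate drifts toward chance as noise dimensions are added:

```python
# Illustrative sketch only: 1-NN accuracy degrading with dimension.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_per_class = 25  # HDLSS-style: few observations per class

for d in (2, 10, 100, 1000):
    shift = np.zeros(d)
    shift[0] = 2.0  # the signal lives in a single coordinate
    X = np.vstack([rng.standard_normal((n_per_class, d)),
                   rng.standard_normal((n_per_class, d)) + shift])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=5).mean()
    print(f"d = {d:>4}: 1-NN cross-validated accuracy = {acc:.2f}")
```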
“…Let Z be a new observation to be classified. Dutta and Ghosh [22] illustrate a transformation based on the IPDs to classify Z. For $N_x, N_y \ge 2$, these transformed data points are given as follows: $\mathbf{X}_i = \left( \|\mathbf{X}_i - \mathbf{X}_1\|_d, \ldots, \|\mathbf{X}_i - \mathbf{X}_{N_x}\|_d, \|\mathbf{X}_i - \mathbf{Y}_1\|_d, \ldots, \|\mathbf{X}_i - \mathbf{Y}_{N_y}\|_d \right)$, $\mathbf{Y}_j = \left( \|\mathbf{Y}_j - \mathbf{X}_1\|_d, \ldots, \|\mathbf{Y}_j - \mathbf{X}_{N_x}\|_d, \|\mathbf{Y}_j - \mathbf{Y}_1\|_d, \ldots, \|\mathbf{Y}_j - \mathbf{Y}_{N_y}\|_d \right)$.…”
Section: Applications (unclassified)
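The quoted transformation is easy to prototype. Below is a minimal sketch (function and variable names are ours, not the authors'): every observation, training or new, is replaced by the vector of its Euclidean distances to all N_x + N_y training points, and nearest neighbor classification is then carried out in that transformed space.

```python
# Sketch of the inter-point-distance (IPD) transformation quoted above.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def ipd_transform(points, train):
    """Map each row of `points` to its distances to every row of `train`."""
    # Broadcasting gives an array of shape (len(points), len(train)).
    return np.linalg.norm(points[:, None, :] - train[None, :, :], axis=2)

rng = np.random.default_rng(2)
d, n_x, n_y = 500, 20, 20                 # HDLSS: d much larger than n_x + n_y
X = rng.standard_normal((n_x, d))         # class 1 training sample
Y = rng.standard_normal((n_y, d)) + 0.3   # class 2 training sample (shifted)
train = np.vstack([X, Y])
labels = np.array([0] * n_x + [1] * n_y)
Z = rng.standard_normal((5, d)) + 0.3     # new observations to classify

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(ipd_transform(train, train), labels)  # NN in the (n_x + n_y)-dim space
print(clf.predict(ipd_transform(Z, train)))   # classify the transformed Z
```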
“…When a Big Data problem is presented as a domain with a large number of characteristics, dimensionality reduction approaches (Dutta and Ghosh) may be needed to accelerate distance computations in nearest neighbors classification. The locality-sensitive hashing (LSH) algorithm (Andoni and Indyk) is a well-known example that reduces the dimensionality of the data using hash functions, with the particularity of looking for collisions between instances that are similar.…”
Section: The k-NN Algorithm in Big Data: Current and Future Trends (mentioning)
confidence: 99%
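For intuition, here is a rough sketch of the random-hyperplane flavour of LSH (SimHash-style; a simplification rather than Andoni and Indyk's exact construction): nearby vectors are likely to share a bit signature, so candidate neighbors can be retrieved by a bucket lookup instead of a scan over all points.

```python
# Illustrative sketch only: random-hyperplane locality-sensitive hashing.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(3)
d, n_bits = 100, 16
planes = rng.standard_normal((n_bits, d))  # random hyperplane normals

def signature(x):
    """n_bits-bit hash: the sign pattern of x against the random planes."""
    return tuple((planes @ x > 0).astype(int))

# Index the dataset into hash buckets.
data = rng.standard_normal((1000, d))
buckets = defaultdict(list)
for i, x in enumerate(data):
    buckets[signature(x)].append(i)

# Query: only points that collide in the same bucket are examined further.
query = data[0] + 0.01 * rng.standard_normal(d)  # a near-duplicate of point 0
candidates = buckets[signature(query)]
print("candidates sharing the query's bucket:", candidates[:10])
```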