2020
DOI: 10.1007/978-3-030-47436-2_6

A Proximity Weighted Evidential k Nearest Neighbor Classifier for Imbalanced Data

Abstract: In the k Nearest Neighbor (kNN) classifier, a query instance is classified based on the most frequent class among its nearest neighbors in the training instances. On imbalanced datasets, kNN becomes biased towards the majority instances of the training space. To solve this problem, we propose a method called the Proximity weighted Evidential kNN classifier. In this method, each neighbor of a query instance is considered as a piece of evidence from which we calculate the probability of class label given feature values …
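Since the abstract is truncated, the sketch below is only an illustrative approximation of the idea described so far: each of the k nearest neighbors is treated as a piece of evidence whose strength decays with distance from the query, and the weighted evidence is pooled per class. The exponential weight, the parameter gamma, and the function name are assumptions for illustration, not the authors' exact evidential formulation.

```python
# Illustrative proximity-weighted kNN vote (not the paper's exact combination rule;
# the exponential proximity weight is an assumed choice).
import numpy as np

def proximity_weighted_knn_predict(X_train, y_train, x_query, k=5, gamma=1.0):
    # Euclidean distance from the query to every training instance
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nn_idx = np.argsort(dists)[:k]                # indices of the k nearest neighbors
    weights = np.exp(-gamma * dists[nn_idx])      # closer neighbors carry stronger evidence
    classes = np.unique(y_train)
    # Pool the proximity weights per class and normalize to a class support score
    support = np.array([weights[y_train[nn_idx] == c].sum() for c in classes])
    support = support / support.sum()
    return classes[np.argmax(support)], dict(zip(classes, support))
```

On an imbalanced dataset, such proximity weighting lets one very close minority-class neighbor outweigh several more distant majority-class neighbors, which is the intuition the abstract points to.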

Cited by 8 publications (8 citation statements)
References 17 publications (22 reference statements)
“…Selection of the informative features in the MGS step was implemented using MATLAB, and ranking of these informative features in the MGS_f and MGS_rf steps was implemented in Python with the scikit-learn package [ 41 ]. To evaluate the performance of the proposed and existing methods, different classifiers such as SVM, RF, XGBoost [ 42 ], and PE kNN [ 43 ] can be used. In this paper, we use only two simple classifiers, namely SVM (linear kernel) and Random Forest, to compare the different methods.…”
Section: Methodsmentioning
confidence: 99%
“…Many references confirm that the performance of kNN is affected by data imbalance. 21,24,33 Conceptually, the kNN algorithm calculates the (Euclidean) distances between a validation sample (i.e., the one to be labeled) and the training-set observations and assigns to it the majority label among its k nearest neighbors. Therefore, if the observations of each class (in the feature space of the dataset) lie close to one another and a small parameter K is selected, the performance of this algorithm will not be affected by data imbalance.…”
Section: Knn Parameter Selectionmentioning
confidence: 99%
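To make the quoted description concrete, here is a minimal, standard kNN majority-vote sketch (plain Euclidean distance, unweighted vote); the names are illustrative and it is not tied to any specific method cited above.

```python
# Minimal standard kNN: Euclidean distances, unweighted majority vote.
# On imbalanced data this vote tends to favor the majority class.
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    dists = np.linalg.norm(X_train - x_query, axis=1)    # distance to every training point
    nn_idx = np.argsort(dists)[:k]                        # indices of the k closest points
    return Counter(y_train[nn_idx]).most_common(1)[0][0]  # most frequent neighbor label
```

With a small k and well-separated classes the vote is driven by genuinely local neighbors, which is why the quoted passage notes that imbalance matters less in that regime.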
“…Although kNN is a simple and accurate algorithm, it has some weaknesses, such as being biased toward majority observations when facing data imbalance. 21 Several lines of research have sought to improve kNN performance under data imbalance by using oversampling, 22 a boosting-by-resample strategy, 23 misclassification costs, 24 and so forth. The algorithm has been used for fault detection in a wide range of domains, such as power systems, 25,26 railway point systems, 27 nuclear power plants, 28 and especially WTs; 20,[29][30][31] however, the focus of previous FDI works is not mainly on the data imbalance challenge.…”
Section: Introductionmentioning
confidence: 99%
See 1 more Smart Citation
“…Although the kNN algorithm is a versatile technique for classification tasks, it has some drawbacks, such as the lack of a reliable way of choosing the k parameter, sensitivity to the similarity (distance) function used (Kotsiantis, Zaharakis & Pintelas, 2006), and the large amount of storage required for large datasets (Harrington, 2012). As kNN takes the most frequent class among the nearest neighbors, it is intuitive to conclude that for imbalanced datasets the method will bias the results towards the majority class in the training dataset (Kadir et al, 2020).…”
Section: K-nearest Neighborsmentioning
confidence: 99%