2013
DOI: 10.1186/1471-2105-14-64

Improved shrunken centroid classifiers for high-dimensional class-imbalanced data

Abstract: Background: PAM, a nearest shrunken centroid (NSC) method, is a popular classification method for high-dimensional data. ALP and AHP are NSC algorithms that were proposed to improve upon PAM. The NSC methods base their classification rules on shrunken centroids; in practice, the amount of shrinkage is estimated by minimizing the overall cross-validated (CV) error rate. Results: We show that when data are class-imbalanced, the three NSC classifiers are biased towards the majority class. The bias is larger when the number…
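The shrinkage rule the abstract refers to can be sketched as follows. This is a minimal illustration of the nearest-shrunken-centroid idea only: it soft-thresholds each class centroid's deviation from the overall mean and classifies by nearest shrunken centroid. PAM's pooled within-class standard-deviation scaling and class priors are omitted for brevity, and all names here are illustrative, not the paper's implementation.

```python
import numpy as np

def fit_shrunken_centroids(X, y, delta):
    """Soft-threshold each class centroid's deviation from the overall mean."""
    overall = X.mean(axis=0)
    centroids = {}
    for c in np.unique(y):
        d = X[y == c].mean(axis=0) - overall                  # per-feature deviation
        d = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)   # shrink toward zero
        centroids[c] = overall + d
    return centroids

def predict(X, centroids):
    """Assign each sample to the class with the nearest shrunken centroid."""
    classes = list(centroids)
    dists = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in classes])
    return np.array(classes)[dists.argmin(axis=0)]

# Toy two-class data: only the first of five features carries signal.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 5)), rng.normal(0, 1, (20, 5))])
X[20:, 0] += 3.0
y = np.array([0] * 20 + [1] * 20)
cents = fit_shrunken_centroids(X, y, delta=0.5)
print((predict(X, cents) == y).mean())  # training accuracy on separable data
```

With a balanced toy set the classifier behaves well; the paper's point is that when the classes are imbalanced, choosing `delta` by minimizing overall CV error biases predictions toward the majority class.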

Cited by 39 publications (26 citation statements)
References 25 publications
“…At the data level, sample rescaling and resampling strategies have been used to balance data by changing the distribution of samples in different classes, including oversampling (SMOTE) and undersampling (RUS) methods. [47] At the algorithmic level, a cost-sensitive learning approach (class weight) has also been attempted by setting an excessive cost function for misclassification of a minority class sample. [48] In addition, an ensemble classifier combined with a resampling method such as SMOTE + ENN, which is a novel and promising route to reduce the influence of information loss or overfitting, was used for comparison.…”
Section: Discussion
confidence: 99%
“…Consequently, the constructed model has low-quality prediction because all objects are assigned to the dominant, negative 2 , class, regardless of the value of the feature vector [19]. The bias of classification of the imbalanced data in favor of the majority class is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples [4,17,26,30,69]. And in the vast majority of medical datasets just such a situation occurs [43].…”
Section: Learning From Imbalanced Data
confidence: 96%
“…It happens that the disproportion of samples from each class is on the order of 100:1, 1 000:1 or even 10 000:1 [26]. The usage of conventional learning methods on imbalanced data results in constructing a decision model biased toward the majority class, which is predominant in the training set [4,26,39,41,52]. Consequently, the constructed model has low-quality prediction because all objects are assigned to the dominant, negative 2 , class, regardless of the value of the feature vector [19].…”
Section: Learning From Imbalanced Data
confidence: 99%
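The failure mode these citations describe is easy to make concrete: on 100:1 data, a degenerate model that always predicts the majority class scores about 99% overall accuracy while detecting zero minority samples, which is why the overall (CV) error rate is a misleading criterion here. A tiny demonstration:

```python
import numpy as np

y_true = np.array([0] * 100 + [1] * 1)   # 100:1 imbalance, as in the quote
y_pred = np.zeros_like(y_true)           # "always predict majority" model

accuracy = (y_pred == y_true).mean()                 # ~0.99, looks excellent
minority_recall = (y_pred[y_true == 1] == 1).mean()  # 0.0, detects nothing
print(round(accuracy, 3), minority_recall)
```

This is exactly why the paper argues for class-specific error rates rather than the overall CV error when tuning the shrinkage threshold.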
“…Additionally, class imbalance is an important consideration in classification of biomedical data, and there are techniques [4] which incorporate class distribution within the classification algorithm. Our approach is different in that we separate the classification from data preprocessing where we assume class imbalance is to be handled.…”
Section: Introduction
confidence: 99%