Classification performance of three approaches for combining data sampling and gene selection on bioinformatics data

Khoshgoftaar, Taghi M.; Fazelpour, Alireza; Dittman, David J.; Napolitano, Amri

doi:10.1109/iri.2014.7051906

Cited by 9 publications

(3 citation statements)

References 15 publications


“…The findings of this work are partially consistent with the results recently discussed in [21], where the effectiveness of combining RUS and feature selection is evaluated in conjunction with different classifiers and feature selection methods, but within less severe imbalance settings (min_pct > 10%). The beneficial impact of sampling-based approaches on high-dimensional bioinformatics datasets is also explored in [30]-[32]. In particular, [30] relies on both RUS and feature selection, and investigates the extent to which the order of these pre-processing operations impacts on the classification results.…”

Section: Discussion (mentioning)

confidence: 99%

“…The beneficial impact of sampling-based approaches on high-dimensional bioinformatics datasets is also explored in [30]-[32]. In particular, [30] relies on both RUS and feature selection, and investigates the extent to which the order of these pre-processing operations impacts on the classification results. As well, [31] exploits both RUS and feature selection and shows that using fully balanced data significantly improves the SVM performance in protein function prediction tasks.…”

Section: Discussion (mentioning)

confidence: 99%

“…While several research efforts have explored the issues of high dimensionality and class imbalance independently, only a few studies have addressed both the problems simultaneously [30]-[35]. Since several biomedical datasets are both high-dimensional and class-imbalanced, the aim of this work is to investigate the effectiveness of learning strategies that are designed to handle simultaneously both the issues, in order to effectively deal with real-world problems that involve the classification of rare pathological conditions (e.g., rare cancer types).…”

Section: Introduction (mentioning)

confidence: 99%


Learning From High-Dimensional Biomedical Datasets: The Issue of Class Imbalance

Pes

2020

IEEE Access


As witnessed by a vast corpus of literature, dimensionality reduction is a fundamental step for biomedical data analysis. Indeed, in this domain, there is often the need for coping with a huge number of data attributes (or features). By removing irrelevant or redundant attributes, feature selection techniques can significantly reduce the complexity of the original problem, with important benefits in terms of domain understanding and knowledge discovery. When learning from biomedical data, however, the dimensionality issue is often addressed without a joint consideration of other critical aspects that may compromise the performance of the induced models. The adverse implications of an imbalanced class distribution, for example, are often neglected in this domain. The aim of this work is to investigate the effectiveness of hybrid learning strategies that incorporate both methods for dimensionality reduction and methods for alleviating the issue of class imbalance. Specifically, we combine different feature selection techniques, both univariate and multivariate, with sampling-based class balancing methods and cost-sensitive classification. The performance of the resulting learning schemes is experimentally evaluated on six high-dimensional genomic benchmarks, using different classification algorithms, with interesting insight about the best strategies to use based on the characteristics of the data at hand.
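To make the idea of pairing class balancing with feature selection concrete, here is a minimal sketch of one possible ordering: random undersampling (RUS) first, then a univariate feature ranking on the balanced data. It is an illustration only, not the procedure of any of the cited papers. The function names (`random_undersample`, `rank_features`, `sample_then_select`), the signal-to-noise score used as the ranker, and the toy data are all assumptions introduced for this example.

```python
import random
from statistics import mean, pstdev


def random_undersample(X, y, seed=0):
    """Randomly drop rows until every class has as many rows as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    keep = []
    for idx in by_class.values():
        keep.extend(rng.sample(idx, n_min))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


def rank_features(X, y, positive=1):
    """Rank features by a simple signal-to-noise score (binary classes only).

    This score is a stand-in chosen for brevity; any univariate ranker
    could be substituted here.
    """
    pos = [row for row, label in zip(X, y) if label == positive]
    neg = [row for row, label in zip(X, y) if label != positive]
    scores = []
    for j in range(len(X[0])):
        a = [r[j] for r in pos]
        b = [r[j] for r in neg]
        spread = pstdev(a) + pstdev(b)
        scores.append(abs(mean(a) - mean(b)) / (spread + 1e-12))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def sample_then_select(X, y, k, seed=0):
    """Balance the classes first, then keep the k top-ranked features."""
    Xb, yb = random_undersample(X, y, seed)
    top = rank_features(Xb, yb)[:k]
    return top, [[row[j] for j in top] for row in Xb], yb


# Toy data (hypothetical): 8 minority vs 32 majority samples, 5 features.
# Only feature 0 carries class signal; the rest are noise.
rng = random.Random(42)
X, y = [], []
for i in range(40):
    label = 1 if i < 8 else 0
    X.append([label * 5.0 + rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(4)])
    y.append(label)

top, X_red, y_bal = sample_then_select(X, y, k=2)
print("selected feature indices:", top)
print("balanced class counts:", y_bal.count(1), y_bal.count(0))
```

Swapping the two steps (ranking features on the full imbalanced data, then undersampling) gives the alternative ordering; which one works better is an empirical question that depends on the dataset, the ranker, and the classifier.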
