2016
DOI: 10.5120/ijca2016908317

Literature Review on Feature Selection Methods for High-Dimensional Data

Abstract: Feature selection plays a significant role in improving the performance of machine learning algorithms, both by reducing the time needed to build the learning model and by increasing the accuracy of the learning process. Researchers therefore pay close attention to feature selection to enhance the performance of machine learning algorithms. Identifying a suitable feature selection method is essential for a given machine learning task with high-dimensional data. Hence, it is required to conduct t…

Cited by 68 publications (43 citation statements)
References 72 publications
Citing publications: 2017–2023

Citation statements (ordered by relevance):
“…The random search examines feature space in a random manner. It can begin with a random feature or specified feature and add features randomly to get the best subset found [37][38][39].…”
Section: Feature Selection (mentioning)
confidence: 99%
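To make the random-search strategy quoted above concrete, here is a minimal Python sketch. It is an illustration, not code from the reviewed paper: the `evaluate` scoring function is a hypothetical placeholder (e.g., cross-validated accuracy of a classifier restricted to the candidate subset).

```python
import random

def random_search(features, evaluate, n_iter=100, seed=0):
    """Random-search feature selection: start from a random feature,
    add further features at random, and keep the best subset found."""
    rng = random.Random(seed)
    best_subset, best_score = None, float("-inf")
    for _ in range(n_iter):
        subset = {rng.choice(features)}                # random starting feature
        n_extra = rng.randint(0, len(features) - 1)
        subset.update(rng.sample(features, n_extra))   # random additions
        score = evaluate(sorted(subset))               # user-supplied scorer
        if score > best_score:
            best_subset, best_score = set(subset), score
    return best_subset, best_score
```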
“…One of the challenges in data mining is high dimensional data analysis [1][2][3][4][5][6][7]. Having a small sample set adds to the difficulty of the problem.…”
Section: Introduction (mentioning)
confidence: 99%
“…Generally, high-dimensional remotely sensed datasets contain irrelevant information and highly redundant features. Such dimensionality deteriorates quantitative (e.g., leaf area index and biomass) and qualitative (e.g., land-cover) performance of statistical algorithms by overfitting data [10]. High dimensional data are often associated with the Hughes effects or the curse of dimensionality, a phenomenon that occurs when the number of features in a dataset is greater than the number of samples [11,12].…”
Section: Introduction (mentioning)
confidence: 99%
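As an illustration of the Hughes effect described above (an assumption-labeled sketch, not material from the paper), the following fits a classifier to pure-noise data with many more features than samples: training accuracy is near perfect while cross-validated accuracy stays at chance, the signature of overfitting in the p ≫ n regime.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples, n_features = 50, 500                 # more features than samples
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, 2, size=n_samples)          # labels are pure noise

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)
print("train accuracy:", clf.score(X, y))                  # close to 1.0
print("cv accuracy:", cross_val_score(clf, X, y).mean())   # near chance (0.5)
```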
“…There are two main components of dimension reduction strategies: feature extraction or construction and feature selection or feature ranking. Feature extraction (e.g., Principal Component Analysis (PCA)), constructs a new and low dimensional feature space using linear or non-linear combinations of the original high-dimensional feature space [14] while feature selection (e.g., Fisher Score and Information Gain) extracts subsets from existing features [10]. Although feature extraction methods produce higher classification accuracies, the interpretation of generated results is often challenging [2].…”
Section: Introduction (mentioning)
confidence: 99%
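A brief sketch contrasting the two families on a toy dataset may help: PCA illustrates feature extraction (new axes built from all original features), while scikit-learn's SelectKBest with mutual_info_classif stands in for an Information Gain filter (an assumption, since Fisher Score has no built-in scikit-learn scorer); the selected columns keep their original meaning, which is why selection is often easier to interpret.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_digits(return_X_y=True)   # 1797 samples, 64 pixel features

# Feature extraction: 10 new axes, each a linear combination of all
# 64 original pixels -- compact but hard to interpret.
X_pca = PCA(n_components=10).fit_transform(X)

# Feature selection: keep 10 of the original pixels ranked by a
# mutual-information filter score, so retained features stay interpretable.
X_sel = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)

print(X_pca.shape, X_sel.shape)       # (1797, 10) (1797, 10)
```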