2019
DOI: 10.1016/j.ins.2018.10.052

Distributed correlation-based feature selection in Spark

Abstract: Feature selection (FS) is a key preprocessing step in data mining. CFS (Correlation-Based Feature Selection) is an FS algorithm that has been successfully applied to classification problems in many domains. We describe Distributed CFS (DiCFS) as a completely redesigned, scalable, parallel and distributed version of the CFS algorithm, capable of dealing with the large volumes of data typical of big data applications. Two versions of the algorithm were implemented and compared using the Apache Spark cluster comp…
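As context for the abstract, the sketch below is a hypothetical illustration, not the authors' DiCFS code: it only shows how per-feature class correlations can be computed with distributed passes over a Spark DataFrame, with Pearson correlation via DataFrame.stat.corr standing in for the symmetrical uncertainty measure that CFS actually uses; the file path and column names are assumptions.

```python
# Hypothetical PySpark sketch, not the DiCFS implementation from the paper:
# score each feature by its correlation with the class label on a Spark DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cfs-correlation-sketch").getOrCreate()

# Assumed layout: one numeric column per feature plus a numeric "label" column.
df = spark.read.parquet("features.parquet")  # illustrative path
feature_cols = [c for c in df.columns if c != "label"]

# DataFrame.stat.corr runs a distributed Pearson correlation for each column pair;
# CFS itself relies on symmetrical uncertainty, used here only as a stand-in.
feature_class_corr = {c: df.stat.corr(c, "label") for c in feature_cols}

# Rank features by absolute correlation with the class.
ranked = sorted(feature_class_corr.items(), key=lambda kv: abs(kv[1]), reverse=True)
print(ranked[:10])
```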

Cited by 40 publications (20 citation statements)
References 30 publications
“…We can say that this heuristic is the core concept of the CFS algorithm. It is a filtering method that applies a principle derived from Ghiselli's test theory: good subsets of features contain features highly correlated with the class but uncorrelated with each other [12][13][14][15]. The CFS feature subset evaluation function is defined as [12,14]:…”
Section: Methods
confidence: 99%
“…where merit_S is the value of the feature subset, k is the number of features, r̄_cf is the average value of the class-feature correlation, and r̄_ff is the average value of the feature-feature intercorrelation [6].…”
Section: The CFS Technique
confidence: 99%
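To make the quoted description concrete, here is a minimal sketch of the CFS merit heuristic as commonly stated, merit_S = k·r̄_cf / sqrt(k + k·(k-1)·r̄_ff); the function and variable names are illustrative, and plain correlation magnitudes stand in for the symmetrical uncertainty that CFS normally uses.

```python
# Minimal sketch of the CFS merit heuristic (illustrative names, not library code).
import numpy as np

def cfs_merit(r_cf, r_ff):
    """Merit of a feature subset.

    r_cf: class-feature correlations for the k features in the subset, shape (k,)
    r_ff: pairwise feature-feature correlations for the subset, shape (k, k)
    """
    k = len(r_cf)
    mean_rcf = float(np.mean(np.abs(r_cf)))          # average class-feature correlation
    if k > 1:
        off_diag = r_ff[~np.eye(k, dtype=bool)]      # drop self-correlations
        mean_rff = float(np.mean(np.abs(off_diag)))  # average feature-feature intercorrelation
    else:
        mean_rff = 0.0
    return k * mean_rcf / np.sqrt(k + k * (k - 1) * mean_rff)

# Example: three features moderately correlated with the class, weakly with each other.
r_cf = np.array([0.6, 0.5, 0.4])
r_ff = np.array([[1.0, 0.2, 0.1],
                 [0.2, 1.0, 0.3],
                 [0.1, 0.3, 1.0]])
print(cfs_merit(r_cf, r_ff))  # higher merit indicates a better subset
```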
“…ReliefF gives a ranking value to each feature against its class attribute; the features with the highest weight will positively impact the classification process. Meanwhile, CFS assesses the worth of a subset of features using merit_s calculations based on the correlation between features and the class, as well as the correlation between features and other features; the greater the merit_s value of a subset, the better its impact on the classification process [6]. The support vector machine (SVM) classification technique was chosen because it can produce better accuracy with microarray data compared to several other classification techniques [7][8][9].…”
Section: Introduction
confidence: 99%
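The following sketch, assuming a NumPy feature matrix X and label vector y (it is not code from the cited works), shows how a merit score of this kind can drive a greedy forward search over feature subsets, the usual way CFS-style selection is applied before training a classifier such as SVM.

```python
# Illustrative greedy forward selection guided by a CFS-style merit score.
# Absolute Pearson correlation stands in for symmetrical uncertainty.
import numpy as np

def forward_cfs(X, y, max_features=10):
    n_features = X.shape[1]
    # Precompute class-feature and feature-feature correlation magnitudes once.
    r_cf = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)])
    r_ff = np.abs(np.corrcoef(X, rowvar=False))

    def merit(subset):
        k = len(subset)
        mean_rcf = r_cf[subset].mean()
        if k == 1:
            return mean_rcf
        sub = r_ff[np.ix_(subset, subset)]
        mean_rff = sub[~np.eye(k, dtype=bool)].mean()
        return k * mean_rcf / np.sqrt(k + k * (k - 1) * mean_rff)

    selected = []
    while len(selected) < max_features:
        candidates = [j for j in range(n_features) if j not in selected]
        if not candidates:
            break
        best_score, best_j = max((merit(selected + [j]), j) for j in candidates)
        if selected and best_score <= merit(selected):
            break  # no remaining feature improves the subset's merit
        selected.append(best_j)
    return selected
```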
“…The data often has many dimensions in some domains, such as gene analysis [27,28], cancer classification [29], robotics [30], satellite image processing [31], and big data [32][33][34], which makes feature selection techniques essential.…”
Section: Related Work
confidence: 99%