2019
DOI: 10.1016/j.ins.2018.10.052

Distributed correlation-based feature selection in Spark

Abstract: Feature selection (FS) is a key preprocessing step in data mining. CFS (Correlation-Based Feature Selection) is an FS algorithm that has been successfully applied to classification problems in many domains. We describe Distributed CFS (DiCFS) as a completely redesigned, scalable, parallel and distributed version of the CFS algorithm, capable of dealing with the large volumes of data typical of big data applications. Two versions of the algorithm were implemented and compared using the Apache Spark cluster comp…
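As context for the abstract, the sketch below is a hypothetical illustration, not the authors' DiCFS code: it only shows how per-feature class correlations can be computed with distributed passes over a Spark DataFrame, with Pearson correlation via DataFrame.stat.corr standing in for the symmetrical uncertainty measure that CFS actually uses; the file path and column names are assumptions.

```python
# Hypothetical PySpark sketch, not the DiCFS implementation from the paper:
# score each feature by its correlation with the class label on a Spark DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cfs-correlation-sketch").getOrCreate()

# Assumed layout: one numeric column per feature plus a numeric "label" column.
df = spark.read.parquet("features.parquet")  # illustrative path
feature_cols = [c for c in df.columns if c != "label"]

# DataFrame.stat.corr runs a distributed Pearson correlation for each column pair;
# CFS itself relies on symmetrical uncertainty, used here only as a stand-in.
feature_class_corr = {c: df.stat.corr(c, "label") for c in feature_cols}

# Rank features by absolute correlation with the class.
ranked = sorted(feature_class_corr.items(), key=lambda kv: abs(kv[1]), reverse=True)
print(ranked[:10])
```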

Cited by 40 publications (20 citation statements)
References 30 publications
“…We can say that this heuristic is the core concept of the CFS algorithm. It is a filtering method that applies a principle derived from Ghiselli's test theory: good subsets of features contain features highly correlated with the class but uncorrelated with each other [12][13][14][15]. The CFS feature subset evaluation function is defined as [12,14]:…”
Section: Methods
confidence: 99%
“…where merit_S is the value of the feature subset, k is the number of features, r̄_cf is the average value of the class-feature correlation, and r̄_ff is the average value of the feature-feature intercorrelation [6].…”
Section: The CFS Technique
confidence: 99%
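To make the quoted description concrete, here is a minimal sketch of the CFS merit heuristic as commonly stated, merit_S = k·r̄_cf / sqrt(k + k·(k-1)·r̄_ff); the function and variable names are illustrative, and plain correlation magnitudes stand in for the symmetrical uncertainty that CFS normally uses.

```python
# Minimal sketch of the CFS merit heuristic (illustrative names, not library code).
import numpy as np

def cfs_merit(r_cf, r_ff):
    """Merit of a feature subset.

    r_cf: class-feature correlations for the k features in the subset, shape (k,)
    r_ff: pairwise feature-feature correlations for the subset, shape (k, k)
    """
    k = len(r_cf)
    mean_rcf = float(np.mean(np.abs(r_cf)))          # average class-feature correlation
    if k > 1:
        off_diag = r_ff[~np.eye(k, dtype=bool)]      # drop self-correlations
        mean_rff = float(np.mean(np.abs(off_diag)))  # average feature-feature intercorrelation
    else:
        mean_rff = 0.0
    return k * mean_rcf / np.sqrt(k + k * (k - 1) * mean_rff)

# Example: three features moderately correlated with the class, weakly with each other.
r_cf = np.array([0.6, 0.5, 0.4])
r_ff = np.array([[1.0, 0.2, 0.1],
                 [0.2, 1.0, 0.3],
                 [0.1, 0.3, 1.0]])
print(cfs_merit(r_cf, r_ff))  # higher merit indicates a better subset
```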
“…ReliefF gives a ranking value to each feature against its class attribute; the features with the highest weight will positively impact the classification process. Meanwhile, CFS assesses the worth of a subset of features using merit_s calculations based on the correlation between features and the class, as well as the correlation between features and other features; the greater the merit_s value of a subset, the better its impact on the classification process [6]. The support vector machine (SVM) classification technique was chosen because it can produce better accuracy with microarray data compared to several other classification techniques [7][8][9].…”
Section: Introduction
confidence: 99%
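The following sketch, assuming a NumPy feature matrix X and label vector y (it is not code from the cited works), shows how a merit score of this kind can drive a greedy forward search over feature subsets, the usual way CFS-style selection is applied before training a classifier such as SVM.

```python
# Illustrative greedy forward selection guided by a CFS-style merit score.
# Absolute Pearson correlation stands in for symmetrical uncertainty.
import numpy as np

def forward_cfs(X, y, max_features=10):
    n_features = X.shape[1]
    # Precompute class-feature and feature-feature correlation magnitudes once.
    r_cf = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)])
    r_ff = np.abs(np.corrcoef(X, rowvar=False))

    def merit(subset):
        k = len(subset)
        mean_rcf = r_cf[subset].mean()
        if k == 1:
            return mean_rcf
        sub = r_ff[np.ix_(subset, subset)]
        mean_rff = sub[~np.eye(k, dtype=bool)].mean()
        return k * mean_rcf / np.sqrt(k + k * (k - 1) * mean_rff)

    selected = []
    while len(selected) < max_features:
        candidates = [j for j in range(n_features) if j not in selected]
        if not candidates:
            break
        best_score, best_j = max((merit(selected + [j]), j) for j in candidates)
        if selected and best_score <= merit(selected):
            break  # no remaining feature improves the subset's merit
        selected.append(best_j)
    return selected
```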
“…The data often has many dimensions in some domains, such as gene analysis [27,28], cancer classification [29], robotics [30], satellite image processing [31], and big data [32][33][34], which makes feature selection techniques essential.…”
Section: Related Work
confidence: 99%