2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)
DOI: 10.1109/dsaa.2016.26

Projecting "Better Than Randomly": How to Reduce the Dimensionality of Very Large Datasets in a Way That Outperforms Random Projections

Abstract: For very large datasets, random projections (RP) have become the tool of choice for dimensionality reduction, owing to the computational complexity of principal component analysis. However, the recent development of randomized principal component analysis (RPCA) has opened up the possibility of obtaining approximate principal components on very large datasets. In this paper, we compare the performance of RPCA and RP in dimensionality reduction for supervised learning. In Experiment 1, we study a malware cla…
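The abstract's comparison can be sketched in a few lines. The snippet below is an illustrative NumPy-only toy, not the paper's implementation: the matrix sizes are assumptions, and the RPCA side uses a Halko-style randomized-SVD sketch as a stand-in for whatever RPCA routine the authors used.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 1000, 200, 10
X = rng.standard_normal((n, p))  # stand-in for a large feature matrix

# --- Random projection (RP): data-independent Gaussian matrix ---
R = rng.standard_normal((p, k)) / np.sqrt(k)
X_rp = X @ R

# --- Randomized PCA (RPCA): Halko-style randomized-SVD sketch ---
Xc = X - X.mean(axis=0)               # center the data
Y = Xc @ rng.standard_normal((p, k))  # sketch the range of Xc
Q, _ = np.linalg.qr(Y)                # orthonormal basis for the sketch
B = Q.T @ Xc                          # small k x p matrix
_, _, Vt = np.linalg.svd(B, full_matrices=False)
X_rpca = Xc @ Vt.T                    # project onto approximate top-k PCs

print(X_rp.shape, X_rpca.shape)  # (1000, 10) (1000, 10)
```

RP never looks at the data, so it is cheap but generic; RPCA spends one extra pass over X to adapt the projection to the data's dominant directions.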

Cited by 8 publications (8 citation statements)
References 14 publications
“…This dataset [15] consists of 4,608,517 portable executable files, each determined to be either malicious or clean. Each file is represented by 98,450 features, mostly binary, with mean density 0.0244.…”
Section: A. Data and Methods
confidence: 99%
“…To see this, we will express X̂ = ff^T X using the constructions in (6) and (8) and show: a) For points in R^m, the operation ff^T (which maps X to X̂) will leave points in I unchanged, and will map all points in I^⊥ to 0 [see (14), (15)]. Since I ≈ im(X) by Step 1, most points u ∈ im(X) will be well-approximated by the component u_a, where u = u_a + u_b and u_a ∈ I, u_b ∈ I^⊥.…”
Section: A. Overview
confidence: 99%
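The quoted argument relies on the standard property of an orthogonal projector: if f has orthonormal columns spanning a subspace I, then ff^T fixes every point of I and annihilates I^⊥. A minimal numerical check of just that property, with toy dimensions assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 6, 2
# f: m x k matrix with orthonormal columns spanning a subspace I of R^m
f, _ = np.linalg.qr(rng.standard_normal((m, k)))

P = f @ f.T  # orthogonal projector onto I

u_in = f @ rng.standard_normal(k)  # a point in I
u_perp = rng.standard_normal(m)
u_perp -= P @ u_perp               # its component in I-perp

assert np.allclose(P @ u_in, u_in)  # points in I are unchanged
assert np.allclose(P @ u_perp, 0)   # points in I-perp map to 0
```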
“…As recommended by [7], the density of the random projection matrix was set to 1/√P ≈ .003. We set K = 5000 based on the results of [14]. Obtaining influence scores for all 2.3 million samples took about an hour and a half using non-optimized scripts written in the Julia programming language and executed on a single r3.8x instance (with 32 workers for parallel processing) in Amazon's EC2 system.…”
Section: Methods
confidence: 99%
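A density of 1/√P matches the very sparse random projection construction of Li et al., in which entries of the projection matrix are ±√s (with s = √P) with probability 1/(2s) each and 0 otherwise. The sketch below is a toy with much smaller dimensions than the quoted P ≈ 98,450 and K = 5000, and uses a dense array for clarity (a real run at that scale would use a sparse matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
n, P, k = 200, 10_000, 50  # toy sizes; the cited work used P ~ 98,450, K = 5000

density = 1 / np.sqrt(P)   # = .01 here; ~ .003 for P ~ 98,450
s = 1 / density

# Very sparse random projection (Li et al.-style): entries are
# +1, 0, -1 with probabilities 1/(2s), 1 - 1/s, 1/(2s), scaled by sqrt(s)
signs = rng.choice([1.0, 0.0, -1.0], size=(P, k),
                   p=[density / 2, 1 - density, density / 2])
R = np.sqrt(s) * signs / np.sqrt(k)

X = rng.standard_normal((n, P))
X_proj = X @ R
print(X_proj.shape)  # (200, 50)
```

The sparsity is what makes the projection fast: only about one entry in √P of R is nonzero, so X @ R touches a small fraction of the data.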
“…The above paper uses random projections, as introduced by Dahl et al [7], to reduce the dimensionality of input features in an unsupervised fashion (projecting 50,000 features into 4,000). Wojnowicz et al [20] improved on this approach by introducing randomized principal component analysis. To our knowledge, their data set (11.7M malware samples) was the largest to date used in malware classification or clustering research.…”
Section: Related Work
confidence: 99%