2015
DOI: 10.1007/978-3-319-15618-7_5

Are Your Training Datasets Yet Relevant?

Abstract: In this paper, we consider the relevance of timeline in the construction of datasets, to highlight its impact on the performance of a machine learning-based malware detection scheme. Typically, we show that simply picking a random set of known malware to train a malware detector, as it is done in many assessment scenarios from the literature, yields significantly biased results. In the process of assessing the extent of this impact through various experiments, we were also able to confirm a number of intuitive…

Cited by 54 publications (50 citation statements)
References 29 publications
“…These results are particularly notable since previous work has demonstrated that machine learning-based Android malware detection was unable to obtain an F1 score higher than 70% in a time-aware scenario [56]. In that work, dates newer in time resulted in lower F1 scores; however, RevealDroid actually improves to as high as 99%.…”
Section: A31 Rq1: Detection Accuracy (mentioning)
confidence: 72%
“…We grouped apps into two-year time periods, due to the fact that some years only have a few apps, mainly 2009 with 29 apps, and 2017 with 130 apps. Similar to [55], we consider the year of the last modified date of classes.dex in an app as the year from which it originates. We consider any transformed app as belonging to the same year as its original version, in order to determine the actual effect of obfuscation on product accuracy for each time period.…”
Section: Rq3 Time-aware Analysis (mentioning)
confidence: 99%
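
The dating rule quoted above (year of the last modified date of classes.dex, grouped into two-year periods) can be sketched as follows. This is a minimal illustration, not the citing authors' code: the function names `dex_year` and `two_year_bucket` and the `base_year` alignment of the periods are assumptions, since the statement does not say where the two-year windows start.

```python
import zipfile


def dex_year(apk):
    """Return the year stored in the ZIP entry timestamp of classes.dex.

    `apk` may be a path or a file-like object; an APK is a ZIP archive.
    """
    with zipfile.ZipFile(apk) as zf:
        # date_time is a (year, month, day, hour, minute, second) tuple.
        return zf.getinfo("classes.dex").date_time[0]


def two_year_bucket(year, base_year=2009):
    """Map a year to an inclusive two-year period (start, end).

    base_year is a hypothetical alignment choice, not taken from the source.
    """
    start = base_year + 2 * ((year - base_year) // 2)
    return (start, start + 1)
```

With the assumed alignment, an app whose classes.dex is dated 2010 falls in the (2009, 2010) period. ZIP timestamps can be altered when an app is repackaged, so the derived year is only as reliable as the archive metadata.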
“…In a time-agnostic scenario, training and testing as part of machine learning is conducted without considering the age of apps in the dataset. This scenario has been utilized to evaluate an overwhelming majority of machine-learning-based Android malware-detection approaches [13]. A time-aware scenario uses the modification date of apps to determine training and testing sets, which avoids training on apps from the future to test on apps from the past.…”
Section: Rq1: Detection Accuracy (mentioning)
confidence: 99%
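
The difference between the two scenarios described in this statement can be sketched in a few lines. This is an illustrative sketch only: the record layout (a dict with a `modified` date), the function names, and the single-cutoff split are assumptions rather than the protocol of any cited paper.

```python
import random
from datetime import date


def time_agnostic_split(apps, test_fraction=0.25, seed=0):
    """Shuffle and split while ignoring app age (the common practice)."""
    shuffled = list(apps)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def time_aware_split(apps, cutoff):
    """Train only on apps modified before `cutoff`, test on the rest.

    This prevents training on apps from the future to test on apps
    from the past.
    """
    train = [a for a in apps if a["modified"] < cutoff]
    test = [a for a in apps if a["modified"] >= cutoff]
    return train, test


# Hypothetical sample records for illustration.
apps = [
    {"name": "a", "modified": date(2012, 5, 1)},
    {"name": "b", "modified": date(2013, 2, 10)},
    {"name": "c", "modified": date(2014, 7, 3)},
    {"name": "d", "modified": date(2015, 1, 20)},
]

train, test = time_aware_split(apps, cutoff=date(2014, 1, 1))
```

In the time-aware split every training app predates every test app, whereas the shuffled split can place a 2015 app in training and a 2012 app in testing.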