2008 IEEE International Conference on Intelligence and Security Informatics
DOI: 10.1109/isi.2008.4565046

Unknown malcode detection via text categorization and the imbalance problem

Abstract: Today's signature-based anti-viruses are very accurate, but are limited in detecting new malicious code. Currently, dozens of new malicious codes are created every day, and this number is expected to increase in the coming years. Recently, classification algorithms were used successfully for the detection of unknown malicious code. These studies used a test collection of limited size with the same malicious-to-benign file ratio in both the training and test sets, which does not reflect real-life conditions. I…

Cited by 82 publications (84 citation statements)
References 10 publications
“…We performed an extensive evaluation of a test collection of more than 30,000 files, in which we evaluated extensively the OpCode n-gram representation and investigated the imbalance problem, referring to real-life scenarios, in which the malicious file content is expected to be about 10% of the total files. Our results indicate that greater than 99% accuracy can be achieved through the use of a training set that has a malicious file percentage lower than 15%, which is higher than in our previous experience with byte sequence n-gram representation [1]. …”
contrasting
confidence: 56%
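The imbalance experiment this citation describes can be pictured with a short sketch. The code below is an illustrative assumption, not the cited authors' implementation: it trains a generic scikit-learn classifier on training sets with different malicious-file fractions and scores each on a fixed, roughly 10%-malicious test set. The feature matrices are random placeholders standing in for n-gram features.

# Minimal sketch (assumed setup, not the authors' code): vary the malicious-file
# fraction in the training set and evaluate on a fixed ~10%-malicious test set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def make_split(X_mal, X_ben, mal_fraction, n_total, rng):
    """Sample a labeled set with a given malicious-file fraction."""
    n_mal = int(n_total * mal_fraction)
    n_ben = n_total - n_mal
    mal_idx = rng.choice(len(X_mal), n_mal, replace=False)
    ben_idx = rng.choice(len(X_ben), n_ben, replace=False)
    X = np.vstack([X_mal[mal_idx], X_ben[ben_idx]])
    y = np.concatenate([np.ones(n_mal), np.zeros(n_ben)])
    return X, y

rng = np.random.default_rng(0)
X_mal = rng.random((5000, 300))    # placeholder malicious feature vectors
X_ben = rng.random((25000, 300))   # placeholder benign feature vectors

# Disjoint pools so the test set never overlaps the training set.
X_mal_tr, X_mal_te = X_mal[:4000], X_mal[4000:]
X_ben_tr, X_ben_te = X_ben[:20000], X_ben[20000:]

# Fixed "real-life" test set: about 10% malicious.
X_test, y_test = make_split(X_mal_te, X_ben_te, 0.10, 3000, rng)

for train_mal_pct in (0.05, 0.10, 0.15, 0.30, 0.50):
    X_train, y_train = make_split(X_mal_tr, X_ben_tr, train_mal_pct, 8000, rng)
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"train malicious fraction {train_mal_pct:.0%}: test accuracy {acc:.3f}")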
“…A classifier is a rule set that is learnt from a given training set, including examples of classes, both malicious and benign files in our case. Recent studies, which we survey in the next section, and our experience [1], have shown that using byte sequence n-grams to represent the binary files yields very good results. A recent survey done by McAfee indicates that about 4% of search results from the major search engines on the web contain malicious code.…”
Section: Introduction
mentioning
confidence: 94%
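For readers unfamiliar with the byte-sequence n-gram representation this citation refers to, the following is a minimal sketch under stated assumptions: the function name, file path, and feature-selection step are illustrative, not taken from the paper. It counts overlapping byte n-grams in a binary and keeps the most frequent ones as candidate classifier features.

# Small sketch (assumed representation): frequencies of overlapping byte n-grams.
from collections import Counter

def byte_ngrams(path: str, n: int = 3) -> Counter:
    """Count overlapping n-grams of raw bytes in a binary file."""
    with open(path, "rb") as f:
        data = f.read()
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

# Usage: the most frequent n-grams (hex-encoded) can serve as classifier features.
counts = byte_ngrams("sample.exe", n=3)   # hypothetical file
print([gram.hex() for gram, _ in counts.most_common(10)])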