Using NLP techniques for file fragment classification

Fitzgerald, Simran; Mathews, George; Morris, Colin W.; Zhulyn, Oles

doi:10.1016/j.diin.2012.05.008

Cited by 79 publications

(59 citation statements)

References 4 publications


“…Due to compression algorithms, statistical properties of data cannot be used to classify the deflate-encoded data from different file formats. This fact is the reason why previous approaches that exploit statistical properties of compressed data as feature vectors brought low accurate rate [9,18]. Even from the empirical approach of [1], the authors took the advantage of compression properties such as Huffman table size, the detection rate is still low.…”

Section: Proposed Methods (mentioning)

confidence: 95%
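
The effect described in the statement above can be illustrated with a small sketch. It assumes Python's standard `zlib` module as a stand-in for a deflate encoder (it emits a deflate stream wrapped in a zlib header and checksum). The synthetic inputs and the helper name `shannon_entropy` are hypothetical and are not taken from the cited papers.

```python
import math
import random
import zlib
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Two synthetic "formats" with clearly different byte statistics (toy data only).
rng = random.Random(0)
vocabulary = [b"file", b"fragment", b"forensic", b"classify",
              b"header", b"entropy", b"cluster", b"byte"]
prose_like = b" ".join(rng.choice(vocabulary) for _ in range(20000))
table_like = b",".join(str(rng.randrange(10000)).encode() for _ in range(20000))

results = {}
for name, raw in (("prose_like", prose_like), ("table_like", table_like)):
    compressed = zlib.compress(raw, 9)  # deflate payload inside a zlib wrapper
    results[name] = (shannon_entropy(raw), shannon_entropy(compressed))
    print(f"{name}: raw={results[name][0]:.2f} compressed={results[name][1]:.2f} bits/byte")
```

In this sketch both inputs start with low byte entropy and end close to the 8 bits/byte ceiling once compressed, which is why byte-level statistics alone carry little information about the original format.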

“…Therefore, they have been included in the data set for many research works. Most of current approaches provide low identification rates to file fragments of compound files which are less than 30 % as reported in [8,9]. Other works bring better identification rates but with much smaller size or small number of file types as considered in [10,11].…”

Section: Introduction (mentioning)

confidence: 91%

“…Thirdly, SVM with byte frequency distribution as feature vector can be used to efficiently identify these inflate data. Moreover, it is demonstrated that SVM can provide high accuracy and performance to recognize the data fragments which have low or medium entropy values [9,11]. In addition, it is showed that entropy-based clustering before feeding data into machine learning techniques might bring higher accuracy [15].…”

Section: Introduction (mentioning)

confidence: 99%
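
A byte frequency distribution feature vector, as mentioned in the statement above, can be sketched as a 256-entry normalized histogram. The cited work feeds such vectors to an SVM; to stay within the Python standard library, this sketch swaps in a nearest-centroid rule, which is an assumption made for illustration and not the cited method. All function names and the toy training fragments are hypothetical.

```python
import math
from collections import Counter


def byte_frequency_distribution(fragment: bytes) -> list[float]:
    """Return the relative frequency of each of the 256 byte values."""
    counts = Counter(fragment)
    n = len(fragment) or 1  # avoid division by zero for empty fragments
    return [counts.get(value, 0) / n for value in range(256)]


def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of equally sized feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def euclidean(u: list[float], v: list[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def classify(fragment: bytes, centroids: dict[str, list[float]]) -> str:
    """Assign the label whose centroid is closest to the fragment's BFD."""
    bfd = byte_frequency_distribution(fragment)
    return min(centroids, key=lambda label: euclidean(bfd, centroids[label]))


# Toy labelled fragments (illustrative only; real work uses large corpora).
training = {
    "text": [b"the quick brown fox jumps over the lazy dog",
             b"file fragments are classified by their content"],
    "csv": [b"12,45,7,89,100,3", b"0.5,1.25,3.75,10.0"],
}
centroids = {
    label: centroid([byte_frequency_distribution(f) for f in fragments])
    for label, fragments in training.items()
}

print(classify(b"a short plain english sentence", centroids))
print(classify(b"3,14,15,92,65", centroids))
```

A real classifier would replace `classify` with a trained SVM over the same vectors; the feature extraction step stays the same.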


A New Approach to Compressed File Fragment Identification

Nguyen, Tran, et al. (2015). Advances in Intelligent Systems and Computing.

Identifying the underlying type of a file given only a file fragment is a major challenge in digital forensics. Many methods have been applied to file type identification; however, identification accuracy for most file types is still very low, especially for files with complex structures, because their contents are compound data built from different data types. In this paper, we propose a new approach based on deflate-encoded data detection, entropy-based clustering, and machine learning techniques to identify deflate-encoded file fragments. Experiments on the popular compound file type showed high identification accuracy for the proposed method.


“…The statistical patterns denote quantitative features such as mean, variance and frequency, whereas the structural patterns denote morphological features such as syntactic grammar and interrelationship [21]. Recently, many researches of format classification involve statistical approaches as in [15][16][17][18]. Because of the rapid growth of the capacity of multimedia, many formats utilize compression methods to reduce the cost and lead to generate high entropy data.…”

Section: Format Feature Extraction (mentioning)

confidence: 99%

“…SVM is a powerful machine learning method because it is not limited by number of samples and dimensionality [11][12][13][14]. In [15][16][17][18], researchers used statistical features, such as mean, standard deviation, byte frequency distribution, Shannon entropy, N-gram and Hamming weight to classify the formats. Because of strong compression and entropy coding of audio file, it is hard to achieve high accuracy of classification only with statistical features.…”

Section: Introduction (mentioning)

confidence: 99%
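
Several of the statistical features listed in the statement above (mean, standard deviation, Hamming weight, and N-grams) can be computed directly from a fragment's bytes. The following is a minimal sketch under stated assumptions: the function names, the choice of population standard deviation, and the use of a distinct-bigram count as the N-gram summary are all hypothetical choices, not definitions from the cited papers.

```python
import statistics
from collections import Counter


def hamming_weight(fragment: bytes) -> float:
    """Fraction of bits set to 1 across the whole fragment."""
    if not fragment:
        return 0.0
    ones = sum(bin(value).count("1") for value in fragment)
    return ones / (8 * len(fragment))


def byte_ngrams(fragment: bytes, n: int = 2) -> Counter:
    """Count every overlapping run of n consecutive bytes."""
    return Counter(fragment[i:i + n] for i in range(len(fragment) - n + 1))


def statistical_features(fragment: bytes) -> dict[str, float]:
    """Summarize a non-empty fragment with a few simple statistics."""
    if not fragment:
        raise ValueError("fragment must not be empty")
    values = list(fragment)  # iterating bytes yields integers 0..255
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),  # population standard deviation
        "hamming_weight": hamming_weight(fragment),
        "distinct_bigrams": float(len(byte_ngrams(fragment, 2))),
    }


print(statistical_features(b"ID3\x03\x00\x00\x00\x00\x0f\x76TIT2"))
```

Features like these are cheap to compute per fragment, but as the statement notes, they tend to be weak discriminators on strongly compressed or entropy-coded data.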

Audio Fragment Identification System

Jin, Kim (2014). IJMUE.

Fragments‐Expert: A graphical user interface MATLAB toolbox for classification of file fragments

Teimouri, Seyedghorban, Amirjani (2020). Concurrency and Computation.

Summary: The classification of file fragments of various file formats is an essential task in applications such as firewalls, intrusion detection systems, antiviruses, web content filtering, and digital forensics. However, the community lacks a suitable software tool that integrates the major methods for feature extraction from file fragments and classification among file formats. In this article, we present Fragments‐Expert, a graphical user interface MATLAB toolbox for the classification of file fragments. It provides users with 23 categories of features extracted from file fragments. These features can be employed by seven categories of machine learning algorithms for the task of classification among various file formats.
