Features Engineering for Malware Family Classification Based API Call

Daeef, Ammar Yahya; Al-Naji, Ali; Chahl, Javaan

doi:10.3390/computers11110160

Cited by 9 publications

(7 citation statements)

References 30 publications

“…The first five methods are classic methods [14,44–47] to do the malware family classification, and we report the results from their papers. The following five methods [16,20,21,23,48] are the latest effective work on the classification based on API calls, so we reproduce the methods and offer a convincing comparison result. The [21] method adopts a two-way feature extraction architecture for API calls, but the core module is a multi-layer CNN, and the correlation analysis is performed through Bi-LSTM.…”

Section: Comparison With Previous Methods (mentioning)

confidence: 99%

“…The results of their endeavors demonstrate significant performance enhancements when compared to baseline methodologies, highlighting the efficacy of introducing additional intrinsic features associated with APIs. Some works consider the similarity among the features, especially API call sequences, and employ similarity to do the encoder, followed by some advanced models such as GNN [22], Random Forest, LSTM [23], and F-RCNN [24].…”

Section: Deep Learning-based or Api-call-related Malware Classification (mentioning)

confidence: 99%

“…
Malware Image + GIST [44] | File content | 63,002 | 531 | 0.7280
Malware Image + CNN [45] | File content | 10,868 | 9 | 0.9176
Malware Image + GRU-SVM [46] | File content | 9339 | 25 | 0.8492
BBIS + CARL [47] | API calls | 3131 | 28 | 0.8840 (F1)
NLP(TF-IDF) + SVM [14] | API calls | 23,080 | 10 | 0.8654
Category Vector + CNN [16] | API calls | 23,080 | 10 | 0.8797
Frequence Vector + RF [23] | API

In Table 5, we further compared the performance of the latest seven methods on the second dataset to demonstrate the robustness and broad performance of the model. From Table 5, the performance of all three models has decreased on the second dataset, but our proposed method still performs the best.…”

Section: Features Samples Families Accuracy (mentioning)

confidence: 99%

TTDAT: Two-Step Training Dual Attention Transformer for Malware Classification Based on API Call Sequences

Wang, Lin, et al. 2023

Applied Sciences

The surge in malware threats propelled by the rapid evolution of the internet and smart device technology necessitates effective automatic malware classification for robust system security. While existing research has primarily relied on some feature extraction techniques, issues such as information loss and computational overhead persist, especially in instruction-level tracking. To address these issues, this paper focuses on the nuanced analysis of API (Application Programming Interface) call sequences between the malware and system and introduces TTDAT (Two-step Training Dual Attention Transformer) for malware classification. TTDAT utilizes Transformer architecture with original multi-head attention and an integrated local attention module, streamlining the encoding of API sequences and extracting both global and local patterns. To expedite detection, we introduce a two-step training strategy: ensemble Transformer models to generate class representation vectors, thereby bolstering efficiency and adaptability. Our extensive experiments demonstrate TTDAT’s effectiveness, showcasing state-of-the-art results with an average F1 score of 0.90 and an accuracy of 0.96.

“…Hansen et al [26] employed API call sequences and frequency to identify and classify malware by utilising the Random Forest classifier. Daeef et al [27] proposed a method to uncover the underlying patterns of malicious behaviour among different malware families by utilising the Jaccard index and visualisation techniques. J. Singh et al [28] and Albishry et al [29] explained how ML techniques have been widely utilised in the field of malware detection.…”

Section: Related Work (mentioning)

confidence: 99%
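
The Jaccard index named in the citation statement above is the size of the intersection of two sets divided by the size of their union. The following is a minimal sketch of how it could be applied to API calls, assuming each sample is reduced to the set of API names it invokes. The function name `jaccard_index` and the two example traces are hypothetical and are not taken from the cited paper.

```python
def jaccard_index(a, b):
    """Return |A ∩ B| / |A ∪ B|, treating both inputs as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Convention chosen here: two empty sets count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical API call traces for two samples (illustrative only).
trace_a = ["CreateFileW", "WriteFile", "RegSetValueExW", "CloseHandle"]
trace_b = ["CreateFileW", "ReadFile", "RegSetValueExW", "CloseHandle"]

similarity = jaccard_index(trace_a, trace_b)
print(similarity)  # 3 shared calls out of 5 distinct calls
```

Because the inputs are treated as sets, call order and call frequency are ignored; a sequence-aware comparison would need a different measure.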

Dynamic Malware Classification and API Categorisation of Windows Portable Executable Files Using Machine Learning

Syeda, Asghar 2024

Applied Sciences

The rise of malware attacks presents a significant cyber-security challenge, with advanced techniques and offline command-and-control (C2) servers causing disruptions and financial losses. This paper proposes a methodology for dynamic malware analysis and classification using a malware Portable Executable (PE) file from the MalwareBazaar repository. It suggests effective strategies to mitigate the impact of evolving malware threats. For this purpose, a five-level approach for data management and experiments was utilised: (1) generation of a customised dataset by analysing a total of 582 malware and 438 goodware samples from Windows PE files; (2) feature extraction and feature scoring based on Chi2 and Gini importance; (3) empirical evaluation of six state-of-the-art baseline machine learning (ML) models, including Logistic Regression (LR), Support Vector Machine (SVM), Naive Bayes (NB), Random Forest (RF), XGBoost (XGB), and K-Nearest Neighbour (KNN), with the curated dataset; (4) malware family classification using VirusTotal APIs; and, finally, (5) categorisation of 23 distinct APIs from 266 malware APIs. According to the results, Gini’s method takes a holistic view of feature scoring, considering a wider range of API activities. The RF achieved the highest precision of 0.99, accuracy of 0.96, area under the curve (AUC) of 0.98, and F1-score of 0.96, with a 0.93 true-positive rate (TPR) and 0.0098 false-positive rate (FPR), among all applied ML models. The results show that Trojans (27%) and ransomware (22%) are the most risky among 11 malware families. Windows-based APIs (22%), the file system (12%), and registry manipulation (8.2%) showcased their importance in detecting malicious activity in API categorisation. 
This paper considers a dual approach for feature reduction and scoring, resulting in an improved F1-score (2%), and the inclusion of AUC and specificity metrics distinguishes it from existing research (Section Comparative Analysis with Existing Approaches). The newly generated dataset is publicly available in the GitHub repository (Data Availability Statement) to facilitate aspirant researchers’ dynamic malware analysis.
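
To illustrate the kind of feature scoring the abstract refers to, here is a minimal sketch of a Pearson chi-square statistic for one binary feature (whether a sample calls a given API) against a binary label (malware or goodware). It is the textbook 2x2 contingency-table statistic and may differ from the exact Chi2 scoring used in the paper. The function name `chi2_binary` and the toy data are assumptions made for illustration.

```python
def chi2_binary(feature, labels):
    """Pearson chi-square statistic for a 0/1 feature against a 0/1 label."""
    n = len(feature)
    # observed[f][y] counts samples with feature value f and label y.
    observed = [[0, 0], [0, 0]]
    for f, y in zip(feature, labels):
        observed[f][y] += 1
    row = [sum(r) for r in observed]
    col = [observed[0][j] + observed[1][j] for j in range(2)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            if expected > 0:
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat


# Toy data: 1 = malware, 0 = goodware (illustrative only).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
uses_api_a = [1, 1, 1, 1, 0, 0, 0, 0]  # present exactly in the malware samples
uses_api_b = [1, 0, 1, 0, 1, 0, 1, 0]  # unrelated to the label

score_a = chi2_binary(uses_api_a, labels)
score_b = chi2_binary(uses_api_b, labels)
print(score_a, score_b)
```

A higher score means the API's presence is more strongly associated with the label, so in this toy data `uses_api_a` would rank above `uses_api_b`.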

“…The results of a recent research [1,2] carried out by AV-TEST reveal that more than 9 million new instances of malicious software have been launched, and that there are presently 1363.92 million detected instances of malicious software functioning in the environment. These results underline the need for significant and continuing technological improvement in order to avoid the emergence of new dangers.…”

Section: Introduction (mentioning)

confidence: 99%

An Investigation of Quantum and Parallel Computing Effects on Malware Families Classification

Taha 2023

JASTT

The proliferation of malicious software is a major concern for organizations and consumers alike. Malware is used to compromise computer systems and networks for malevolent purposes. Consequently, categorizing malware is essential for safeguarding systems from harmful assaults. Developers of malicious software are always coming up with novel techniques to avoid detection by security researchers. However, in recent years, quantum computing has developed rapidly and shown considerable advantages in a number of sectors, particularly in the area of cybersecurity. A quantum approach may be useful in conjunction with existing software for finding the most often occurring hashes and n-grams that are characteristic of malicious software. The time it takes to map n-grams to their hashes may be reduced if we load the table of hashes and n-grams into a quantum computer. The first step is to utilize Kilogram to identify the most prevalent hashes and n-grams in a large collection of malware. Once the hash table is generated, it is sent into a quantum simulator. The entangled key-value pairs are then searched through a quantum search method to locate the appropriate hash value. In contrast to the quantum algorithm's potential runtime of O(N) in the number of table lookups required to get the requisite hash values, re-computing hashes for a set of n-grams may take on average O(MN) time. The main purpose of this research is to address the significant effects of quantum and parallel computing on malware families’ classification.
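
The classical part of the pipeline described above (finding the most frequent n-grams and building a hash table for them) can be sketched as follows. This is a simplified stand-in under stated assumptions: it uses exact counting with `collections.Counter` in place of the Kilogram tool, CRC32 as an arbitrary example hash, and made-up byte strings as samples. The quantum search step is not modelled.

```python
from collections import Counter
import zlib


def top_ngrams(samples, n=4, k=3):
    """Count byte n-grams across all samples and return the k most common."""
    counts = Counter()
    for data in samples:
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return counts.most_common(k)


def build_hash_table(ngrams):
    """Map each n-gram's CRC32 hash to the n-gram itself (assumed hash choice)."""
    return {zlib.crc32(g): g for g in ngrams}


# Made-up byte strings standing in for malware samples (illustrative only).
samples = [b"\x90\x90\x90\x90\xcc", b"\x90\x90\x90\x90\xeb"]

top = top_ngrams(samples, n=4, k=1)
table = build_hash_table(g for g, _ in top)
print(top, table)
```

Looking up an n-gram by its hash in `table` is the step the abstract proposes to hand to a quantum search routine.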
