Identification of Cybersecurity Specific Content Using Different Language Models

Mendsaikhan, Otgonpurev; Hasegawa, Hirokazu; Yamaguchi, Yukiko; Shimada, Hiroyuki; Bataa, Enkhbold

doi:10.2197/ipsjjip.28.623

Cited by 6 publications

(3 citation statements)

References 17 publications


“…The implementation of cyber security-specific filters using text classification models has attracted some interest in recent years. A cyber security-related classification model [12] based on the BERT language representation model [13] reportedly achieved a precision of 0.92 and a recall of 0.90, in a test set comprising of multiple types of textual data, such as Reddit and Stack Exchange discussions, as well as more formal sources such as security news outlet RSS feeds. However, the part of their test set that contains no cyber security-related data consists only of Reddit discussions.…”

Section: Related Work (mentioning)

confidence: 99%


Towards Selecting Informative Content for Cyber Threat Intelligence

Panagiotou, Iliou, Apostolou et al. 2021

2021 IEEE International Conference on Cyber Security and Resilience (CSR)

Nowadays, there is an increasing need for cyber security professionals to make use of tools that automatically extract Cyber Threat Intelligence (CTI) relying on information collected from relevant blogs and news sources that are publicly available. When such sources are used, an important part of the CTI extraction process is content selection, in which pages that do not contain CTI-related information should be filtered out. For this task, we apply supervised machine learning-based text classification techniques, trained on a new dataset created for the purposes of this work. Furthermore, we show in practice the importance of a good content selection process in a commonly used CTI extraction pipeline, by inspecting the results of the named entity recognition (NER) process that normally follows.



“…The next step was to decide the number of classes in which the articles would be classified. One option would be to consider two classes similarly to [12]. As though our goal is to select only those pages that contain information useful for CTI extraction purposes, we opted to use the following three classes.…”

Section: Data Collection and Annotation (mentioning)

confidence: 99%

Towards Selecting Informative Content for Cyber Threat Intelligence

Panagiotou, Iliou, Apostolou et al. 2021

2021 IEEE International Conference on Cyber Security and Resilience (CSR)

“…For instance, it has been used in research (see e.g., [20,21,22,23]) to track the rapid growth of novel malware and understand the nature of threats posed by them. Similarly, it has been used as a means of extracting structured information (such as TTPs) from large volumes of Cyber Threat Intelligence (CTI) reports, and converting them into actionable knowledge that can be exploited by analysts for cyber defence (see e.g., [24,25,26]). Finally, the MITRE ATT&CK framework has also been utilised as part of a larger pipeline to help with detection tasks, such as when systems incorporate knowledge from TTPs to detect malicious behaviour (see e.g., [27,28,29,30]).…”

Section: Introduction (mentioning)

confidence: 99%

To TTP or not to TTP?: Exploiting TTPs to Improve ML-based Malware Detection

Sharma, Giunchiglia, Birnbach et al. 2023

2023 IEEE International Conference on Cyber Security and Resilience (CSR)

In the last decade, machine learning (ML) methods have increasingly been applied to the task of malware detection. While these approaches have surely demonstrated their effectiveness, they still present limitations, some of which are a consequence of their purely data-driven nature. In this paper, we show how the MITRE ATT&CK framework of tactics, techniques, and procedures (TTPs) can be exploited to overcome such limitations and improve their ability to detect malware on networks. We conduct an extensive experimental analysis, testing 7 ML models on 5 large datasets comprising over 37 million flows. Our results clearly demonstrate that adding TTP-based features for training the models robustly improves their performance. Our models outperform the standard ones 922 times out of a total of 952, (i.e., 96.8% of the time), with the biggest improvements (up to 84.9% in terms of FPR) being observed in situations designed to be challenging for ML models.


Content and interaction-based mapping of Reddit posts related to information security

Charmanas, Mittas, Angelis 2024

J Comput Soc Sc

Ensuring the privacy and safety of platform users has become a complex objective due to the emerging threats that surround any type of network, software, and hardware. Scams, malwares, hackers, and security vulnerabilities form the epicenter of cyber threats causing severe damage to the affected systems and sensitive data of users. Thus, users turn to online social networks to report cyber threats, discuss topics of their interest, and obtain knowledge concerning the various perspectives of information security. In this study, we aim to address the concepts of social interactions surrounding information security-related content by retrieving and analyzing Reddit posts from 45 relevant subreddits. In this regard, a word clustering approach is employed, based on the Affinity Propagation algorithm, that leads to the extraction and interpretation of 54 concepts. These concepts are relevant to information security and some more generic areas of interest including social media, software vendors, and labors. Furthermore, to provide a more comprehensive overview of users’ activity in the different Reddit communities/subreddits, a knowledge map associating subreddits and concepts based on their conceptual similarities is also established. The analysis shows that the descriptions of the examined subreddits are strongly related to their underlying concepts. At the same time, the outcomes also assess the conceptual associations between the different subreddits, offering knowledge related to similar and distant communities. Ultimately, two post metrics are utilized to explore how the concepts may impact user interactions. This allows us to differentiate between concepts associated with posts typically endorsed by communities, resulting in increased information exchange (via comments), or contributing as news/announcements. 
Overall, the findings of this study can be used as a knowledge basis in determining user interests, opinions, perspectives, and responsiveness, when it comes to cyber threats, attacks, and malicious activities. Also, the respective outcomes can contribute as a guide for identifying similar communities/subreddits and themes. Regarding the methodological contributions of this study, the proposed framework can be adapted to similar datasets and research goals as it does not depend on the special characteristics of the imported data, offering, in turn, a practical approach for future research.


