2020
DOI: 10.2197/ipsjjip.28.623

Identification of Cybersecurity Specific Content Using Different Language Models

Abstract: Given the sheer amount of digital texts publicly available on the Internet, it becomes more challenging for security analysts to identify cyber threat related content. In this research, we proposed to build an autonomous system to identify cyber threat information from publicly available information sources. We examined different language models to utilize as a cybersecurity-specific filter for the proposed system. Using the domain-specific training data, we trained Doc2Vec and BERT models and compared their p…

Cited by 6 publications (3 citation statements)
References 17 publications
“…The implementation of cyber security-specific filters using text classification models has attracted some interest in recent years. A cyber security-related classification model [12] based on the BERT language representation model [13] reportedly achieved a precision of 0.92 and a recall of 0.90, in a test set comprising of multiple types of textual data, such as Reddit and Stack Exchange discussions, as well as more formal sources such as security news outlet RSS feeds. However, the part of their test set that contains no cyber security-related data consists only of Reddit discussions.…”
Section: Related Work (mentioning)
confidence: 99%
“…The next step was to decide the number of classes in which the articles would be classified. One option would be to consider two classes similarly to [12]. As though our goal is to select only those pages that contain information useful for CTI extraction purposes, we opted to use the following three classes.…”
Section: Data Collection and Annotation (mentioning)
confidence: 99%
“…For instance, it has been used in research (see e.g., [20,21,22,23]) to track the rapid growth of novel malware and understand the nature of threats posed by them. Similarly, it has been used as a means of extracting structured information (such as TTPs) from large volumes of Cyber Threat Intelligence (CTI) reports, and converting them into actionable knowledge that can be exploited by analysts for cyber defence (see e.g., [24,25,26]). Finally, the MITRE ATT&CK framework has also been utilised as part of a larger pipeline to help with detection tasks, such as when systems incorporate knowledge from TTPs to detect malicious behaviour (see e.g., [27,28,29,30]).…”
(Footnote 1 in the citing text: https://github.com/AlessandroZ/LaZagne)
Section: Introduction (mentioning)
confidence: 99%