Detecting Trending Terms in Cybersecurity Forum Discussions

Hughes, Jack; Aycock, Seth; Caines, Andrew; Buttery, Paula; Hutchings, Alice

doi:10.18653/v1/2020.wnut-1.15

Cited by 14 publications

(12 citation statements)

References 26 publications


“…There is significant interest surrounding the goal of being able to automate cybersecurity threat detection on social media [19,18,15,11,12,27,14,2]. Twitter, Reddit, and Stackexchange are popular forums from which several previous studies have gathered cybersecurity related documents [19,11,14,15,2,20] for the purpose of training machine learning detection systems and classifiers.…”

Section: Previous Work (mentioning)

confidence: 99%

“…Identifying cybersecurity discussions in open forums at scale is a topic of great interest for the purpose of mitigating and understanding modern cyber threats [12,19,22]. The challenge is that these discussions are typically quite noisy (i.e., they contain community known synonyms or acronyms or slang) and it is difficult to get labelled data in order to train resilient NLP (natural language processing) topic classifiers.…”

Section: Introduction (mentioning)

confidence: 99%


Untitled

2022

IJNSA


In this research, we use user-defined labels from three internet text sources (Reddit, StackExchange, arXiv) to train 21 different machine learning models for the topic classification task of detecting cybersecurity discussions in natural English text. We analyze the false positive and false negative rates of each of the 21 models in cross-validation experiments. Then we present a Cybersecurity Topic Classification (CTC) tool, which takes the majority vote of the 21 trained machine learning models as the decision mechanism for detecting cybersecurity-related text. We also show that the majority vote mechanism of the CTC tool provides lower false negative and false positive rates on average than any of the 21 individual models. We show that the CTC tool scales to hundreds of thousands of documents with a wall clock time on the order of hours.



Untitled

2022

IJNSA


“…There is significant interest surrounding the goal of being able to automate cybersecurity threat detection on social media [18,17,14,10,11,25,13,2]. Twitter, Reddit, and Stackexchange are popular forums from which several previous studies have gathered cybersecurity related documents [18,10,13,14,2,19] for the purpose of training machine learning detection systems and classifiers.…”

Section: Previous Work (mentioning)

confidence: 99%

“…There are several different approaches taken with which topic modelling task to use as a signal to detect cybersecurity discussions. Typically the topic classification task is related to training directly on labelled text and then perhaps developing an idea of the more relevant keywords in these discussions [18,11]. Other researchers use sentiment analysis in conjunction with other machine learning models [25,10].…”

Section: Previous Work (mentioning)

confidence: 99%

“…Identifying cybersecurity discussions in open forums at scale is a topic of great interest for the purpose of mitigating and understanding modern cyber threats [11,18,20]. The challenge is that often these discussions are quite noisy (i.e., they contain community known synonyms or acronyms) and difficult to get labelled data in order to train resilient NLP (natural language processing) topic classifiers.…”

Section: Introduction (mentioning)

confidence: 99%


An Enhanced Machine Learning Topic Classification Methodology for Cybersecurity

Pelofske, Liebrock, Urias

2021

Natural Language Processing


In this research, we use user-defined labels from three internet text sources (Reddit, Stackexchange, arXiv) to train 21 different machine learning models for the topic classification task of detecting cybersecurity discussions in natural text. We analyze the false positive and false negative rates of each of the 21 models in a cross-validation experiment. Then we present a Cybersecurity Topic Classification (CTC) tool, which takes the majority vote of the 21 trained machine learning models as the decision mechanism for detecting cybersecurity-related text. We also show that the majority vote mechanism of the CTC tool provides lower false negative and false positive rates on average than any of the 21 individual models. We show that the CTC tool scales to hundreds of thousands of documents with a wall clock time on the order of hours.
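The majority-vote decision mechanism described in this abstract can be illustrated with a minimal sketch. It is not the CTC tool itself: the names `majority_vote`, `keyword_model`, and `toy_models` are hypothetical, and the three keyword rules merely stand in for the 21 trained models, which the abstract does not specify.

```python
from typing import Callable, Sequence

# Assumption: a classifier is any function mapping text to True
# (cybersecurity-related) or False. The real models are not described here.
Classifier = Callable[[str], bool]


def majority_vote(classifiers: Sequence[Classifier], text: str) -> bool:
    """Return True when strictly more than half of the classifiers flag the text."""
    votes = sum(1 for clf in classifiers if clf(text))
    return votes * 2 > len(classifiers)


def keyword_model(keyword: str) -> Classifier:
    """Hypothetical stand-in model: flags text that contains one keyword."""
    return lambda text: keyword in text.lower()


# Three toy models for illustration only; the abstract reports 21 trained models.
toy_models = [keyword_model(k) for k in ("malware", "exploit", "phishing")]

print(majority_vote(toy_models, "New phishing kit drops malware"))
print(majority_vote(toy_models, "Best pasta recipes for the weekend"))
```

With an odd number of voters, such as the 21 models mentioned above, a strict majority always exists, so no tie-breaking rule is needed.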
