Unsupervised field segmentation of unknown protocol messages

Sun, Fanghui; Shen, Wang; Zhang, Chunrui; Zhang, Hongli

doi:10.1016/j.comcom.2019.06.013

Cited by 15 publications

(4 citation statements)

References 12 publications

“…For example, ASAP [17] maps the message payloads to the vector space by constructing the marked letters derived from the separator and n-gram, and uses matrix factorization [34] to identify the basic direction and coordinate tuples to cluster different protocol messages. Sun et al [25] defines Token Format Distance (TFD) and Message Format Distance (MFD) by introducing basic rules of Augmented Backus Naur Form (ABNF) [35] to calculate protocol message distances, then uses the DBSCAN algorithm [36] to cluster protocol messages, and uses Silhouette Coefficient and Dunn Validity Index [37] to determine the best clustering parameters to improve the quality of clustering performance.…”

Section: Related Work (mentioning)

confidence: 99%
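
The statement above mentions using the Silhouette Coefficient to choose clustering parameters. As a minimal sketch of that idea only (not the cited authors' implementation), the following pure-Python function scores a labeling against a precomputed distance matrix. The names `silhouette`, `dist`, `good`, and `bad` are hypothetical, the toy distances stand in for a message distance such as MFD, and the handling of DBSCAN noise points is left out.

```python
from statistics import mean


def silhouette(dist, labels):
    """Mean silhouette coefficient for a precomputed distance matrix.

    dist[i][j] is the distance between messages i and j, and labels[i]
    is the cluster id of message i. Members of singleton clusters score
    0 by convention.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = mean(dist[i][j] for j in own)
        # b: mean distance to the nearest other cluster
        b = min(
            mean(dist[i][j] for j in members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Toy data (assumption): four "messages" placed on a line, so the
# distance is just the absolute difference of their positions.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]

good = [0, 0, 1, 1]  # groups the nearby points together
bad = [0, 1, 0, 1]   # mixes far-apart points

# The labeling with the higher score would be preferred.
best = max([good, bad], key=lambda labels: silhouette(dist, labels))
```

In a parameter search, each candidate setting of the clustering algorithm would produce one labeling, and the setting whose labeling scores highest would be kept.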

“…Therefore, in this paper, we mainly divide a variety of unknown protocol messages into different clusters, which will facilitate future protocol reverse work. Effective protocol message clustering necessitates the resolution of two critical issues: the measurement of protocol message distance and the design of clustering algorithm [25]. It is worth noting that the message distance is the basis of protocol clustering.…”

Section: Introduction (mentioning)

confidence: 99%

Unsupervised Binary Protocol Clustering Based on Maximum Sequential Patterns

Shi, Ye, Li et al. 2022

Computer Modeling in Engineering & Sciences

With the rapid development of the Internet, a large number of private protocols emerge on the network. However, some of them are constructed by attackers to avoid being analyzed, posing a threat to computer network security. The blockchain uses the P2P protocol to implement various functions across the network. Furthermore, the P2P protocol format of blockchain may differ from the standard format specification, which leads to sniffing tools such as Wireshark and Fiddler not being able to recognize them. Therefore, the ability to distinguish different types of unknown network protocols is vital for network security. In this paper, we propose an unsupervised clustering algorithm based on maximum frequent sequences for binary protocols, which can distinguish various unknown protocols to provide support for analyzing unknown protocol formats. We mine the maximum frequent sequences of protocol message sets in bytes. And we calculate the fuzzy membership of the protocol message to each maximum frequent sequence, which is based on fuzzy set theory. Then we construct the fuzzy membership vector for each protocol message. Finally, we adopt K-means++ to split different types of protocol messages into several clusters and evaluate the performance by calculating homogeneity, integrity, and Fowlkes and Mallows Index (FMI). Besides, the clustering algorithms based on Needleman-Wunsch and the fixed-length prefix are compared with the algorithm presented in this paper. Compared with these traditional clustering methods, we demonstrate a certain improvement in the clustering performance of our work.

“…Netzob [3], Discoverer [5], and others [15] deduce fields as a by-product of sequence alignment with the already mentioned disadvantages. Existing statistical methods either require an already existing segmentation [3,5,15] or expect field boundaries at globally fixed positions [2,26,27], limiting the applicability to protocols specifically designed without variable length fields. If meta-data and common offsets of values in messages are available, the task is as simple as finding the corresponding or correlating values in the messages.…”

Section: Related Work (mentioning)

confidence: 99%

Refining Network Message Segmentation with Principal Component Analysis

Kleber, Kargl 2022

2022 IEEE Conference on Communications and Network Security (CNS)

Reverse engineering of undocumented protocols is a common task in security analyses of networked services. The communication itself, captured in traffic traces, contains much of the necessary information to perform such a protocol reverse engineering. The comprehension of the format of unknown messages is of particular interest for binary protocols that are not human-readable. One major challenge is to discover probable fields in a message as the basis for further analyses. Given a set of messages, split into segments of bytes by an existing segmenter, we propose a method to refine the approximation of the field inference. We use principal component analysis (PCA) to discover linearly correlated variance between sets of message segments. We relocate the boundaries of the initial coarse segmentation to more accurately match with the true fields. We perform different evaluations of our method to show its benefit for the message format inference and subsequent analysis tasks from literature that depend on the message format. We can achieve a median improvement of the message format accuracy across different real-world protocols by up to 100%.

“…The general process consists of three phases: syntax inference, semantics inference, and state machine inference, which represents the order in which message types are transmitted. For semantics and state machine inference to be successful, syntax inference must be performed correctly, and accurate keyword extraction is crucial for accomplishing correct syntax inference [4–9]. In this paper, we use the term “keyword” to refer to a value that one field can have; accurate keyword extraction refers to the process of extracting values that exactly one field can have, not noise such as a combination of values from two or more fields or a portion of the value of one field.…”

Section: Introduction (mentioning)

confidence: 99%

A message keyword extraction approach by accurate identification of field boundaries

Goo, Shim, Lee et al. 2020

Int J Network Mgmt

Summary: With the recent exponential increase in internet speeds, the traditional network environment is evolving into a high‐capacity network environment. Network traffic usage is also increasing exponentially, as are new malicious behaviors and related applications. Most of these applications and malicious behaviors use unknown protocols for which the structure is inaccessible; hence, protocol reverse engineering is receiving increasing attention in the field of network management. Various approaches have been proposed, but they still suffer from misidentification of field boundaries. To understand message structures properly, it is important to identify accurately the boundaries of the fields constituting the protocol message; accurate keyword extraction based on this approach leads to the correct inference of message types, semantics, and state machine. In this study, we propose a message keyword extraction method using accurate identification of field boundaries from delimiter inference and statistical analysis. Through the identification of field boundaries, messages can be subdivided into fields. We evaluate the efficacy of the proposed method by applying it to several textual and binary protocols. The proposed method showed better results than did other previous studies for both textual and binary protocols.
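
The abstract above describes subdividing messages into fields once a delimiter has been inferred. The sketch below illustrates only that general idea for a textual protocol and is not the method of the cited paper. The heuristic (pick the candidate byte that occurs in every message and has the most occurrences overall), the candidate set, and the names `infer_delimiter` and `split_fields` are all assumptions made for illustration.

```python
def infer_delimiter(messages, candidates=b" ,;:|\t"):
    """Guess a field delimiter for a list of byte-string messages.

    Hypothetical heuristic: among the candidate bytes that occur in
    every message, return the one with the most occurrences overall.
    Returns None if no candidate occurs in all messages.
    """
    if not messages:
        return None
    best, best_total = None, 0
    for value in candidates:  # iterating over bytes yields ints
        delim = bytes([value])
        counts = [msg.count(delim) for msg in messages]
        if min(counts) == 0:
            continue  # not present in every message
        total = sum(counts)
        if total > best_total:
            best, best_total = delim, total
    return best


def split_fields(messages):
    """Split each message into fields at the inferred delimiter."""
    delim = infer_delimiter(messages)
    if delim is None:
        # No delimiter found: treat each message as a single field.
        return [[msg] for msg in messages]
    return [msg.split(delim) for msg in messages]


# Toy textual messages (assumption): command, then one argument.
messages = [b"USER alice", b"PASS secret", b"RETR file.txt"]
fields = split_fields(messages)

# Values seen in the first field are keyword candidates.
keywords = sorted({f[0] for f in fields})
```

A binary protocol would not usually have such a delimiter, which is why the abstract pairs delimiter inference with statistical analysis.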
