Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Softw 2020
DOI: 10.1145/3368089.3409679

Mining input grammars from dynamic control flow

Abstract: A program is characterized by its input model, and a formal input model can be of use in diverse areas including vulnerability analysis, reverse engineering, fuzzing and software testing, clone detection and refactoring. Unfortunately, input models for typical programs are often unavailable or out of date. While there exist algorithms that can mine the syntactical structure of program inputs, they either produce unwieldy and incomprehensible grammars, or require heuristics that target specific parsing patterns…

Cited by 38 publications (9 citation statements)
References 52 publications
“…This work is the first step in combining a data-flow analysis (e.g., the AUTOGRAM grammar extraction algorithm [17]) with control-flow analysis (e.g., the Mimid grammar extraction algorithm [15]), and extending the state of the art by incorporating compositional type information from labeled ground truth input. Existing approaches are limited to extracting context-free grammars from formats that have little or no dependent types.…”
Section: Initial Results (citation type: mentioning)
confidence: 99%
“…The big cost of our approach is the necessity of a formal grammar for both parsing and producing, a cost that can boil down to 1-2 programmer days if a formal grammar is already part of the system (say, as an input file for parser generators), but also extend to weeks if it is not. In the future, we will be experimenting with approaches that mine grammars from input samples and programs [65], [66] with the goal of extending the resulting grammars with probabilities for probabilistic fuzzing.…”
Section: Discussion (citation type: mentioning)
confidence: 99%
“…Mining input structures [64], as exemplified using the above GLADE [60] and Learn&Fuzz [61] approaches, may assist in this task. AUTOGRAM [65] and MIMID [66] mine human-readable input grammars exploiting structure and identifiers of a program processing the input, which makes them particularly promising.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…BIEBER also infers the structure of binary formats, but extends this to handle more complex features, such as different WAV/BMP types and variable-length segments. Gopinath et al. [2020] learn context-free grammars by tracking accesses to the input buffer, as well as the control-flow of the original program, which is assumed to be a stack-based recursive-descent parser. BIEBER's inferrable file formats (Section 2.1) are not context-free.…”
Section: Language and Program Inference (citation type: mentioning)
confidence: 99%
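
The idea in the last statement (track which parser function is active when each input character is read, then read grammar rules off the resulting call tree) can be illustrated with a small sketch. This is not the Mimid algorithm itself: it uses a hand-instrumented toy parser, ignores loops and conditionals inside functions, and performs no generalization. All names here (`Tracer`, `parse_expr`, `parse_term`, `parse_digits`, `mine_grammar`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class Tracer:
    """Wraps the input and records a call tree.

    Each tree node is [name, children]; a child is either another node
    or a single input character consumed while that call was on top of
    the stack.
    """

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.root = ["<start>", []]
        self.stack = [self.root]

    def enter(self, name):
        node = [name, []]
        self.stack[-1][1].append(node)
        self.stack.append(node)

    def leave(self):
        self.stack.pop()

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def next(self):
        ch = self.text[self.pos]
        self.stack[-1][1].append(ch)  # attribute the access to the active call
        self.pos += 1
        return ch


# A toy recursive-descent parser (assumed for illustration):
#   expr := term ('+' term)*     term := '(' expr ')' | digits
def parse_expr(t):
    t.enter("<expr>")
    parse_term(t)
    while t.peek() == "+":
        t.next()
        parse_term(t)
    t.leave()


def parse_term(t):
    t.enter("<term>")
    if t.peek() == "(":
        t.next()
        parse_expr(t)
        if t.peek() != ")":
            raise SyntaxError("expected ')'")
        t.next()
    else:
        parse_digits(t)
    t.leave()


def parse_digits(t):
    t.enter("<digits>")
    while t.peek().isdigit():
        t.next()
    t.leave()


def mine_rules(node, rules):
    """Turn every call-tree node into one rule: name -> sequence of child symbols."""
    name, children = node
    expansion = tuple(c if isinstance(c, str) else c[0] for c in children)
    rules[name].add(expansion)
    for child in children:
        if not isinstance(child, str):
            mine_rules(child, rules)
    return rules


def mine_grammar(inputs):
    rules = defaultdict(set)
    for text in inputs:
        tracer = Tracer(text)
        parse_expr(tracer)
        mine_rules(tracer.root, rules)
    return rules


grammar = mine_grammar(["1+(2)"])
for nonterminal, expansions in grammar.items():
    print(nonterminal, "->", sorted(expansions))
```

Because the sketch records only function calls, the mined `<digits>` rule merely enumerates the digit strings it has seen; recovering repetition and alternation from loops and branches is the part that requires tracking control flow inside each function.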