Mining input grammars from dynamic control flow

Gopinath, Rahul; Mathis, Björn; Zeller, Andreas

doi:10.1145/3368089.3409679

Cited by 38 publications

(9 citation statements)

References 52 publications


“…This work is the first step in combining a data-flow analysis (e.g., the AUTOGRAM grammar extraction algorithm [17]) with control-flow analysis (e.g., the Mimid grammar extraction algorithm [15]), and extending the state of the art by incorporating compositional type information from labeled ground truth input. Existing approaches are limited to extracting context-free grammars from formats that have little or no dependent types.…”

Section: Initial Results (mentioning)

confidence: 99%

Research Report: ICARUS: Understanding De Facto Formats by Way of Feathers and Wax

Cowger

Lee

Schimanski

et al. 2020

2020 IEEE Security and Privacy Workshops (SPW)


When a data format achieves a significant level of adoption, the presence of multiple format implementations expands the original specification in often-unforeseen ways. This results in an implicitly defined, de facto format, which can create vulnerabilities in programs handling the associated data files. In this paper we present our initial work on ICARUS: a toolchain for dealing with the problem of understanding and hardening de facto file formats. We show the results of our work in progress in the following areas: labeling and categorizing a corpus of data format samples to understand accepted variations of a format; the detection of sublanguages within the de facto format using both entropy- and taint-tracking-based methods, as a means of breaking down the larger problem of learning how the grammar has evolved; grammar inference via reinforcement learning, as a means of tying together the learned sublanguages; and the defining of both safe subsets of the de facto grammar, as well as translations from unsafe regions of the de facto grammar into safe regions. Real-world data formats evolve as they find use in real-world applications, and a comprehensive ICARUS toolchain for understanding and hardening the resulting de facto formats can identify and address security risks arising from this evolution.


“…The big cost of our approach is the necessity of a formal grammar for both parsing and producing, a cost that can boil down to 1-2 programmer days if a formal grammar is already part of the system (say, as an input file for parser generators), but also extend to weeks if it is not. In the future, we will be experimenting with approaches that mine grammars from input samples and programs [65], [66] with the goal of extending the resulting grammars with probabilities for probabilistic fuzzing.…”

Section: Discussion (mentioning)

confidence: 99%

“…Mining input structures [64], as exemplified using the above GLADE [60] and Learn&Fuzz [61] approaches, may assist in this task. AUTOGRAM [65] and MIMID [66] mine human-readable input grammars exploiting structure and identifiers of a program processing the input, which makes them particularly promising.…”

Section: Related Work (mentioning)

confidence: 99%

Inputs From Hell:

Soremekun

Pavese

Havrikov

et al. 2022

IEEE Trans. Software Eng.

Self Cite


Grammars can serve as producers for structured test inputs that are syntactically correct by construction. A probabilistic grammar assigns probabilities to individual productions, thus controlling the distribution of input elements. Using the grammars as input parsers, we show how to learn input distributions from input samples, allowing us to create inputs that are similar to the sample; by inverting the probabilities, we can create inputs that are dissimilar to the sample. This allows for three test generation strategies: 1) "Common inputs": by learning from common inputs, we can create inputs that are similar to the sample; this is useful for regression testing. 2) "Uncommon inputs": learning from common inputs and inverting probabilities yields inputs that are strongly dissimilar to the sample; this is useful for completing a test suite with "inputs from hell" that test uncommon features, yet are syntactically valid. 3) "Failure-inducing inputs": learning from inputs that caused failures in the past gives us inputs that share similar features and thus also have a high chance of triggering bugs; this is useful for testing the completeness of fixes. Our evaluation on three common input formats (JSON, JavaScript, CSS) shows the effectiveness of these approaches. Results show that "common inputs" reproduced 96% of the methods induced by the samples. In contrast, for almost all subjects (95%), the "uncommon inputs" covered significantly different methods from the samples. Learning from failure-inducing samples reproduced all exceptions (100%) triggered by the failure-inducing samples and discovered new exceptions not found in any of the samples learned from.
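The idea in this abstract (learn production probabilities from samples, then invert them) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy grammar, the hand-written list of observed productions (standing in for a real parse of the samples), the add-one smoothing, and the inversion rule (reweighting each alternative in proportion to 1/p), which is not necessarily the scheme the paper uses.

```python
import random
from collections import Counter

# Hypothetical toy grammar: nonterminal -> list of alternatives (tuples of symbols).
GRAMMAR = {
    "<value>": [("<digit>",), ("[", "<value>", "]")],
    "<digit>": [("0",), ("1",), ("2",)],
}

# Productions assumed to have been observed while parsing the samples "0", "0", "[1]".
OBSERVED = [
    ("<value>", ("<digit>",)), ("<digit>", ("0",)),
    ("<value>", ("<digit>",)), ("<digit>", ("0",)),
    ("<value>", ("[", "<value>", "]")), ("<value>", ("<digit>",)), ("<digit>", ("1",)),
]

def learn_probabilities(grammar, used_productions):
    """Estimate a probability for each alternative from observed production uses."""
    counts = Counter(used_productions)
    probs = {}
    for nonterminal, alternatives in grammar.items():
        # Add-one smoothing (an assumption) keeps unseen alternatives reachable.
        weights = [counts[(nonterminal, alt)] + 1 for alt in alternatives]
        total = sum(weights)
        probs[nonterminal] = [w / total for w in weights]
    return probs

def invert(probs):
    """Assumed inversion: reweight each alternative proportionally to 1/p."""
    inverted = {}
    for nonterminal, ps in probs.items():
        weights = [1.0 / p for p in ps]
        total = sum(weights)
        inverted[nonterminal] = [w / total for w in weights]
    return inverted

def generate(grammar, probs, symbol="<value>", rng=random, depth=0, max_depth=8):
    """Expand `symbol` by sampling alternatives according to `probs`."""
    if symbol not in grammar:
        return symbol  # terminal symbol
    alternatives = grammar[symbol]
    if depth >= max_depth:
        # Force termination by choosing the shortest alternative.
        alt = min(alternatives, key=len)
    else:
        alt = rng.choices(alternatives, weights=probs[symbol], k=1)[0]
    return "".join(generate(grammar, probs, s, rng, depth + 1, max_depth) for s in alt)

common = learn_probabilities(GRAMMAR, OBSERVED)   # favors what the samples used
uncommon = invert(common)                         # favors what the samples avoided
print(generate(GRAMMAR, common), generate(GRAMMAR, uncommon))
```

Sampling with `common` tends to reproduce the shapes seen in the samples, while sampling with `uncommon` shifts weight toward rarely used alternatives; both stay syntactically valid because every string is derived from the same grammar.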


“…BIEBER also infers the structure of binary formats, but extends this to handle more complex features, such as different WAV/BMP types and variable-length segments. Gopinath et al. [2020] learn context-free grammars by tracking accesses to the input buffer, as well as the control-flow of the original program, which is assumed to be a stack-based recursive-descent parser. BIEBER's inferrable file formats (Section 2.1) are not context-free.…”
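As a rough illustration of the approach this statement attributes to Gopinath et al., the sketch below instruments a tiny recursive-descent parser so that every character read from the input is attributed to the parsing function active at that moment, and then reads grammar rules off the resulting call tree. This is a heavily simplified sketch and not the Mimid implementation: the parser, the `TrackedInput` wrapper, the `traced` decorator, and `mine_rules` are all hypothetical names introduced here, and loops and conditionals inside functions are not tracked.

```python
import functools

class TrackedInput:
    """Wraps the input and attributes each consumed character to the active parser frame."""
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.stack = []   # currently active parsing functions, innermost last
        self.root = None  # root of the recorded derivation tree

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def next(self):
        ch = self.text[self.pos]
        self.stack[-1]["children"].append(ch)  # record the input access
        self.pos += 1
        return ch

def traced(fn):
    """Record each call of a parsing function as a node in the derivation tree."""
    @functools.wraps(fn)
    def wrapper(inp):
        node = {"name": "<%s>" % fn.__name__, "children": []}
        if inp.stack:
            inp.stack[-1]["children"].append(node)
        else:
            inp.root = node
        inp.stack.append(node)
        try:
            return fn(inp)
        finally:
            inp.stack.pop()
    return wrapper

@traced
def parse_value(inp):
    if inp.peek() == "[":
        inp.next()
        parse_value(inp)
        if inp.peek() != "]":
            raise SyntaxError("expected ']'")
        inp.next()
    else:
        parse_digit(inp)

@traced
def parse_digit(inp):
    if not inp.peek().isdigit():
        raise SyntaxError("expected digit")
    inp.next()

def mine_rules(node, rules):
    """Turn each tree node into one grammar alternative for its function name."""
    expansion = tuple(c if isinstance(c, str) else c["name"] for c in node["children"])
    rules.setdefault(node["name"], set()).add(expansion)
    for child in node["children"]:
        if not isinstance(child, str):
            mine_rules(child, rules)
    return rules

inp = TrackedInput("[[7]]")
parse_value(inp)
rules = mine_rules(inp.root, {})
for name, alternatives in sorted(rules.items()):
    print(name, "::=", sorted(alternatives))
```

The nonterminal names come directly from the parser's function names, which is why grammars mined this way tend to be human-readable; generalizing terminals such as the single observed digit would require more samples or an additional generalization step.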

Section: Language and Program Inference (mentioning)

confidence: 99%

Inferring Drop-in Binary Parsers from Program Executions

Dang

Cambronero

Rinard

2021

Preprint


We present BIEBER (Byte-IdEntical Binary parsER), the first system to model and regenerate a full working parser from instrumented program executions. To achieve this, BIEBER exploits the regularity (e.g., header fields and array-like data structures) that is commonly found in file formats. Key generalization steps derive strided loops that parse input file data and rewrite concrete loop bounds with expressions over input file header bytes. These steps enable BIEBER to generalize parses of specific input files to obtain parsers that operate over input files of arbitrary size. BIEBER also incrementally and efficiently infers a decision tree that reads file header bytes to route input files of different types to inferred parsers of the appropriate type. The inferred parsers and decision tree are expressed in an intermediate language that is independent of the original program; separate backends (C and Perl in our prototype) can translate the intermediate representation into the same language as the original program (for a safer drop-in replacement), or automatically port to a different language. An empirical evaluation shows that BIEBER can successfully regenerate parsers for six file formats (waveform audio [1654 files], MT76x0 .BIN firmware containers [5 files], OS/2 1.x bitmap images [9 files], Windows 3.x bitmaps [9971 files], Windows 95/NT4 bitmaps [133 files], and Windows 98/2000 bitmaps [859 files]), correctly parsing 100% (≥ 99.98% when using standard held-out cross-validation) of the corresponding corpora. The regenerated parsers contain automatically inserted safety checks that eliminate common classes of errors such as memory errors. We find that BIEBER can help reverse-engineer file formats, because it automatically identifies predicates for the decision tree that relate to key semantics of the file format. We also discuss how BIEBER helped us detect and fix two new bugs in stb_image as well as independently rediscover and fix a known bug.
