File format vulnerabilities have received growing attention in recent years, and the performance of fuzz testing relies heavily on knowledge of the target formats. In this paper, we present systematic algorithms and methods for automatically reverse engineering input file formats. The methodology employs dynamic taint analysis to reveal implicit relationships between the input file and the procedures of the binary that processes it; this information is then used to measure correlations among data bytes, to segment the format, and to infer data types. We have implemented a prototype, and in general tests on 10 widely published binary formats it achieved an average successful identification rate of over 85%, while revealing more detailed structural information than coarse-grained format analysis provides. In addition, we discuss a practical pseudo-fuzzing evaluation method that reflects real-world demands of security analysis, and the evaluation results demonstrate the practical effectiveness of our system.
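To make the idea of taint-based format segmentation concrete, the following minimal sketch groups input bytes into candidate fields. It is an illustration under stated assumptions, not the prototype's actual algorithm: it assumes a taint trace is already available as a mapping from each input byte offset to the set of instruction addresses that accessed that byte, and it uses a simple heuristic that treats consecutive offsets with identical instruction sets as one field. The function name `segment_by_taint` and the trace layout are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple


def segment_by_taint(trace: Dict[int, FrozenSet[int]]) -> List[Tuple[int, int]]:
    """Group consecutive byte offsets accessed by the same instruction set.

    `trace` is an assumed input: byte offset -> set of instruction
    addresses that touched that byte during a tainted run.
    Returns candidate fields as inclusive (start, end) offset pairs.
    """
    fields: List[Tuple[int, int]] = []
    start = prev = None
    for off in sorted(trace):
        # Extend the current field only if the offset is adjacent and
        # was handled by exactly the same instructions.
        if prev is not None and off == prev + 1 and trace[off] == trace[prev]:
            prev = off
            continue
        # Otherwise close the current field (if any) and open a new one.
        if start is not None:
            fields.append((start, prev))
        start = prev = off
    if start is not None:
        fields.append((start, prev))
    return fields


if __name__ == "__main__":
    # Hypothetical trace: a 4-byte field, a 2-byte field, and an isolated byte.
    demo = {
        0: frozenset({0x401000}), 1: frozenset({0x401000}),
        2: frozenset({0x401000}), 3: frozenset({0x401000}),
        4: frozenset({0x401020}), 5: frozenset({0x401020}),
        8: frozenset({0x401040}),
    }
    print(segment_by_taint(demo))
```

Exact set equality is a deliberately strict grouping criterion chosen for brevity; a correlation measure between the instruction sets of neighboring bytes could be substituted where the sets only partially overlap.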