2018
DOI: 10.14778/3236187.3236207

Filter before you parse

Abstract: Exploratory big data applications often run on raw unstructured or semi-structured data formats, such as JSON files or text logs. These applications can spend 80-90% of their execution time parsing the data. In this paper, we propose a new approach for reducing this overhead: apply filters on the data's raw bytestream before parsing. This technique, which we call raw filtering, leverages the features of modern hardware and the high selectivity of queries found in many exploratory applications. With raw filteri…
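
The idea in the abstract, filtering the raw bytestream before parsing, can be illustrated with a minimal Python sketch. This is not the paper's implementation: it swaps in a plain byte-substring test for whatever hardware-level techniques the authors use, and the names `raw_filter_then_parse`, `needle`, and `predicate` are hypothetical. It assumes newline-delimited JSON records in which the searched value is not written with `\u` escapes; otherwise the byte test could wrongly reject a matching record.

```python
import json


def raw_filter_then_parse(lines, needle, predicate):
    """Yield parsed JSON records that satisfy `predicate`.

    `needle` (hypothetical parameter) is a byte string that every matching
    record is assumed to contain. Records without it are skipped before
    json.loads is called, so the parser only runs on likely matches.
    """
    for raw in lines:
        # Cheap byte-level test on the unparsed record. It may admit false
        # positives (the needle can appear in an unrelated field).
        if needle not in raw:
            continue
        record = json.loads(raw)
        # Re-apply the exact predicate after parsing to drop false positives.
        if predicate(record):
            yield record


# Illustrative data only, not taken from the paper.
logs = [
    b'{"user": "alice", "status": "ok"}',
    b'{"user": "bob", "status": "error"}',
    b'{"user": "error-bot", "status": "ok"}',
]

matches = list(
    raw_filter_then_parse(logs, b"error", lambda r: r["status"] == "error")
)
```

In this sketch the first record is never parsed, the third passes the byte test but is rejected after parsing, and only the second is returned. The benefit depends on the query being selective: if most records contain the needle, the extra byte scan saves little.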

Cited by 49 publications (12 citation statements)
References 25 publications
“…This conjecture is corroborated by similar applications analyzing large JSON files or text logs. Two recent research studies estimated that they spend 80-90% of the execution time on parsing [32,46]. In comparison, Freeman et al [21] put this estimate to less than 40-70% when analyzing time-series data in binary encoding.…”
Section: Data Encoding: Text (mentioning)
confidence: 99%
“…Two standard data formats, Parquet [6] and Avro [7], have been developed from these projects and adopted as their default data persistence formats. Although not intended for this comparison, circumstantial evidence in [46] indicates that parsing data formatted in Parquet and Avro can be at least one order of magnitude faster than parsing the same data formatted in JSON.…”
Section: Related Work (mentioning)
confidence: 99%
“…For both structured and semi-structured data, parsers such as FAD.js, Mison or SIMD-JSON use modern CPU properties for fast reads [12,27,37,39]. Raw filters are used to speed up the parsing and reduce the amount of data ingested into the database [48,63].…”
Section: Related Work (mentioning)
confidence: 99%
“…Such NICs are referred to as SmartNICs. Several preliminary investigations of SmartNIC technologies have demonstrated potential benefits for offloading networking stacks [2,10,30,31,32], network functions [3,18,25,43,4], key-value stores [7,26,28], packet schedulers [44], neural networks [42], and beyond [21,38]. Despite the increasing relevance of (smart) NICs in today's systems, very few studies have focused on dissecting the performance of SmartNICs, comparing them with their predecessors, and providing guidelines for deploying NIC-offloaded applications, with a focus on packet classification.…”
Section: Introduction (mentioning)
confidence: 99%