2013
DOI: 10.14778/2536360.2536368
Making queries tractable on big data with preprocessing

Abstract: A query class is traditionally considered tractable if there exists a polynomial-time (PTIME) algorithm to answer its queries. When it comes to big data, however, PTIME algorithms often become infeasible in practice. A traditional and effective approach to coping with this is to preprocess data off-line, so that queries in the class can be subsequently evaluated on the data efficiently. This paper aims to provide a formal foundation for this approach in terms of computational complexity. (1) We propose a set o…

Cited by 36 publications (54 citation statements)
References 33 publications
“…For instance, when Q is in CQ and constraints in Σ are expressed in CQ, RCDP is NEXPTIME-complete, while QDSI is Σ^p_3-complete. There has also been recent work on querying big data, e.g., on the communication complexity of parallel query evaluation [17,18], the complexity of query processing in terms of MapReduce rounds [2,30], and the study of query classes that are tractable on big data [13]. In contrast, this work studies whether it is feasible to compute query answers in big data by accessing a small subset of the data, and if so, how to efficiently identify this subset.…”
Section: Sufficient Conditions For Scale Independence
confidence: 99%
“…al. [9], [10] formally define the concept of query practicality through their definition of Π-tractability. A query Q is Π-tractable if there exists a polynomial-time algorithm to transform a data set D into D′ such that Q(D′) can be computed in polylog-time (O((log n)^k) for some constant k).…”
Section: Query Practicality
confidence: 99%
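To make the Π-tractability definition quoted above concrete, here is a minimal sketch (not taken from the cited papers) of the preprocess-then-query pattern it formalizes: an offline PTIME step transforms D into D′, after which each query over D′ runs in polylog time instead of scanning D. The function names and the choice of a membership query are illustrative assumptions.

```python
from bisect import bisect_left

def preprocess(dataset):
    """Offline PTIME step: transform D into D' (here, a sorted list).

    Sorting costs O(n log n), a one-time cost paid before any
    queries arrive, as the Pi-tractability definition allows.
    """
    return sorted(dataset)

def answer_membership_query(d_prime, value):
    """Online step: evaluate Q(D') in polylog time.

    Binary search over the preprocessed data answers a membership
    query in O(log n), i.e., within the O((log n)^k) bound with k = 1.
    """
    i = bisect_left(d_prime, value)
    return i < len(d_prime) and d_prime[i] == value

# Illustrative usage with hypothetical data:
d_prime = preprocess([42, 7, 19, 3, 88])
print(answer_membership_query(d_prime, 19))  # True
print(answer_membership_query(d_prime, 20))  # False
```

The same pattern generalizes to richer preprocessing (indexes, materialized views, synopses); the point of the definition is that only the online, per-query cost must stay polylogarithmic in the size of the data.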
“…To this end, we propose a notion of BD-tractable queries [8], to help us determine what queries are tractable or feasible on big data.…”
Section: Querying Big Data
confidence: 99%
“…The revisions are defined in terms of computational costs [8], communication (coordination) rounds [34][35], or MapReduce steps [31] and data shipments [36] in the MapReduce framework [37]. Our notions of BD-tractability focus on computational costs [8]. The study is still preliminary, and a number of questions remain open.…”
Section: Open Issues
confidence: 99%