Paraschos Koutris scite author profile

Data is increasingly being bought and sold online, and Web-based marketplace services have emerged to facilitate these activities. However, current mechanisms for pricing data are very simple: buyers can choose only from a set of explicit views, each with a specific price. In this paper, we propose a framework for pricing data on the Internet that, given the price of a few views, allows the price of any query to be derived automatically. We call this capability "query-based pricing." We first identify two important properties that the pricing function must satisfy, called arbitrage-free and discount-free. Then, we prove that there exists a unique function that satisfies these properties and extends the seller's explicit prices to all queries. When both the views and the query are Unions of Conjunctive Queries, the complexity of computing the price is high. To ensure tractability, we restrict the explicit prices to be defined only on selection views (which is the common practice today). We give an algorithm with polynomial time data complexity for computing the price of any chain query by reducing the problem to network flow. Furthermore, we completely characterize the class of Conjunctive Queries without self-joins that have PTIME data complexity (this class is slightly larger than chain queries), and prove that pricing all other queries is NP-complete, thus establishing a dichotomy on the complexity of the pricing problem when all views are selection queries.

Skew in parallel query processing

Beame, Koutris, Suciu

2014

107

We study the problem of computing a conjunctive query q in parallel, using p servers, on a large database. We consider algorithms with one round of communication, and study the complexity of the communication. We are especially interested in the case where the data is skewed, which is a major challenge for scalable parallel query processing. We establish a tight connection between the fractional edge packing of the query and the amount of communication in two cases. First, in the case when the only statistics on the database are the cardinalities of the input relations, and the data is skew-free, we provide matching upper and lower bounds (up to a polylogarithmic factor of p) expressed in terms of fractional edge packings of the query q. Second, in the case when the relations are skewed and the heavy hitters and their frequencies are known, we provide upper and lower bounds expressed in terms of packings of residual queries obtained by specializing the query to a heavy hitter. All our lower bounds are expressed in the strongest form, as the number of bits that must be communicated between processors with unlimited computational power. Our results generalize prior results on uniform databases (where each relation is a matching) [4], and lower bounds for the MapReduce model [1].
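As an illustration of the fractional edge packing mentioned in this abstract, the following is a minimal sketch, not the authors' method. It assumes the standard definition: each atom of the query gets a non-negative weight, and for every variable the weights of the atoms containing it sum to at most 1. The function names and the representation of a query as a list of variable sets are hypothetical choices made for this example. The maximum total weight is found by brute-force enumeration of the vertices of the constraint polytope with exact rational arithmetic, which is only practical for very small queries.

```python
from fractions import Fraction
from itertools import combinations


def solve_square(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination.

    Returns None when A is singular.
    """
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return None
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [v / pv for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def max_fractional_edge_packing(atoms):
    """Return (value, weights) of a maximum fractional edge packing.

    atoms: list of sets of variable names, one set per atom of the query
    (a hypothetical encoding chosen for this sketch).
    """
    m = len(atoms)
    variables = sorted(set().union(*atoms))
    # Each constraint is (row, rhs), meaning row . u <= rhs.
    # One packing constraint per variable, then u_j >= 0 written as -u_j <= 0.
    cons = [([1 if x in a else 0 for a in atoms], 1) for x in variables]
    cons += [([-1 if i == j else 0 for i in range(m)], 0) for j in range(m)]

    best_val, best_u = None, None
    # A vertex of the polytope makes m linearly independent constraints tight.
    for tight in combinations(cons, m):
        u = solve_square([row for row, _ in tight], [rhs for _, rhs in tight])
        if u is None:
            continue
        feasible = all(
            sum(c * w for c, w in zip(row, u)) <= rhs for row, rhs in cons
        )
        if feasible and (best_val is None or sum(u) > best_val):
            best_val, best_u = sum(u), u
    return best_val, best_u


# Triangle query q(x, y, z) = R(x, y), S(y, z), T(z, x)
triangle = [{"x", "y"}, {"y", "z"}, {"z", "x"}]
value, weights = max_fractional_edge_packing(triangle)
print(value, weights)
```

For the triangle query the three packing constraints add up to 2(u_R + u_S + u_T) ≤ 3, so the maximum is 3/2, attained by giving each atom weight 1/2.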

Communication Steps for Parallel Query Processing

Beame, Koutris, Suciu

2017

J. ACM

We consider the problem of computing a relational query q on a large input database of size n, using a large number p of servers. The computation is performed in rounds, and each server can receive only O(n/p^(1−ε)) bits of data, where ε ∈ [0, 1] is a parameter that controls replication. We examine how many global communication steps are needed to compute q. We establish both lower and upper bounds, in two settings. For a single round of communication, we give lower bounds in the strongest possible model, where arbitrary bits may be exchanged; we show that any algorithm requires ε ≥ 1 − 1/τ*, where τ* is the fractional vertex cover of the hypergraph of q. We also give an algorithm that matches the lower bound for a specific class of databases. For multiple rounds of communication, we present lower bounds in a model where routing decisions for a tuple are tuple-based. We show that for the class of tree-like queries there exists a tradeoff between the number of rounds and the space exponent ε. The lower bounds for multiple rounds are the first of their kind. Our results also imply that transitive closure cannot be computed in O(1) rounds of communication.

[...] than main memory access. In addition, any data reshuffling requires a global synchronization of all servers, which also comes at significant cost; for example, everyone needs to wait for the slowest server, and, worse, in the case of a straggler, or a local node failure, everyone must wait for the full recovery. Thus, the dominating complexity parameters in big data query processing are the number of communication steps, and the amount of data being exchanged.

MapReduce-related models. Several computation models have been proposed in order to understand the power of MapReduce and related massively parallel programming methods [9, 16, 17, 1]. These all identify the number of communication steps/rounds as a main complexity parameter, but differ in their treatment of the communication.

The first of these models was the MUD (Massive, Unordered, Distributed) model of Feldman et al. [9]. It takes as input a sequence of elements and applies a binary merge operation repeatedly, until obtaining a final result, similarly to a User Defined Aggregate in database systems. The paper compares MUD with streaming algorithms: a streaming algorithm can trivially simulate MUD, and the converse is also possible if the merge operators are computationally powerful (beyond PTIME).

Karloff et al. [16] define MRC, a class of multi-round algorithms based on using the MapReduce primitive as the sole building block, and fixing specific parameters for balanced processing. The number of processors p is Θ(N^(1−ε)), and each can exchange MapReduce outputs expressible in Θ(N^(1−ε)) bits per step, resulting in Θ(N^(2−2ε)) total storage among the processors on a problem of size N. Their focus was algorithmic, showing simulations of other parallel models by MRC, as well as the power of two-round algorithms for specific problems.

Lower bounds for the single-round MapReduce model are first discussed by Afrati et al....
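To make the one-round bound ε ≥ 1 − 1/τ* from the abstract above concrete, here is a small sketch that evaluates it for a given τ* and plugs the result into the per-server budget n/p^(1−ε). The function names are hypothetical, the constant hidden in the O(·) is ignored, and the value τ* = 3/2 for the triangle query is supplied by hand rather than computed.

```python
from fractions import Fraction


def space_exponent_lower_bound(tau_star):
    """Smallest space exponent permitted by the bound eps >= 1 - 1/tau*."""
    tau_star = Fraction(tau_star)
    if tau_star <= 0:
        raise ValueError("tau* must be positive")
    return 1 - 1 / tau_star


def per_server_load(n, p, eps):
    """Per-server budget n / p^(1 - eps), dropping the constant in O(.)."""
    return n / p ** (1 - float(eps))


# Triangle query: fractional vertex cover tau* = 3/2 (weight 1/2 per variable).
eps = space_exponent_lower_bound(Fraction(3, 2))
print(eps)                             # 1/3
print(per_server_load(6400, 64, eps))  # approximately 400 (floating point)
```

With ε = 0 every server would receive about n/p bits, that is, no replication. For the triangle query the bound forces ε ≥ 1/3, so with p = 64 servers each one must be allowed roughly n/16 bits instead of n/64.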
