Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.
DOI: 10.1109/ssdm.2004.1311234

AutoPart: automating schema design for large scientific databases using data partitioning

Abstract: Database applications that use multi-terabyte datasets are becoming increasingly important for scientific fields such as astronomy and biology. To improve query execution performance, modern DBMS build indexes and materialized views on the wide tables that store experimental data. The replication of data in indexes and views, however, implies large amounts of additional storage space, and incurs high update costs as new experiments add or change large volumes of data. In this paper we explore automatic data pa…

Cited by 58 publications (68 citation statements)
References 16 publications (9 reference statements)
“…Microsoft's AutoAdmin finds sets of candidate attributes for individual queries and then attempts to merge them based on the entire workload [5]. The AutoPart tool identifies conflicting access patterns on tables and creates read-only vertical partitions from disjoint column subsets that are similar to our secondary indexes [36]. Further heuristics can then be applied to prune this candidate set or combine attributes into multi-attribute sets [5].…”
Section: Related Work
confidence: 99%
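The "disjoint column subsets" mentioned in the statement above can be sketched as a small workload-driven routine. This is a minimal illustration, not AutoPart's actual algorithm: the attribute names and workload are hypothetical, and two attributes share a fragment exactly when the same set of queries references both, which guarantees disjoint vertical partitions.

```python
from collections import defaultdict

def vertical_partitions(table_attrs, workload):
    """workload: one set of referenced attributes per query."""
    access = defaultdict(set)            # attribute -> ids of queries touching it
    for qid, attrs in enumerate(workload):
        for a in attrs & set(table_attrs):
            access[a].add(qid)
    groups = defaultdict(list)           # identical access pattern -> one fragment
    for a in table_attrs:
        groups[frozenset(access[a])].append(a)
    return [sorted(g) for g in groups.values()]

# Hypothetical astronomy-style table and three template queries.
attrs = ["id", "ra", "dec", "flux", "notes"]
workload = [{"id", "ra", "dec"}, {"id", "flux"}, {"ra", "dec"}]
print(vertical_partitions(attrs, workload))
```

Because every attribute lands in exactly one fragment, no column is replicated, which is the property the citing authors contrast with index- and view-based designs.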
“…Many of the existing techniques for automatic database partitioning, however, are tailored for large-scale analytical applications (i.e., data warehouses) [36,40]. These approaches are based on the notion of data declustering [28], where the goal is to spread data across nodes to maximize intra-query parallelism [5,10,39,49].…”
Section: Introduction
confidence: 99%
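The declustering idea referenced above, spreading data across nodes so a single query can be answered by all of them in parallel, can be sketched with a simple hash scheme. The node count and row format below are hypothetical, not from any cited system:

```python
def decluster(rows, key, n_nodes):
    """Spread rows across n_nodes by hashing the partitioning key."""
    nodes = [[] for _ in range(n_nodes)]
    for row in rows:
        nodes[hash(row[key]) % n_nodes].append(row)
    return nodes

# Six hypothetical rows land evenly on three nodes, so a full scan
# can proceed on all nodes at once (intra-query parallelism).
placement = decluster([{"k": i} for i in range(6)], "k", 3)
print([len(node) for node in placement])
```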
“…Then, in order to limit the search space they prune the set of candidates. Similar procedures are used in other works, such as AutoPart [15], which is focused on scientific workloads. In this case only vertical and categorical partitioning are considered.…”
Section: Effect Of Imbalance Factor and Data Correlation
confidence: 99%
“…BigTable [5] and PNUTS [7] use range-based partitioning on the keys, which is still too simple for our reference queries. In general, the complexity of scientific workloads makes it hard to design a good partitioning strategy manually, so automatic partitioning is preferred [15].…”
Section: Effect Of Imbalance Factor and Data Correlation
confidence: 99%
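Range-based partitioning on keys, as attributed above to BigTable and PNUTS, can be sketched with sorted split points where each key range is served by one partition. The split values are hypothetical:

```python
import bisect

def tablet_for(key, splits):
    """Return the index of the key range holding `key`.
    `splits` are sorted inclusive lower bounds: range i+1 starts at splits[i]."""
    return bisect.bisect_right(splits, key)

# Hypothetical split points give ranges (-inf, "g"), ["g", "p"), ["p", +inf).
splits = ["g", "p"]
print(tablet_for("a", splits), tablet_for("g", splits), tablet_for("z", splits))
```

A range lookup touches only the partitions overlapping the queried interval, which is why this scheme suits key scans but, as the citing authors note, can be too simple for complex scientific queries.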
“…DBProxy [4] observed that most applications issue template-based queries and these queries have the same structure that contains different string or numeric constraints. AutoPart [14] deals with large scientific databases where the continuous insertions limit the application of indexes and materialized views. For optimization purposes, their algorithm horizontally and vertically partitions the tables in the original large database according to a representative workload using a single node.…”
Section: Automated Physical Design Solutions
confidence: 99%
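The horizontal side of such workload-driven partitioning can be sketched by fragmenting rows along the selection predicates of a representative workload. This is a toy illustration, not AutoPart's actual cost-based procedure; the predicate and rows are hypothetical:

```python
def horizontal_partitions(rows, predicates):
    """Each row goes to the fragment of its first matching workload
    predicate; rows matching no predicate fall into a catch-all fragment."""
    fragments = [[] for _ in predicates] + [[]]
    for row in rows:
        for i, pred in enumerate(predicates):
            if pred(row):
                fragments[i].append(row)
                break
        else:
            fragments[-1].append(row)
    return fragments

# One hypothetical workload predicate splits four rows into two fragments.
rows = [{"year": y} for y in (1999, 2003, 2004, 2010)]
preds = [lambda r: r["year"] < 2004]
print([len(f) for f in horizontal_partitions(rows, preds)])  # → [2, 2]
```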