In this paper, we present a new algorithm for estimating the size of equality join of multiple database tables. The proposed algorithm, Correlated Sampling, constructs a small space synopsis for each table, which can then be used to provide a quick estimate of the join size of this table with other tables subject to dynamically specified predicate filter conditions, possibly specified over multiple columns (attributes) of each table. This algorithm makes a single pass over the data and is thus suitable for streaming scenarios. We compare this algorithm analytically to two other previously known sampling approaches (independent Bernoulli Sampling and End-Biased Sampling) and to a novel sketch-based approach. We also compare these four algorithms experimentally and show that results fully correspond to our analytical predictions based on derived expressions for the estimator variances, with Correlated Sampling giving the best estimates in a large range of situations.
This paper describes enhanced subquery optimizations in Oracle relational database system. It discusses several techniques -- subquery coalescing, subquery removal using window functions, and view elimination for group-by queries. These techniques recognize and remove redundancies in query structures and convert queries into potentially more optimal forms. The paper also discusses novel parallel execution techniques, which have general applicability and are used to improve the scalability of queries that have undergone some of these transformations. It describes a new variant of antijoin for optimizing subqueries involved in the universal quantifier with columns that may have nulls. It then presents performance results of these optimizations, which show significant execution time improvements.
Traditional on-disk row major tables have been the dominant storage mechanism in relational databases for decades. Over the last decade, however, with explosive growth in data volume and demand for faster analytics, has come the recognition that a different data representation is needed. There is widespread agreement that in-memory column-oriented databases are best suited to meet the realities of this new world. Oracle 12c Database In-memory, the industry's first dual-format database, allows existing row major on-disk tables to have complementary in-memory columnar representations. The new storage format brings new data processing techniques and query execution algorithms and thus new challenges for the query optimizer. Execution plans that are optimal for one format may be sub-optimal for the other. In this paper, we describe the changes made in the query optimizer to generate execution plans optimized for the specific formatrow major or columnarthat will be scanned during query execution. With enhancements in several areasstatistics, cost model, query transformation, access path and join optimization, parallelism, and cluster-awarenessthe query optimizer plays a significant role in unlocking the full promise and performance of Oracle Database In-Memory. This section provides a brief introduction to Oracle DBIM; more details are in [13] and [18]. Oracle 12c Database In-Memory (DBIM) is a dual-format database where data from a table can reside in both columnar format in an in-memory column store and in row major format on disk. The in-memory columnar format speeds up analytic queries and the row major format is well-suited for answering OLTP queries. Note that scanning on-disk tables does not necessarily mean disk I/O; some or all of the blocks of the table may be cached in the row major buffer cache [4]. A dedicated in-memory column store called the In-Memory Area acts as the storage for columnar data. The in-memory area is a subset of the database shared global area (SGA).
Over the last few years, the information technology industry has witnessed revolutions in multiple dimensions. Increasing ubiquitous sources of data have posed two connected challenges to data management solutionsprocessing unprecedented volumes of data, and providing ad-hoc real-time analysis in mainstream production data stores without compromising regular transactional workload performance. In parallel, computer hardware systems are scaling out elastically, scaling up in the number of processors and cores, and increasing main memory capacity extensively. The data processing challenges combined with the rapid advancement of hardware systems has necessitated the evolution of a new breed of main-memory databases optimized for mixed OLTAP environments and designed to scale.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.