Evaluating database system performance often requires generating synthetic databases: ones having certain statistical properties but filled with dummy information. When evaluating different database designs, it is often necessary to generate several databases and evaluate each design. As database sizes grow to terabytes, generation often takes longer than evaluation. This paper presents several database generation techniques. In particular, it discusses: (1) Parallelism to get generation speedup and scaleup. (2) Congruential generators to get dense unique uniform distributions. (3) Special-case discrete logarithms to generate indices concurrently with the base table generation. (4) Modification of (2) to get exponential, normal, and self-similar distributions. The discussion is in terms of generating billion-record SQL databases using C programs running on a shared-nothing computer system consisting of a hundred processors with a thousand discs. The ideas apply to smaller databases, but large databases present the more difficult problems.
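As an illustration of technique (2), the following is a minimal sketch of how a multiplicative congruential generator can produce each key in 1..n exactly once, in a scrambled order. It is not taken from the paper: the function names `next_unique` and `fill_unique`, the parameters, and the toy values are all assumptions made for this example. It assumes p is a prime larger than n and g is a primitive root modulo p, so that repeated multiplication by g visits every value in 1..p-1 once. Values above n are simply skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: one step of x -> (g * x) mod p, repeated until the
 * result lands in 1..n. Assumes p is prime, g is a primitive root mod p,
 * and x is in 1..p-1 (so x never becomes 0). */
uint64_t next_unique(uint64_t x, uint64_t g, uint64_t p, uint64_t n)
{
    /* p < 2^32 keeps the product x * g from overflowing 64 bits. */
    assert(n >= 1 && n < p && g >= 1 && g < p && p < (1ULL << 32));
    assert(x >= 1 && x < p);
    do {
        x = (x * g) % p;
    } while (x > n); /* discard values outside the dense range 1..n */
    return x;
}

/* Hypothetical helper: fill out[0..n-1] with a permutation of 1..n,
 * starting the cycle from seed (which must be in 1..p-1). */
void fill_unique(uint64_t *out, uint64_t n, uint64_t g, uint64_t p,
                 uint64_t seed)
{
    uint64_t x = seed;
    for (uint64_t i = 0; i < n; i++) {
        x = next_unique(x, g, p, n);
        out[i] = x;
    }
}
```

With the toy values p = 11, g = 2, n = 8 and seed 1, `fill_unique` yields 2, 4, 8, 5, 7, 3, 6, 1: the values 10 and 9 occur in the cycle but are skipped because they exceed n. Each parallel generator process could in principle start from a different point in the same cycle, although how the paper partitions the work is not stated in this abstract.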
We describe the use of parallel execution techniques and measure the price of parallel execution in NonStop SQL/MP, a commercial parallel database system from Tandem Computers. NonStop SQL uses intra-operator parallelism to parallelize joins, groupings, and scans. Parallel execution consists of starting up several processes and communicating data between them. Our measurements show that (a) startup costs are negligible when processes are reused rather than created afresh, and (b) communication costs are significant: they may exceed the costs of operators such as scan, grouping, or join. We also show two counter-examples to the common intuition that parallel execution reduces response time at the expense of increased work: parallel execution may reduce work or may increase response time, depending on communication costs. All execution times reported in the paper are scaled. No inferences should be drawn about actual execution times. All query executions reported in the paper were created by bypassing the NonStop SQL optimizer. No inferences should be drawn about the behavior of the optimizer.
NonStop SQL is an implementation of ANSI/ISO SQL on Tandem Computer systems. In its second release, NonStop SQL transparently and automatically implements parallelism within an SQL statement. This parallelism allows query execution speed to increase almost linearly as processors and discs are added to the system (speedup). In addition, this parallelism can help jobs restricted to a fixed "batch window": when the job doubles in size, its elapsed processing time will not change if proportionately more equipment is available to process the job (scaleup). This paper describes the parallelism features of NonStop SQL and an audited benchmark that demonstrates these speedup and scaleup claims.
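The two terms can be written down compactly using the usual textbook definitions. The symbols are assumptions introduced here, not notation from the abstract: T(k, W) stands for the elapsed time of workload W on a system with k times the base equipment, and nW for a workload n times as large.

```latex
\[
\mathrm{speedup}(n) = \frac{T(1, W)}{T(n, W)},
\qquad
\mathrm{scaleup}(n) = \frac{T(1, W)}{T(n, nW)}
\]
```

Under these definitions, "almost linearly" corresponds to speedup(n) close to n, and an elapsed time that "will not change" when both job and equipment double corresponds to scaleup(2) close to 1.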