1994
DOI: 10.1145/191843.191886

Quickly generating billion-record synthetic databases

Abstract: Evaluating database system performance often requires generating synthetic databases: ones having certain statistical properties but filled with dummy information. When evaluating different database designs, it is often necessary to generate several databases and evaluate each design. As database sizes grow to terabytes, generation often takes longer than evaluation. This paper presents several database generation techniques. In particular it discusses: (1) Parallelism to get generation speedup and scaleup. (2) …

Citations: cited by 138 publications (126 citation statements)
References: 7 publications
“…To generate the Zipfian distribution, our system utilizes YCSB tool [14], which employs the algorithm for generating a Zipfian-distributed sequence from Gray et al [17]. In our case, we use a clustered distribution: i.e., the popular items are clustered together towards 0 (smaller values are more popular).…”
Section: Data and Workload Generator (mentioning)
confidence: 99%
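
The statement above refers to the Zipfian generator from Gray et al. as adopted by YCSB. A minimal sketch of that style of generator follows, assuming the commonly reproduced form of the algorithm (two special-cased head ranks, then a closed-form approximation for the tail). The names `zeta` and `ZipfGenerator`, the 0-based ranks, and the default `theta=0.99` (YCSB's usual skew constant) are illustrative assumptions, not code taken from the paper or from YCSB.

```python
import random
from collections import Counter


def zeta(n, theta):
    """Generalized harmonic number: sum of 1 / i**theta for i = 1..n."""
    return sum(1.0 / (i ** theta) for i in range(1, n + 1))


class ZipfGenerator:
    """Hypothetical helper: draws 0-based ranks in [0, n); rank 0 is the most popular."""

    def __init__(self, n, theta=0.99, seed=None):
        # Assumption of this sketch: n >= 3 avoids a zero denominator in eta,
        # and 0 < theta < 1 keeps alpha finite.
        if n < 3 or not 0.0 < theta < 1.0:
            raise ValueError("this sketch assumes n >= 3 and 0 < theta < 1")
        self.n = n
        self.theta = theta
        self.rng = random.Random(seed)
        # Constants are computed once, so each draw costs O(1).
        self.zetan = zeta(n, theta)
        self.alpha = 1.0 / (1.0 - theta)
        self.eta = (1.0 - (2.0 / n) ** (1.0 - theta)) / (
            1.0 - zeta(2, theta) / self.zetan
        )

    def next(self):
        u = self.rng.random()
        uz = u * self.zetan
        if uz < 1.0:
            return 0  # most popular item
        if uz < 1.0 + 0.5 ** self.theta:
            return 1  # second most popular item
        # Closed-form approximation for the remaining ranks.
        return int(self.n * (self.eta * u - self.eta + 1.0) ** self.alpha)


gen = ZipfGenerator(1000, theta=0.99, seed=1)
counts = Counter(gen.next() for _ in range(20000))
print(counts.most_common(5))
```

Because popular ranks sit next to 0, this matches the "clustered" layout described in the quoted statement; a scattered layout would need an extra hashing step on the returned rank.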
“…We first generate a set of data nodes whose access frequencies follow the Zipf(l, h) distribution [31], where l is the mean and h increases with the skewness of the data. The size of a data node in terms of buckets is randomly selected from the range 10-20.…”
Section: Performance Evaluation (mentioning)
confidence: 99%
“…The access frequencies of the data items are generated based on the Zipf distribution [11]. In the Zipf distribution, the access frequencies of the data items follow the 80/20 rule that 80 percent clients are usually interested in 20 percent data items.…”
Section: Simulation Model (mentioning)
confidence: 99%
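
How closely a Zipf distribution follows the 80/20 rule mentioned in the statement above depends on the skew exponent and the number of items. The sketch below, assuming access probability proportional to 1 / rank**theta, computes the share of accesses that falls on the most popular fraction of items. The function name `top_share` and the parameter values are hypothetical and chosen only for illustration.

```python
def top_share(n, theta, fraction=0.2):
    """Share of total access probability held by the top `fraction` of n items."""
    # Assumed model: item at rank r (1-based) is accessed with weight 1 / r**theta.
    weights = [1.0 / (rank ** theta) for rank in range(1, n + 1)]
    top_k = max(1, int(n * fraction))
    return sum(weights[:top_k]) / sum(weights)


# Compare several skew exponents for the same item count.
for theta in (0.5, 0.8, 1.0, 1.2):
    print(theta, round(top_share(10_000, theta), 3))
```

With `theta = 0` the distribution is uniform and the top 20 percent of items receives exactly 20 percent of accesses; larger exponents concentrate more of the load on the head.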