Quickly generating billion-record synthetic databases

Gray, Jim; Sundaresan, Prakash; Englert, Susanne; Bacławski, Kenneth; Weinberger, P.

doi:10.1145/191839.191886

Cited by 203 publications

(60 citation statements)

References 8 publications


“…We in the computer science community have traditionally focused on scaling in size: how to efficiently manipulate large disk-bound data via suitable data structures [213], how to scale to databases of petabytes [114], synthesize massive data sets [115], etc. However, far less attention has been given to benchmarking, studying performance of systems under rapid updates with near-real time analyses.…”

Section: The Data Stream Phenomenon (mentioning)

confidence: 99%

Data Streams: Algorithms and Applications

Muthukrishnan

2005

FNT in Theoretical Computer Science

768

541


In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudo-random computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [175].

“…We need to incorporate both the true data via r_i/W as well as our most pessimistic belief of the underlying skew. As a pessimistic prior, we choose the highly skewed Grays selfsimilar distribution [20], often used for the 80/20 rule. Only if we find a sequence which can not be explained (with more than 1% chance) with the 80/20 distribution, we believe we have encountered list walking.…”

Section: A Detecting Lists (mentioning)

confidence: 99%

Crowdsourced enumeration queries

Trushkowsky, Kraska, Franklin, et al.

2013

2013 IEEE 29th International Conference on Data Engineering (ICDE)


Abstract: Hybrid human/computer database systems promise to greatly expand the usefulness of query processing by incorporating the crowd for data gathering and other tasks. Such systems raise many implementation questions. Perhaps the most fundamental question is that the closed world assumption underlying relational query semantics does not hold in such systems. As a consequence the meaning of even simple queries can be called into question. Furthermore, query progress monitoring becomes difficult due to non-uniformities in the arrival of crowdsourced data and peculiarities of how people work in crowdsourcing systems. To address these issues, we develop statistical tools that enable users and systems developers to reason about query completeness. These tools can also help drive query execution and crowdsourcing strategies. We evaluate our techniques using experiments on a popular crowdsourcing platform.
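The citation statement for this entry refers to a highly skewed self-similar distribution used for the 80/20 rule. The sketch below shows one common way such a generator is written: a uniform draw is raised to the power log(h)/log(1-h), so that the first h*n integers receive roughly a (1-h) share of the draws. The function names, parameters, and seed handling are assumptions made for illustration; they are not taken from the cited papers.

```python
import math
import random


def self_similar(n, h, u):
    """Map a uniform draw u in [0, 1) to an integer in 1..n.

    With h = 0.2, about 80% of the results land in the first 20% of
    the range (the 80/20 rule). Hypothetical helper, for illustration.
    """
    exponent = math.log(h) / math.log(1.0 - h)
    return 1 + int(n * u ** exponent)


def sample(n, h, count, seed=0):
    """Draw `count` skewed values using a seeded generator (assumed API)."""
    rng = random.Random(seed)
    return [self_similar(n, h, rng.random()) for _ in range(count)]


if __name__ == "__main__":
    draws = sample(n=1000, h=0.2, count=20000)
    share = sum(d <= 200 for d in draws) / len(draws)
    print(f"share of draws in the first 20% of the range: {share:.3f}")
```

The same skew applies recursively inside the hot range, which is why the distribution is called self-similar.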


“…An important milestone was the paper by Gray et al [12], the authors showed how to generate data sets with different distributions and dense unique sequences in linear time and in parallel. Fast, parallel generation of data with special distribution characteristics is the foundation of our data generation approach.…”

Section: Related Work (mentioning)

confidence: 99%

A Data Generator for Cloud-Scale Benchmarking

Rabl, Frank, Sergieh, et al.

2011

Performance Evaluation, Measurement and Characterization of Complex Systems


Abstract. In many fields of research and business data sizes are breaking the petabyte barrier. This imposes new problems and research possibilities for the database community. Usually, data of this size is stored in large clusters or clouds. Although clouds have become very popular in recent years, there is only little work on benchmarking cloud applications. In this paper we present a data generator for cloud sized applications. Its architecture makes the data generator easy to extend and to configure. A key feature is the high degree of parallelism that allows linear scaling for arbitrary numbers of nodes. We show how distributions, relationships and dependencies in data can be computed in parallel with linear speed up.
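The citation statement for this entry mentions generating dense unique sequences in linear time and in parallel. One well-known way to do this is to walk the multiplicative group modulo a prime p > n: successive powers of a primitive root g visit every value in 1..p-1 exactly once, and values above n are skipped. Each worker can jump straight to its own starting position with modular exponentiation. The sketch below illustrates that idea under those assumptions; the text here does not confirm that it matches the cited papers' exact method, and all function names are hypothetical.

```python
import math


def _is_prime(m):
    return m >= 2 and all(m % d for d in range(2, math.isqrt(m) + 1))


def _prime_factors(m):
    factors, d = set(), 2
    while d * d <= m:
        while m % d == 0:
            factors.add(d)
            m //= d
        d += 1
    if m > 1:
        factors.add(m)
    return factors


def pick_prime_and_generator(n):
    """Return the smallest prime p > n and a primitive root g modulo p."""
    p = n + 1
    while not _is_prime(p):
        p += 1
    factors = _prime_factors(p - 1)
    g = 2
    # g is a primitive root iff g^((p-1)/q) != 1 for every prime q | p-1.
    while any(pow(g, (p - 1) // q, p) == 1 for q in factors):
        g += 1
    return p, g


def worker_slice(n, p, g, worker, workers):
    """Values in 1..n produced by one worker's share of the cycle."""
    start = worker * (p - 1) // workers
    stop = (worker + 1) * (p - 1) // workers
    x = pow(g, start, p)  # jump directly to this worker's start
    out = []
    for _ in range(stop - start):
        if x <= n:  # skip the few values that fall outside 1..n
            out.append(x)
        x = x * g % p
    return out


if __name__ == "__main__":
    n, workers = 1000, 4
    p, g = pick_prime_and_generator(n)
    parts = [worker_slice(n, p, g, k, workers) for k in range(workers)]
    print(sum(len(part) for part in parts), "unique keys generated")
```

Because the slices partition one cycle, the workers need no coordination, and their combined output is a permutation of 1..n. The trial-division helpers are only suitable for small n; a billion-record run would need a faster primality test and factorization.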
