J. W. Zhang scite author profile

J. W. Zhang

2 Publications

0 Citation Statements Received

55 Citation Statements Given


Affiliations

National University of Singapore

Publications


A collaborative framework for tweaking properties in a synthetic dataset

2018


Researchers and developers use benchmarks to compare their algorithms and products. A database benchmark must have a dataset D. To be application-specific, this dataset D should be empirical. However, D may be too small, or too large, for the benchmarking experiments. D must, therefore, be scaled to the desired size. To ensure the scaled D̃ is similar to D, previous work typically specifies or extracts a fixed set of features F = {F1, F2, ..., Fn} from D, then uses F to generate synthetic data for D̃. However, this approach (D → F → D̃) becomes increasingly intractable as F gets larger, so a new solution is necessary. Different from existing approaches, this paper proposes ASPECT to scale D to enforce similarity. ASPECT first uses a size-scaler (S0) to scale D to D̃. Then the user selects a set of desired features F1, ..., Fn. For each desired feature Fk, there is a tweaking tool Tk that tweaks D̃ to make sure D̃ has the required feature Fk. ASPECT coordinates the tweaking of T1, ..., Tn to D̃, so Tn(···(T1(D̃))···) has the required features F1, ..., Fn. By shifting from D → F → D̃ to D → D̃ → F, data scaling becomes flexible. The user can customise the scaled dataset with their own interested features. Extensive experiments on real datasets show that ASPECT can enforce similarity in the dataset effectively and efficiently.
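The scale-then-tweak pipeline described in this abstract can be illustrated with a minimal Python sketch. Everything named here is hypothetical and chosen only for illustration: `size_scale` is a naive stand-in for the size-scaler S0, and `make_ids_unique` and `clamp_column` are toy tweaking tools, not tools from the paper. The sketch only composes the tools in order; it does not model how ASPECT coordinates them so that a later tool does not undo an earlier one.

```python
from functools import reduce
from typing import Callable, Dict, List

Row = Dict[str, int]
Dataset = List[Row]
Tool = Callable[[Dataset], Dataset]


def size_scale(d: Dataset, s: float) -> Dataset:
    """Naive stand-in for a size-scaler: cycle through rows until round(s * |D|) rows exist."""
    target = round(s * len(d))
    return [dict(d[i % len(d)]) for i in range(target)]


def make_ids_unique(col: str) -> Tool:
    """Hypothetical tweaking tool: renumber `col` so it is unique again after scaling."""
    def tool(d: Dataset) -> Dataset:
        return [{**row, col: i} for i, row in enumerate(d)]
    return tool


def clamp_column(col: str, lo: int, hi: int) -> Tool:
    """Hypothetical tweaking tool: force every value of `col` into [lo, hi]."""
    def tool(d: Dataset) -> Dataset:
        return [{**row, col: min(max(row[col], lo), hi)} for row in d]
    return tool


def apply_tools(d_scaled: Dataset, tools: List[Tool]) -> Dataset:
    """Compute Tn(...(T1(d_scaled))...) by applying the tools left to right."""
    return reduce(lambda data, tool: tool(data), tools, d_scaled)


d = [{"id": 0, "age": 17}, {"id": 1, "age": 42}, {"id": 2, "age": 130}]
d_scaled = size_scale(d, 2.0)
result = apply_tools(d_scaled, [make_ids_unique("id"), clamp_column("age", 18, 99)])
print(result)
```

Because each tool is an ordinary function from dataset to dataset, the user can pick any subset of tools, which mirrors the flexibility the abstract attributes to the D → D̃ → F ordering.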


Dscaler

Zhang

Tay

2016

Proc. VLDB Endow.


The Dataset Scaling Problem (DSP) defined in previous work states: Given an empirical set of relational tables D and a scale factor s, generate a database state D̃ that is similar to D but s times its size. A DSP solution is useful for application development (s < 1), scalability testing (s > 1) and anonymization (s = 1). Current solutions assume all table sizes scale by the same ratio s. However, a real database tends to have tables that grow at different rates. This paper therefore considers non-uniform scaling (nuDSP), a DSP generalization where, instead of a single scale factor s, tables can scale by different factors. Dscaler is the first solution for nuDSP. It follows previous work in achieving similarity by reproducing correlation among the primary and foreign keys. However, it introduces the concept of a correlation database that captures fine-grained, per-tuple correlation. Experiments with well-known real and synthetic datasets D show that Dscaler produces D̃ with greater similarity to D than state-of-the-art techniques. Here, similarity is measured by number of tuples, frequency distribution of foreign key references, and multi-join aggregate queries.
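Two of the quantities mentioned in this abstract can be made concrete with a short Python sketch: the per-table target sizes implied by non-uniform scale factors, and a frequency distribution of foreign key references. The function names, table names and sample values are assumptions made for illustration; this is not Dscaler's algorithm and does not involve its correlation database.

```python
from collections import Counter
from typing import Dict, Iterable


def target_sizes(sizes: Dict[str, int], factors: Dict[str, float]) -> Dict[str, int]:
    """Per-table target sizes for non-uniform scaling: round(s_T * |T|) for each table T."""
    return {table: round(factors[table] * n) for table, n in sizes.items()}


def fk_frequency_distribution(fk_values: Iterable[int]) -> Dict[int, int]:
    """Map k to the number of parent keys that are referenced exactly k times.

    Parent keys that are never referenced (k = 0) do not appear in the
    foreign key column, so they are not counted here.
    """
    refs_per_parent = Counter(fk_values)
    return dict(Counter(refs_per_parent.values()))


# Hypothetical two-table schema: orders.user_id references users.id.
sizes = {"users": 100, "orders": 1000}
factors = {"users": 2.0, "orders": 5.0}
targets = target_sizes(sizes, factors)

order_user_ids = [1, 1, 1, 2, 2, 3]
distribution = fk_frequency_distribution(order_user_ids)

print(targets)
print(distribution)
```

Comparing the distribution computed on D with the one computed on the scaled output gives one rough way to judge the foreign key similarity criterion the abstract lists.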

