Researchers and developers use benchmarks to compare their algorithms and products. A database benchmark must have a dataset D. To be application-specific, this dataset D should be empirical. However, D may be too small, or too large, for the benchmarking experiments. D must, therefore, be scaled to the desired size. To ensure the scaled dataset D̃ is similar to D, previous work typically specifies or extracts a fixed set of features F = {F1, F2, ..., Fn} from D, then uses F to generate synthetic data for D̃. However, this approach (D → F → D̃) becomes increasingly intractable as F gets larger, so a new solution is necessary. Unlike existing approaches, this paper proposes ASPECT, which scales D first and enforces similarity afterwards. ASPECT first uses a size-scaler (S0) to scale D to D̃. Then the user selects a set of desired features F1, ..., Fn. For each desired feature Fk, there is a tweaking tool Tk that tweaks D̃ to make sure D̃ has the required feature Fk. ASPECT coordinates the tweaking by T1, ..., Tn of D̃, so Tn(···(T1(D̃))···) has the required features F1, ..., Fn. By shifting from D → F → D̃ to D → D̃ → F, data scaling becomes flexible. Users can customise the scaled dataset with the features they are interested in. Extensive experiments on real datasets show that ASPECT can enforce similarity in the dataset effectively and efficiently.
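The scale-then-tweak pipeline described above can be illustrated with a minimal Python sketch. It is not ASPECT's implementation: the dataset is a toy list of integers, and `size_scale`, `apply_tools`, `clamp_max_to_9` and `make_first_zero` are hypothetical names invented for this example. The sketch only shows the composition Tn(···(T1(D̃))···); it does not model how ASPECT coordinates the tools so that a later tweak does not undo an earlier feature.

```python
from functools import reduce
from typing import Callable, List

Dataset = List[int]                      # toy stand-in for a relational dataset
Tool = Callable[[Dataset], Dataset]      # a tweaking tool maps a dataset to a dataset


def size_scale(d: Dataset, s: float) -> Dataset:
    """Toy size-scaler S0: cycle through the rows of d until round(s * len(d)) rows exist."""
    target = round(s * len(d))
    return [d[i % len(d)] for i in range(target)]


def apply_tools(d_scaled: Dataset, tools: List[Tool]) -> Dataset:
    """Apply the tools in order, i.e. compute Tn(...(T1(d_scaled))...)."""
    return reduce(lambda data, tool: tool(data), tools, d_scaled)


# Two hypothetical tweaking tools, each enforcing one toy "feature".
def clamp_max_to_9(d: Dataset) -> Dataset:
    """Feature: no value exceeds 9."""
    return [min(x, 9) for x in d]


def make_first_zero(d: Dataset) -> Dataset:
    """Feature: the first row is 0."""
    return [0] + d[1:] if d else d


original = [3, 12, 7, 15]
scaled = size_scale(original, 1.5)                       # D -> D~ (size only)
tweaked = apply_tools(scaled, [clamp_max_to_9, make_first_zero])
print(len(scaled), tweaked)
```

In this toy case the two tools happen not to interfere with each other. In general they can, which is why the abstract stresses that the tweaking has to be coordinated.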
The Dataset Scaling Problem (DSP) defined in previous work states: Given an empirical set of relational tables D and a scale factor s, generate a database state D̃ that is similar to D but s times its size. A DSP solution is useful for application development (s < 1), scalability testing (s > 1) and anonymization (s = 1). Current solutions assume all table sizes scale by the same ratio s. However, a real database tends to have tables that grow at different rates. This paper therefore considers non-uniform scaling (nuDSP), a DSP generalization where, instead of a single scale factor s, tables can scale by different factors. Dscaler is the first solution for nuDSP. It follows previous work in achieving similarity by reproducing correlation among the primary and foreign keys. However, it introduces the concept of a correlation database that captures fine-grained, per-tuple correlation. Experiments with well-known real and synthetic datasets D show that Dscaler produces D̃ with greater similarity to D than state-of-the-art techniques. Here, similarity is measured by number of tuples, frequency distribution of foreign key references, and multi-join aggregate queries.
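To make two of these notions concrete, the following Python sketch computes per-table target sizes under non-uniform scale factors and a simple frequency distribution of foreign key references. It is an illustration under stated assumptions, not Dscaler's algorithm: the table names, the factors, and the helpers `target_sizes` and `fk_reference_distribution` are all hypothetical, and the correlation database is not modelled.

```python
from collections import Counter
from typing import Dict, List


def target_sizes(sizes: Dict[str, int], factors: Dict[str, float]) -> Dict[str, int]:
    """nuDSP-style targets: each table gets its own scale factor."""
    return {table: round(n * factors[table]) for table, n in sizes.items()}


def fk_reference_distribution(fk_column: List[int]) -> Dict[int, int]:
    """Map k -> number of parent keys referenced exactly k times.

    Parent keys that are never referenced do not appear in fk_column,
    so they are not counted here.
    """
    references_per_parent = Counter(fk_column)
    return dict(Counter(references_per_parent.values()))


# Hypothetical schema: orders.user_id references users.id.
sizes = {"users": 100, "orders": 400}
factors = {"users": 2.0, "orders": 5.0}      # tables grow at different rates
targets = target_sizes(sizes, factors)

# Parent 1 is referenced twice, parent 2 once, parent 3 three times.
distribution = fk_reference_distribution([1, 1, 2, 3, 3, 3])
print(targets, distribution)
```

Comparing such a distribution for D and for the scaled dataset is one simple way to check the "frequency distribution of foreign key references" criterion mentioned above.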