Development of a Semi-synthetic Dataset as a Testbed for Big-Data Semantic Analytics

Techentin, Robert W.; Foti, Daniel; Li, Peter; Daniel, Erik; Gilbert, Barry K.; Holmes, David; Al-Saffar, Sinan

doi:10.1109/icsc.2014.45

Cited by 1 publication

(3 citation statements)

References 6 publications


“…In prior work, we demonstrated that query time varied based on complexity; however, we also observed substantial variability in Virtuoso execution time for queries of similar complexity [2]. In order to better understand the variability, we computed all of the instances of "in edges" and "out edges" of the semantic graph schema for two queries with seven joins but a 20-fold difference in execution time.…”

Section: Exploring the Dataset (mentioning)

confidence: 87%
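The in-edge and out-edge tally mentioned in the statement above can be illustrated with a minimal sketch. It assumes the graph is available as (subject, predicate, object) triples; the function name `edge_counts` and the sample triples are hypothetical and are not taken from the cited paper, whose actual method is not described here.

```python
from collections import Counter


def edge_counts(triples):
    """Count in-edges and out-edges per node for (subject, predicate, object) triples.

    Hypothetical helper for illustration only; not the authors' implementation.
    """
    in_edges = Counter()
    out_edges = Counter()
    for subject, _predicate, obj in triples:
        out_edges[subject] += 1  # the subject is the source of the edge
        in_edges[obj] += 1       # the object is the target of the edge
    return in_edges, out_edges


# Made-up sample data (assumption), standing in for a semantic graph.
triples = [
    ("person:1", "worksFor", "org:1"),
    ("person:2", "worksFor", "org:1"),
    ("person:1", "livesIn", "city:1"),
]

in_edges, out_edges = edge_counts(triples)
```

Comparing such tallies for the nodes touched by two queries with the same number of joins is one plausible way to look for structural differences that might explain differing execution times.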

“…With the development of distributed analyses [4] and large memory graph machines [5], it is important to develop semantically rich datasets for testing new semantic technologies. We extend our prior work [2] with an open, semi-synthetic, large, irregularly structured dataset for use in semantic analysis algorithm development and benchmarking of new triple-store technologies.…”

Section: Introduction (mentioning)

confidence: 99%

“…For complex queries greater than 28 joins, the MySQL platform is unable to complete the query in a reasonable amount of time. (44135) Derived from [2] …”

mentioning

confidence: 99%


Characterization of semi-synthetic dataset for big-data semantic analysis

Techentin, Foti, Al-Saffar, et al. 2014

2014 IEEE High Performance Extreme Computing Conference (HPEC)

Self Cite


Over the past decade, semantic databases have served as the basis for storing and analyzing complex, heterogeneous, and irregular data. While there are similarities with traditional relational database systems, semantic data stores provide a rich platform for conducting nontraditional analyses of data. In support of new graph analytic algorithms and specialized graph analytic hardware, we have developed a large semi-synthetic, semantically rich dataset. The construction of this dataset mimics the real-world scenario of using relational databases as the basis for semantic data construction. In order to achieve real-world variable distributions and variable dependencies, data.gov data was used as the basis for developing an approach to build arbitrarily large semi-synthetic datasets. The intent of the semi-synthetic dataset is to serve as a testbed for new semantic graph analyses and computational software/hardware platforms. The construction process and basic data characterization are described. All code related to the data collection, consolidation, and augmentation is available for distribution.
