Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins

Neumann, Thomas; Moerkotte, Guido

doi:10.1109/icde.2011.5767868

Cited by 167 publications

(206 citation statements)

References 7 publications

Supporting

Mentioning

199

Contrasting

“…The first query is used for identifying "characteristic sets" [13]: frequently co-occurring properties with a subject. The second identifies all the properties used in the dataset and sorts them according to their frequency.…”

Section: Methods

mentioning

confidence: 99%
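The two statistics this statement describes can be sketched in a few lines of Python. This is a minimal illustration over an in-memory list of (subject, property, object) tuples, not the SPARQL queries used in the citing paper; the function names and the toy triples are assumptions made for the example.

```python
from collections import Counter, defaultdict


def characteristic_sets(triples):
    """Count how many subjects share each exact set of properties."""
    props_by_subject = defaultdict(set)
    for subject, prop, _obj in triples:
        props_by_subject[subject].add(prop)
    return Counter(frozenset(props) for props in props_by_subject.values())


def property_frequencies(triples):
    """Return (property, triple count) pairs, most frequent first."""
    return Counter(prop for _s, prop, _o in triples).most_common()


# Toy data, invented for illustration only.
triples = [
    ("s1", "name", "Alice"), ("s1", "email", "a@example.org"),
    ("s2", "name", "Bob"), ("s2", "email", "b@example.org"),
    ("s3", "name", "Carol"),
]
cs_counts = characteristic_sets(triples)
prop_counts = property_frequencies(triples)
```

On this toy input, two subjects share the set {name, email} and one subject has only {name}, which is the kind of grouping the first query is said to identify.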

Robust Runtime Optimization and Skew-Resistant Execution of Analytical SPARQL Queries on Pig

Kotoulas, Urbani, Boncz, et al. 2012

The Semantic Web – ISWC 2012

Abstract. We describe a system that incrementally translates SPARQL queries to Pig Latin and executes them on a Hadoop cluster. This system is designed to work efficiently on complex queries with many self-joins over huge datasets, avoiding job failures even in the case of joins with unexpected high-value skew. To be robust against cost estimation errors, our system interleaves query optimization with query execution, determining the next steps to take based on data samples and statistics gathered during the previous step. Furthermore, we have developed a novel skew-resistant join algorithm that replicates tuples corresponding to popular keys. We evaluate the effectiveness of our approach both on a synthetic benchmark known to generate complex queries (BSBM-BI) as well as on a Yahoo! case of data analysis using RDF data crawled from the web. Our results indicate that our system is indeed capable of processing huge datasets without pre-computed statistics while exhibiting good load-balancing properties.

“…We obtain a more compact schema than [10], by using the TF/IDF (Term Frequency/Inverted Document Frequency) measure from information retrieval [16] to detect discriminative properties, and using semantic information to merge similar CS's. Further, a schema graph of CS's is created by analyzing the co-reference relationship statistics between CS's.…”

Section: Emergent Schemas

mentioning

confidence: 99%

“…This was observed in the proposal to make SPARQL query optimization more reliable by recognizing "characteristics sets" [10]. A characteristic set is a combination of properties that typically co-occur with the same subject.…”

mentioning

confidence: 99%
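One way such statistics can support query optimization: given a count of subjects per characteristic set, the number of distinct subjects matching a star-shaped query can be estimated by summing the counts of every set that contains all of the query's properties. The sketch below assumes the counts are already available as a dictionary; the function name and the toy counts are hypothetical and not taken from the cited paper.

```python
def estimate_distinct_subjects(cs_counts, query_props):
    """Sum the subject counts of every characteristic set covering the query."""
    wanted = frozenset(query_props)
    return sum(count for cs, count in cs_counts.items() if wanted <= cs)


# Hypothetical counts: characteristic set -> number of subjects having exactly it.
cs_counts = {
    frozenset({"name", "email"}): 2,
    frozenset({"name"}): 1,
}
estimate = estimate_distinct_subjects(cs_counts, {"name"})
```

A property that appears in no characteristic set yields an estimate of zero, since no subject can match the query.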

“…A characteristic set is a combination of properties that typically co-occur with the same subject. The work in [10] found that this number is limited to a few thousand on even the most complex LOD datasets (like DBpedia), and the CWI research on emergent schema detection that started in the LOD2 project [15] aims to further reduce the amount of characteristic sets to the point that characteristics sets become tables in a table of limited size (less than 100), i.e. further reducing the size.…”

mentioning

confidence: 99%

Advances in Large-Scale RDF Data Management

Boncz, Erling, Pham 2014

Linked Open Data -- Creating Knowledge Out of Interlinked Data

Abstract. One of the prime goals of the LOD2 project is improving the performance and scalability of RDF storage solutions so that the increasing amount of Linked Open Data (LOD) can be efficiently managed. Virtuoso has been chosen as the basic RDF store for the LOD2 project, and during the project it has been significantly improved by incorporating advanced relational database techniques from MonetDB and Vectorwise, turning it into a compressed column store with vectored execution. This has reduced the performance gap ("RDF tax") between Virtuoso's SQL and SPARQL query performance in a way that still respects the "schema-last" nature of RDF. However, by lacking schema information, RDF database systems such as Virtuoso still cannot use advanced relational storage optimizations such as table partitioning or clustered indexes and have to execute SPARQL queries with many self-joins to a triple table, which leads to more join effort than needed in SQL systems. In this chapter, we first discuss the new column store techniques applied to Virtuoso, the enhancements in its cluster parallel version, and show its performance using the popular BSBM benchmark at the unsurpassed scale of 150 billion triples. We finally describe ongoing work in deriving an "emergent" relational schema from RDF data, which can help to close the performance gap between relational-based and RDF-based storage solutions.

General Objectives

One of the objectives of the LOD2 EU project is to boost the performance and the scalability of RDF storage solutions so that they can efficiently manage huge datasets of Linked Open Data (LOD). However, it has been noted that, given similar data management tasks, relational database technology significantly outperforms RDF data stores. One controlled scenario in which the two technologies can be compared is the BSBM benchmark [2], which exists in equivalent relational and RDF variants. As illustrated in Fig. 1, while the SQL systems can process up to 40-175K QMpH, the triple stores reach only 1-10K QMpH, a performance difference of a factor of 15-40.

“…However, the lack of a central schema causes a series of difficulties in the consumption of such data (e.g., [9,11,1,14]), e.g., having two different population numbers in the same KB. For instance, data users and knowledge engineers need an understanding of what information is available in order to write queries, and to reuse or engineer KBs [15,26]. In data management, cardinality is an important aspect of the structure of data.…”

Section: Introduction

mentioning

confidence: 99%

Mining Cardinalities from Knowledge Bases

Muñoz, Nickles 2017

Lecture Notes in Computer Science

Abstract. Cardinality is an important structural aspect of data that has not received enough attention in the context of RDF knowledge bases (KBs). Information about cardinalities can be useful for data users and knowledge engineers when writing queries, reusing or engineering KBs. Such cardinalities can be declared using OWL and RDF constraint languages as constraints on the usage of properties over instance data. However, their declaration is optional and consistency with the instance data is not ensured. In this paper, we address the problem of mining cardinality bounds for properties to discover structural characteristics of KBs, and use these bounds to assess completeness. Because KBs are incomplete and error-prone, we apply statistical methods for filtering property usage and for finding accurate and robust patterns. Accuracy of the cardinality patterns is ensured by properly handling equality axioms (owl:sameAs); and robustness by filtering outliers. We report an implementation of our algorithm with two variants using SPARQL 1.1 and Apache Spark, and their evaluation on real-world and synthetic data.
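To make the idea of a cardinality bound concrete, the following Python sketch computes, for each property, the minimum and maximum number of distinct objects attached to any one subject. It is a simplified illustration only: it does not handle owl:sameAs equality or outlier filtering as the abstract describes, and the function name and toy triples are assumptions rather than the authors' algorithm.

```python
from collections import defaultdict


def cardinality_bounds(triples):
    """Map each property to (min, max) distinct objects per subject.

    Only subjects that use the property at least once are counted,
    so the lower bound is never below 1.
    """
    objects = defaultdict(set)  # (subject, property) -> distinct objects
    for subject, prop, obj in triples:
        objects[(subject, prop)].add(obj)

    counts = defaultdict(list)  # property -> per-subject object counts
    for (_subject, prop), objs in objects.items():
        counts[prop].append(len(objs))
    return {prop: (min(c), max(c)) for prop, c in counts.items()}


# Toy data, invented for illustration: one city carries two population values.
triples = [
    ("city1", "population", "100"), ("city1", "population", "120"),
    ("city2", "population", "50"),
    ("city1", "name", "A"), ("city2", "name", "B"),
]
bounds = cardinality_bounds(triples)
```

Here the upper bound of 2 for the hypothetical `population` property flags the subject with two conflicting values, the kind of inconsistency a mined bound could help surface.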
