Nowadays, the volume of sensor data produced by a wide variety of devices is increasing rapidly. Collections of sensor data can be semantically described using ontologies, e.g., the Semantic Sensor Network (SSN) ontology. Although semantic enrichment adds meaning, semantic sensor data is considerably larger in volume than the raw sensor data. Moreover, the same measurement value can be observed several times, so a large number of repeated facts can be generated. We devise a compact, or factorized, representation of semantic sensor data in which repeated values are represented only once. To scale up to large datasets, a tabular representation is used to store and manage factorized semantic sensor data using Big Data technologies. We empirically study the effectiveness of the proposed factorized representation and the impact of factorization on query processing. Furthermore, we evaluate the effects of storing factorized RDF data in state-of-the-art RDF engines and in the proposed tabular representation. Results suggest that factorization techniques improve storage and query processing of sensor data, and that execution time can be reduced by up to two orders of magnitude.
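The idea of representing repeated values only once can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tuple layout `(obs_id, timestamp, sensor, prop, value, unit)`, the function names `factorize` and `expand`, and the generated measurement identifiers are all assumptions made for illustration. Each distinct measurement is stored once in a table, and every observation keeps only a reference to it.

```python
from collections import defaultdict


def factorize(observations):
    """Split observations into a table of distinct measurements and
    per-observation links that reference them.

    Each observation is assumed to be a tuple
    (obs_id, timestamp, sensor, prop, value, unit); this layout is
    hypothetical and chosen only for the example.
    """
    groups = defaultdict(list)
    for obs_id, timestamp, sensor, prop, value, unit in observations:
        # The repeated part of an observation is its measurement.
        groups[(sensor, prop, value, unit)].append((obs_id, timestamp))

    measurements = {}  # measurement id -> (sensor, prop, value, unit)
    links = []         # (obs_id, timestamp, measurement id)
    for index, (measurement, occurrences) in enumerate(groups.items()):
        m_id = f"m{index}"
        measurements[m_id] = measurement  # stored exactly once
        for obs_id, timestamp in occurrences:
            links.append((obs_id, timestamp, m_id))
    return measurements, links


def expand(measurements, links):
    """Rebuild the original observations from the factorized form."""
    return [(obs_id, ts, *measurements[m_id]) for obs_id, ts, m_id in links]
```

In this sketch the factorization is lossless: calling `expand` on the output of `factorize` yields the original observations, possibly in a different order. The saving grows with the number of observations that share the same measurement.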
Over the last years, the Semantic Web has been growing steadily. Today, we count more than 10,000 datasets made available online following Semantic Web standards. Nevertheless, many applications, such as data integration, search, and interlinking, may not take full advantage of the data without a priori statistical information about its internal structure and coverage. There are already a number of tools that offer such statistics, providing basic information about RDF datasets and vocabularies. However, these tools usually show severe performance deficiencies once the dataset size grows beyond the capabilities of a single machine. In this paper, we introduce a software component for statistical calculations on large RDF datasets, which scales out to clusters of machines. More specifically, we describe the first distributed in-memory approach for computing 32 different statistical criteria for RDF datasets using Apache Spark. The preliminary results show that our distributed approach improves upon a previous centralized approach we compare against and provides approximately linear horizontal scale-up. The component is extensible beyond the 32 default criteria, is integrated into the larger SANSA framework, and is employed in at least four major usage scenarios beyond the SANSA community.
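To make the notion of a statistical criterion concrete, the following Python sketch computes a few simple ones (triple count, property usage, class usage, distinct subjects) per partition and then merges the partial results. It mimics the map-then-combine pattern of a distributed computation in plain Python and does not use Apache Spark or SANSA. The function names, the choice of criteria, and the representation of triples as `(subject, predicate, object)` string tuples are assumptions made for illustration, not the component's actual API.

```python
from collections import Counter
from functools import reduce

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def partition_stats(triples):
    """Compute partial statistics for one partition of triples."""
    stats = {
        "triples": 0,
        "property_usage": Counter(),
        "class_usage": Counter(),
        "subjects": set(),
    }
    for subject, predicate, obj in triples:
        stats["triples"] += 1
        stats["property_usage"][predicate] += 1
        stats["subjects"].add(subject)
        if predicate == RDF_TYPE:
            # Objects of rdf:type statements are counted as used classes.
            stats["class_usage"][obj] += 1
    return stats


def merge_stats(left, right):
    """Combine two partial results; the operation is associative,
    so partitions can be merged in any order."""
    return {
        "triples": left["triples"] + right["triples"],
        "property_usage": left["property_usage"] + right["property_usage"],
        "class_usage": left["class_usage"] + right["class_usage"],
        "subjects": left["subjects"] | right["subjects"],
    }


def dataset_stats(partitions):
    """Map each partition to partial statistics, then reduce them."""
    return reduce(merge_stats, map(partition_stats, partitions))
```

Because each partition is processed independently and the merge step is associative, the same structure could in principle be spread over several machines, which is the property a horizontally scalable approach relies on.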