Abstract. This paper addresses the problem of building scalable semantic overlay networks. Our approach follows the principle of data independence by separating a logical layer, the semantic overlay for managing and mapping data and metadata schemas, from a physical layer consisting of a structured peer-to-peer overlay network for efficient routing of messages. The physical layer is used to implement various functions at the logical layer, including attribute-based search, schema management and schema mapping management. The separation of a physical from a logical layer allows us to process logical operations in the semantic overlay using different physical execution strategies. In particular we identify iterative and recursive strategies for the traversal of semantic overlay networks as two important alternatives. At the logical layer we support semantic interoperability through schema inheritance and Semantic Gossiping. Thus our system provides a complete solution to the implementation of semantic overlay networks supporting both scalability and interoperability.
Retrieving semi-structured entities to answer keyword queries is an increasingly important feature of many modern Web applications. The fast-growing Linked Open Data (LOD) movement makes it possible to crawl and index very large amounts of structured data describing hundreds of millions of entities. However, entity retrieval approaches have yet to find efficient and effective ways of ranking and navigating through those large data sets. In this paper, we address the problem of Ad-hoc Object Retrieval over large-scale LOD data by proposing a hybrid approach that combines IR and structured search techniques. Specifically, we propose an architecture that exploits an inverted index to answer keyword queries as well as a semi-structured database to improve the search effectiveness by automatically generating queries over the LOD graph. Experimental results show that our ranking algorithms exploiting both IR and graph indices outperform state-of-the-art entity retrieval techniques by up to 25% over the BM25 baseline.
When searching for information, a human reader first glances over a document, spots relevant sections and then focuses on a few sentences for resolving her intention. However, the high variance of document structure complicates to identify the salient topic of a given section at a glance. To tackle this challenge, we present SECTOR, a model to support machine reading systems by segmenting documents into coherent sections and assigning topic labels to each section. Our deep neural network architecture learns a latent topic embedding over the course of a document. This can be leveraged to classify local topics from plain text and segment a document at topic shifts. In addition, we contribute WikiSection, a publicly available dataset with 242k labeled sections in English and German from two distinct domains: diseases and cities. From our extensive evaluation of 20 architectures, we report a highest score of 71.6% F1 for the segmentation and classification of 30 topics from the English city domain, scored by our SECTOR LSTM model with bloom filter embeddings and bidirectional segmentation. This is a significant improvement of 29.5 points F1 compared to state-of-the-art CNN classifiers with baseline segmentation. 1 Our source code is available under the Apache License 2.0 at https
Information and communication infrastructures underwent a rapid and extreme decentralization process over the past decade: From a world of statically and partially connected central servers rose an intricate web of millions of information sources loosely connecting one to another. Today, we expect to witness the extension of this revolution with the wide adoption of meta-data standards like RDF or OWL underpinning the creation of a semantic web. Again, we hope for global properties to emerge from a multiplicity of pair-wise, local interactions, resulting eventually in a self-stabilizing semantic infrastructure. This paper represents an effort to summarize the conditions under which this revolution would take place as well as an attempt to underline its main properties, limitations and possible applications. The work presented in this paper reflects the current status of a collaborative effort initiated by the IFIP 2.6 Working Group on Data Semantics.
Processing large volumes of RDF data requires sophisticated tools. In recent years, much effort was spent on optimizing native RDF stores and on repurposing relational query engines for large-scale RDF processing. Concurrently, a number of new data management systemsregrouped under the NoSQL (for "not only SQL") umbrella-rapidly rose to prominence and represent today a popular alternative to classical databases. Though NoSQL systems are increasingly used to manage RDF data, it is still difficult to grasp their key advantages and drawbacks in this context. This work is, to the best of our knowledge, the first systematic attempt at characterizing and comparing NoSQL stores for RDF processing. In the following, we describe four different NoSQL stores and compare their key characteristics when running standard RDF benchmarks on a popular cloud infrastructure using both single-machine and distributed deployments.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.