Outlier detection for identifying wrong values in data is typically applied to a single dataset, searching it for values that behave unexpectedly. In this work, we instead propose an approach that combines the outcomes of two independent outlier detection runs to obtain a more reliable result and to avoid problems arising from natural outliers, that is, values that are exceptional in the dataset but nevertheless correct. Linked Data is especially suited to this idea, since it provides large amounts of data enriched with hierarchical information and contains explicit links between instances. In the first step, we apply outlier detection methods to the property values extracted from a single repository, using a novel approach for splitting the data into relevant subsets. In the second step, we exploit the owl:sameAs links of the instances to obtain additional property values and perform a second outlier detection on these values. Doing so allows us to confirm or reject the assessment that a value is wrong. Experiments on the DBpedia and NELL datasets demonstrate the feasibility of our approach.
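The two-step idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the interquartile-range test in the first step, the median-based check in the second step, the 5% tolerance, and all function names and sample values are assumptions chosen for illustration only, and the splitting of data into subsets is omitted.

```python
from statistics import median, quantiles


def iqr_outliers(values, k=1.5):
    """Step 1 (assumed method): flag keys whose value lies outside the Tukey fences."""
    q1, _, q3 = quantiles(list(values.values()), n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return {key for key, v in values.items() if v < low or v > high}


def cross_check(value, linked_values, tolerance=0.05):
    """Step 2 (assumed method): compare a flagged value with values reached via owl:sameAs.

    Returns "wrong" if the value deviates from the median of the linked values
    by more than `tolerance` (relative), "natural" if it agrees with them,
    and "unverified" if no linked values are available.
    """
    if not linked_values:
        return "unverified"
    reference = median(linked_values)
    if reference == 0:
        return "natural" if value == 0 else "wrong"
    deviation = abs(value - reference) / abs(reference)
    return "wrong" if deviation > tolerance else "natural"


def two_step_detection(values, linked):
    """Run step 1 on one repository, then confirm or reject each candidate with step 2."""
    return {
        key: cross_check(values[key], linked.get(key, []))
        for key in sorted(iqr_outliers(values))
    }


# Hypothetical population values from one repository.
populations = {
    "A": 1200, "B": 1500, "C": 1350, "D": 1400,
    "E": 1280, "F": 1450, "G": 980000, "H": 14,
}
# Hypothetical values for the same instances in linked repositories.
linked_populations = {"G": [975000, 982000], "H": [1400, 1420]}

verdicts = two_step_detection(populations, linked_populations)
```

In this toy data, both G and H are flagged by the first step. The linked values agree with G, so it is treated as a natural outlier, while H is contradicted by the linked values and is confirmed as wrong.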
Abstract. The Linked Data cloud is growing rapidly as more and more knowledge bases become available as Linked Data. Knowledge-based applications have to rely on efficient implementations of query languages like SPARQL in order to access the information contained in large datasets such as DBpedia, Freebase, or one of the many domain-specific RDF repositories. However, the retrieval of specific facts from an RDF dataset is often hindered by the lack of schema knowledge that would allow for query-time inference or the materialization of implicit facts. For example, if an RDF graph contains information about films and actors, but only Titanic starring Leonardo DiCaprio is stated explicitly, a query for all movies Leonardo DiCaprio acted in might not yield the expected answer. Only if the two properties starring and actedIn are declared inverse by a suitable schema can the missing link between the RDF entities be derived. In this work, we present an approach to enriching the schema of any RDF dataset with property axioms by means of statistical schema induction. The scalability of our implementation, which is based on association rule mining, as well as the quality of the automatically acquired property axioms, are demonstrated by an evaluation on DBpedia.
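The starring/actedIn example can be made concrete with a small Python sketch. It is a simplified stand-in for association rule mining rather than the implementation evaluated in the paper: the confidence measure, the 0.6 threshold, the function names, and the sample triples are all assumptions made for illustration.

```python
from collections import defaultdict


def pairs_by_property(triples):
    """Group (subject, object) pairs by their property."""
    pairs = defaultdict(set)
    for s, p, o in triples:
        pairs[p].add((s, o))
    return pairs


def inverse_confidence(triples, p, q):
    """Fraction of p(s, o) facts for which q(o, s) is also stated."""
    pairs = pairs_by_property(triples)
    if not pairs[p]:
        return 0.0
    hits = sum((o, s) in pairs[q] for s, o in pairs[p])
    return hits / len(pairs[p])


def materialize_inverse(triples, p, q, threshold=0.6):
    """If p and q look inverse in both directions, add the facts each one implies."""
    confident = (
        inverse_confidence(triples, p, q) >= threshold
        and inverse_confidence(triples, q, p) >= threshold
    )
    result = set(triples)
    if confident:
        pairs = pairs_by_property(triples)
        result |= {(o, q, s) for s, o in pairs[p]}
        result |= {(o, p, s) for s, o in pairs[q]}
    return result


# Hypothetical triples; only "Titanic starring Leonardo_DiCaprio" lacks its inverse.
triples = [
    ("Titanic", "starring", "Leonardo_DiCaprio"),
    ("Inception", "starring", "Leonardo_DiCaprio"),
    ("Leonardo_DiCaprio", "actedIn", "Inception"),
    ("Heat", "starring", "Al_Pacino"),
    ("Al_Pacino", "actedIn", "Heat"),
]

enriched = materialize_inverse(triples, "starring", "actedIn")
```

Once the two properties are accepted as inverse, the missing fact that Leonardo DiCaprio acted in Titanic is part of the enriched graph, so a query over actedIn would return it.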