Petar Ristoski scite author profile

RDF2Vec: RDF Graph Embeddings for Data Mining

Abstract. Linked Open Data has been recognized as a valuable source for background information in data mining. However, most data mining tools require features in propositional form, i.e., a vector of nominal or numerical features associated with an instance, while Linked Open Data sources are graphs by nature. In this paper, we present RDF2Vec, an approach that uses language modeling approaches for unsupervised feature extraction from sequences of words, and adapts them to RDF graphs. We generate sequences by leveraging local information from graph substructures, harvested by Weisfeiler-Lehman Subtree RDF Graph Kernels and graph walks, and learn latent numerical representations of entities in RDF graphs. Our evaluation shows that such vector representations outperform existing techniques for the propositionalization of RDF graphs on a variety of different predictive machine learning tasks, and that feature vector representations of general knowledge graphs such as DBpedia and Wikidata can be easily reused for different tasks.

RDF2Vec: RDF graph embeddings and their applications

Ristoski et al. 2019

Linked Open Data has been recognized as a valuable source for background information in many data mining and information retrieval tasks. However, most of the existing tools require features in propositional form, i.e., a vector of nominal or numerical features associated with an instance, while Linked Open Data sources are graphs by nature. In this paper, we present RDF2Vec, an approach that uses language modeling approaches for unsupervised feature extraction from sequences of words, and adapts them to RDF graphs. We generate sequences by leveraging local information from graph sub-structures, harvested by Weisfeiler-Lehman Subtree RDF Graph Kernels and graph walks, and learn latent numerical representations of entities in RDF graphs. We evaluate our approach on three different tasks: (i) standard machine learning tasks, (ii) entity and document modeling, and (iii) content-based recommender systems. The evaluation shows that the proposed entity embeddings outperform existing techniques, and that pre-computed feature vector representations of general knowledge graphs such as DBpedia and Wikidata can be easily reused for different tasks.
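
The graph-walk step mentioned in the abstract can be illustrated with a minimal sketch. It assumes a toy graph stored as a list of triples and uses uniform random walks over outgoing edges; the entity names, the `build_adjacency` and `random_walks` helpers, and the depth and walk-count parameters are all hypothetical and are not taken from the paper. The Weisfeiler-Lehman kernel variant is not shown.

```python
import random

# Toy RDF graph as (subject, predicate, object) triples; all names are made up.
TRIPLES = [
    ("ex:Berlin", "ex:capitalOf", "ex:Germany"),
    ("ex:Germany", "ex:memberOf", "ex:EU"),
    ("ex:Paris", "ex:capitalOf", "ex:France"),
    ("ex:France", "ex:memberOf", "ex:EU"),
]


def build_adjacency(triples):
    """Map each subject to its list of outgoing (predicate, object) edges."""
    adjacency = {}
    for s, p, o in triples:
        adjacency.setdefault(s, []).append((p, o))
    return adjacency


def random_walks(adjacency, start, depth, n_walks, seed=0):
    """Return n_walks token sequences of at most `depth` hops from `start`."""
    rng = random.Random(seed)
    walks = []
    for _ in range(n_walks):
        walk, node = [start], start
        for _ in range(depth):
            edges = adjacency.get(node)
            if not edges:  # dead end: no outgoing edges
                break
            predicate, node = rng.choice(edges)
            walk.extend([predicate, node])
        walks.append(walk)
    return walks


adjacency = build_adjacency(TRIPLES)
walks = random_walks(adjacency, "ex:Berlin", depth=2, n_walks=3)
```

Each walk is a sequence of alternating entity and predicate tokens, which can then be treated like a sentence and passed to a word2vec-style language model to learn one vector per entity.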

I See a Car Crash: Real-Time Detection of Small Scale Incidents in Microblogs

Schulz, Ristoski, Paulheim 2013

Semantic Web in Data Mining and Knowledge Discovery: A Comprehensive Survey

Ristoski, Paulheim 2016, SSRN Journal

Abstract. Data Mining and Knowledge Discovery in Databases (KDD) is a research field concerned with deriving higher-level insights from data. The tasks performed in that field are knowledge intensive and can often benefit from using additional knowledge from various sources. Therefore, many approaches have been proposed in this area that combine Semantic Web data with the data mining and knowledge discovery process. This survey article gives a comprehensive overview of those approaches in different stages of the knowledge discovery process. As an example, we show how Linked Open Data can be used at various stages for building content-based recommender systems. The survey shows that, while there are numerous interesting research works performed, the full potential of the Semantic Web and Linked Open Data for data mining and KDD is still to be unlocked.

A Collection of Benchmark Datasets for Systematic Evaluations of Machine Learning on the Semantic Web

Ristoski, Vries, Paulheim 2016

Abstract. Resource type: Datasets. Permanent URL: http://w3id.org/sw4ml-datasets. In recent years, several approaches for machine learning on the Semantic Web have been proposed. However, no extensive comparisons between those approaches have been undertaken, in particular due to a lack of publicly available, acknowledged benchmark datasets. In this paper, we present a collection of 22 benchmark datasets of different sizes. Such a collection of datasets can be used to conduct quantitative performance testing and systematic comparisons of approaches.

The Mannheim Search Join Engine

et al. 2015

Feature Selection in Hierarchical Feature Spaces

Ristoski, Paulheim 2014

Abstract. Feature selection is an important preprocessing step in data mining, which has an impact on both the runtime and the result quality of the subsequent processing steps. While there are many cases where hierarchic relations between features exist, most existing feature selection approaches are not capable of exploiting those relations. In this paper, we introduce a method for feature selection in hierarchical feature spaces. The method first eliminates redundant features along paths in the hierarchy, and further prunes the resulting feature set based on the features' relevance. We show that our method yields a good trade-off between feature space compression and classification accuracy, and outperforms both standard approaches as well as other approaches which also exploit hierarchies.
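
The two stages named in the abstract (removing redundant features along hierarchy paths, then pruning by relevance) can be sketched as follows. This is only an illustration under loud assumptions: redundancy is measured as the fraction of matching values between a feature and its direct parent, the more specific (child) feature is the one dropped, and relevance pruning keeps features at or above the mean relevance of the survivors. The `select_features` helper, the threshold, and the toy data are hypothetical and are not the paper's exact procedure.

```python
def select_features(columns, parent_of, relevance, sim_threshold=1.0):
    """Two-stage selection over a feature hierarchy (illustrative only).

    columns:   feature name -> list of 0/1 values, one per instance
    parent_of: feature name -> parent feature name, or None for a root
    relevance: feature name -> relevance score (e.g. information gain)
    """
    def similarity(a, b):
        # Fraction of instances on which the two features agree.
        matches = sum(x == y for x, y in zip(columns[a], columns[b]))
        return matches / len(columns[a])

    # Stage 1 (assumption): drop a feature that is redundant with its parent.
    kept = {
        f for f in columns
        if parent_of.get(f) is None
        or similarity(f, parent_of[f]) < sim_threshold
    }

    # Stage 2 (assumption): keep features at or above the mean relevance.
    mean_relevance = sum(relevance[f] for f in kept) / len(kept)
    return sorted(f for f in kept if relevance[f] >= mean_relevance)


# Toy type hierarchy: Thing > City > Capital, Thing > Person.
columns = {
    "Thing": [1, 1, 1, 1],
    "City": [1, 1, 0, 0],
    "Capital": [1, 1, 0, 0],
    "Person": [0, 0, 1, 1],
}
parent_of = {"Thing": None, "City": "Thing", "Capital": "City", "Person": "Thing"}
relevance = {"Thing": 0.0, "City": 1.0, "Capital": 1.0, "Person": 1.0}

selected = select_features(columns, parent_of, relevance)
```

In this toy example, `Capital` duplicates its parent `City` and is removed in the first stage, while the uninformative root `Thing` is removed in the second.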

A machine learning approach for product matching and categorization

Ristoski, Petrovski, Mika et al. 2018

Consumers today have the option to purchase products from thousands of e-shops. However, the completeness of the product specifications and the taxonomies used for organizing the products differ across different e-shops. To improve the consumer experience, approaches for product integration on the Web are needed. In this paper, we present an approach that leverages deep learning techniques in combination with standard classification approaches for product matching and categorization. In our approach we use structured product data as supervision for training feature extraction models able to extract attribute-value pairs from textual product descriptions. To minimize the need for lots of data for supervision, we use neural language models to produce word embeddings from large quantities of publicly available product data marked up with Microdata, which boost the performance of the feature extraction model, thus leading to better product matching and categorization performances. Furthermore, we use a deep Convolutional Neural Network to produce image embeddings from product images, which further improve the results on both tasks.

RDF2Vec: RDF Graph Embeddings for Data Mining
