Learning expressive linkage rules from sparse data

Petrovski, Petar; Bizer, Christian

doi:10.3233/sw-190356

Cited by 7 publications

(9 citation statements)

References 46 publications

(90 reference statements)


“…Similar to product classification, typically product linking will make use of product names (e.g., Kannan et al (2011); Gopalakrishnan et al (2012); Vandic et al (2012); van Bezu et al (2015); Shah et al (2018); Tracz et al (2020); Li et al (2020)) and descriptions (e.g., Petrovski et al (2014); Ristoski et al (2018); Li et al (2020)). The difference however, is that the task also makes use of a diverse range of structured product attributes (e.g., van Bezu et al (2015); Shah et al (2018); Petrovski and Bizer (2020); Li et al (2020)), often defined as 'key-value' pairs such as those that can be extracted from product specifications (e.g., product ID, model, brand, manufacturer). Intuitively, offers that have the similar sets of key-value pairs are more likely to match.…”

Section: Product Linking (mentioning)

confidence: 99%

“…Algorithms. Since the prediction of linking/matching of product offers depends on a notion of 'similarity', some methods will have an 'intermediary' step that converts product metadata features to similarity features (Vandic et al (2012); Li et al (2020); Petrovski and Bizer (2020)). This is typically done by applying similarity metrics - usually based on string form, or word/character distribution - to the textual feature representations of two offers.…”

Section: Product Linking (mentioning)

confidence: 99%
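
The intermediary step described in the statement above can be sketched as follows. This is a minimal illustration, not the method of any cited paper: the token-level Jaccard metric, the helper names `jaccard` and `similarity_features`, the sample offers, and the choice to mark a comparison as `None` when either value is missing are all assumptions made for the example.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity of two strings (case-insensitive)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strings are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def similarity_features(offer_a: dict, offer_b: dict, attributes: list) -> dict:
    """Turn two offers' textual attributes into one similarity score per attribute.

    Assumption: if either offer lacks a value, the feature is None rather
    than 0.0, so a downstream model can tell "missing" from "dissimilar".
    """
    features = {}
    for attr in attributes:
        value_a, value_b = offer_a.get(attr), offer_b.get(attr)
        if value_a is None or value_b is None:
            features[attr] = None
        else:
            features[attr] = jaccard(value_a, value_b)
    return features


# Hypothetical offers; the second one has no brand value.
offer_a = {"name": "Apple iPhone 11 64GB", "brand": "Apple"}
offer_b = {"name": "iPhone 11 64GB black", "brand": None}
features = similarity_features(offer_a, offer_b, ["name", "brand"])
```

Here the two names share three of five distinct tokens, and the brand comparison is left undefined because one side has no value.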


An Exploratory Study on Utilising the Web of Linked Data for Product Data Mining

Zhang

Song

2021

Preprint


The Linked Open Data practice has led to a significant growth of structured data on the Web in the last decade. Such structured data describe real-world entities in a machine-readable way, and have created an unprecedented opportunity for research in the field of Natural Language Processing. However, there is a lack of studies on how such data can be used, for what kind of tasks, and to what extent they can be useful for these tasks. This work focuses on the e-commerce domain to explore methods of utilising such structured data to create language resources that may be used for product classification and linking. We process billions of structured data points in the form of RDF n-quads, to create multi-million words of product-related corpora that are later used in three different ways for creating of language resources: training word embedding models, continued pretraining of BERT-like language models, and training Machine Translation models that are used as a proxy to generate product-related keywords. Our evaluation on an extensive set of benchmarks shows word embeddings to be the most reliable and consistent method to improve the accuracy on both tasks (with up to 6.9 percentage points in macro-average F1 on some datasets). The other two methods however, are not as useful. Our analysis shows that this could be due to a number of reasons, including the biased domain representation in the structured data and lack of vocabulary coverage. We share our datasets and discuss how our lessons learned could be taken forward to inform future research in this direction.

Keywords: linked data • web of data • schema.org • natural language processing • nlp • data mining • product mining


“…The coverage of models learned with different combinations of attributes can significantly vary as not all attributes contribute equally to the solution of the matching task [21]. Discovering the set of attributes that encode the most-identifying information, is crucial for the extraction of more focused profiling meta-information.…”

Section: Relevant Attributes (mentioning)

confidence: 99%

“…Under this group fall the following tasks: phones, headphones, and tvs. The matching methods used for evaluating these tasks need to especially address the challenge of low data density [21]. Group 3: Small and Difficult.…”

Section: Profiling and Grouping The Matching Tasks (mentioning)

confidence: 99%

Profiling Entity Matching Benchmark Tasks

Primpeli

Bizer

2020

Proceedings of the 29th ACM International Conference on Information & Knowledge Management

Self Cite


Entity matching is a central task in data integration which has been researched for decades. Over this time, a wide range of benchmark tasks for evaluating entity matching methods has been developed. This resource paper systematically complements, profiles, and compares 21 entity matching benchmark tasks. In order to better understand the specific challenges associated with different tasks, we define a set of profiling dimensions which capture central aspects of the matching tasks. Using these dimensions, we create groups of benchmark tasks having similar characteristics. Afterwards, we assess the difficulty of the tasks in each group by computing baseline evaluation results using standard feature engineering together with two common classification methods. In order to enable the exact reproducibility of evaluation results, matching tasks need to contain exactly defined sets of matching and non-matching record pairs, as well as a fixed development and test split. As this is not the case for some widely-used benchmark tasks, we complement these tasks with fixed sets of non-matching pairs, as well as fixed splits, and provide the resulting development and test sets for public download. By profiling and complementing the benchmark tasks, we support researchers to select challenging as well as diverse tasks and to compare matching systems on clearly defined grounds.


“…The data linking problem in data graphs has been the main focus of numerous studies (see [9,14] for survey), and applied in different research fields such as knowledge extraction [23,24], geospatial analysis [27], sentiment analysis [19,10], etc. Some of the existing approaches are based on expressive linking rules that can be learned from a set of existing reference links [18,16]. These rules consist of attribute-specific comparisons, aggregation functions along with different weights and thresholds.…”

Section: Related Work (mentioning)

confidence: 99%
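
The structure of such a rule (attribute-specific comparisons, an aggregation function, weights, and a threshold) can be sketched as below. This is an illustrative, hand-written rule under stated assumptions, not a rule learned by the cited approaches: the `RULE` layout, the weights, the 0.8 threshold, the weighted-average aggregation, and the decision to skip comparisons with missing values are all hypothetical choices for the example.

```python
from difflib import SequenceMatcher


def string_sim(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact(a: str, b: str) -> float:
    """1.0 if the two values are identical, else 0.0."""
    return 1.0 if a == b else 0.0


# Hypothetical rule: (attribute, comparison function, weight) plus a threshold.
RULE = {
    "comparisons": [("name", string_sim, 0.7), ("brand", exact, 0.3)],
    "threshold": 0.8,
}


def matches(rule: dict, rec_a: dict, rec_b: dict) -> bool:
    """Aggregate the comparisons by weighted average and apply the threshold.

    Assumption: a comparison is skipped when either record lacks the
    attribute, and the remaining weights are renormalised.
    """
    total = weight_sum = 0.0
    for attr, sim, weight in rule["comparisons"]:
        value_a, value_b = rec_a.get(attr), rec_b.get(attr)
        if value_a is None or value_b is None:
            continue
        total += weight * sim(value_a, value_b)
        weight_sum += weight
    if weight_sum == 0:
        return False  # nothing could be compared
    return total / weight_sum >= rule["threshold"]


# Hypothetical records that differ only in letter case of the name.
is_match = matches(
    RULE,
    {"name": "Canon EOS 80D", "brand": "Canon"},
    {"name": "canon eos 80d", "brand": "Canon"},
)
```

Renormalising over the available attributes is one simple way to keep a rule usable when values are missing; other treatments of missing values are possible.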

BECKEY: Understanding, comparing and discovering keys of different semantics in knowledge bases

Symeonidou

Armant

Pernelle

2020

Knowledge-Based Systems


Integrating data coming from different knowledge bases has been one of the most important tasks in the Semantic Web the last years. Keys have been considered to be very useful in the data linking task. A set of properties is considered a key if it uniquely identifies every resource in the data. To cope with the incompleteness of the data, three different key semantics have been proposed so far. We propose BECKEY, a semantic agnostic approach that discovers keys for all three semantics, succeeding to scale on large datasets. Our approach is able to discover keys under the presence of erroneous data or duplicates (i.e., almost keys). A formalisation of the three semantics along with the relations among them is provided. An extended experimental comparison of the three key semantics has taken place. The results allow a better understanding of the three semantics, providing insights on when each semantic is more appropriate for the task of data linking.
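
The key definition in the abstract above (a set of properties is a key if it uniquely identifies every resource) can be checked with a short sketch. This is not the BECKEY algorithm and does not reproduce any of the three key semantics: the function `is_key`, the single-valued property model, the sample resources, and the choice to ignore resources with a missing value are simplifying assumptions for illustration only.

```python
def is_key(resources: list, properties: list) -> bool:
    """Return True if no two resources share the same values for `properties`.

    Assumptions: each property holds at most one value per resource, and a
    resource missing any of the properties is ignored instead of compared.
    """
    seen = set()
    for resource in resources:
        values = tuple(resource.get(p) for p in properties)
        if None in values:
            continue  # incomplete resource: skipped under this simplification
        if values in seen:
            return False  # two resources share all values, so not a key
        seen.add(values)
    return True


# Hypothetical resources described by name and year.
resources = [
    {"name": "Alpha", "year": "2019"},
    {"name": "Alpha", "year": "2020"},
    {"name": "Beta", "year": "2020"},
    {"name": "Gamma"},  # no year value
]

name_is_key = is_key(resources, ["name"])
name_year_is_key = is_key(resources, ["name", "year"])
```

In this sample, `name` alone is not a key because two resources are both named "Alpha", while the pair (`name`, `year`) distinguishes every complete resource.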

