The WDC Training Dataset and Gold Standard for Large-Scale Product Matching

Primpeli, Anna; Peeters, Ralph; Bizer, Christian

doi:10.1145/3308560.3316609

Cited by 56 publications

(36 citation statements)

References 14 publications


“…We compare the performance of JointBERT to the performance of BERT, RoBERTa, Ditto, Deepmatcher, Magellan, and a word cooccurrence baseline using five entity matching benchmark datasets. Two of these datasets, WDC LSPC [27] and DI2KG monitors [8], model multi-source settings and fulfil the requirement from the problem statement in Section 3 that multiple entity descriptions should be available for many of the described entities. The other three datasets, abt-buy, dblp-scholar and company [23], do not fulfill this requirement and are included in order to evaluate the performance of JointBERT in traditional two-source settings.…”

Section: Methods (mentioning)

confidence: 99%

“…Below, we describe the datasets. [27] for the evaluation. The datasets were built by extracting product offers from the Common Crawl.…”

Section: Datasets (mentioning)

confidence: 99%

“…In addition to the previous experiments, we further analyze the strong performance of JointBERT when trained using large amounts of training data by evaluating all models, trained with the computers xlarge training set [footnote 5: Results for other sets are found at https://github.com/wbsg-uni-mannheim/jointbert], on the test set of the Semantic Web Challenge on Mining the Web of HTML-embedded Product Data (MWPD) which took place at ISWC2020 [37]. This test set contains 1,500 product offer pairs from the WDC LSPC computers category [27]. For the purposes of the challenge, this test set was made intentionally hard by selecting mainly very hard pairs as well as further augmenting some of them to derive subsets of pairs posing specific matching challenges (100 pairs each).…”

Section: Challenge-specific Analysis (mentioning)

confidence: 99%


Dual-objective fine-tuning of BERT for entity matching

Peeters

Bizer

2021

Proc. VLDB Endow.

Self Cite


An increasing number of data providers have adopted shared numbering schemes such as GTIN, ISBN, DUNS, or ORCID numbers for identifying entities in the respective domain. This means for data integration that shared identifiers are often available for a subset of the entity descriptions to be integrated while such identifiers are not available for others. The challenge in these settings is to learn a matcher for entity descriptions without identifiers using the entity descriptions containing identifiers as training data. The task can be approached by learning a binary classifier which distinguishes pairs of entity descriptions for the same real-world entity from descriptions of different entities. The task can also be modeled as a multi-class classification problem by learning classifiers for identifying descriptions of individual entities. We present a dual-objective training method for BERT, called JointBERT, which combines binary matching and multi-class classification, forcing the model to predict the entity identifier for each entity description in a training pair in addition to the match/non-match decision. Our evaluation across five entity matching benchmark datasets shows that dual-objective training can increase the matching performance for seen products by 1% to 5% F1 compared to single-objective Transformer-based methods, given that enough training data is available for both objectives. In order to gain a deeper understanding of the strengths and weaknesses of the proposed method, we compare JointBERT to several other BERT-based matching methods as well as baseline systems along a set of specific matching challenges. This evaluation shows that JointBERT, given enough training data for both objectives, outperforms the other methods on tasks involving seen products, while it underperforms for unseen products. 
Using a combination of LIME explanations and domain-specific word classes, we analyze the matching decisions of the different deep learning models and conclude that BERT-based models are better at focusing on relevant word classes compared to RNN-based models.
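The dual-objective idea in the abstract above can be illustrated with a minimal sketch. The abstract states only that the model predicts the entity identifier for each description in a pair in addition to the match/non-match decision; how the losses are combined is not given here, so the unweighted sum below is an assumption, and all function and parameter names are hypothetical. The logits would come from a BERT encoder in practice; here they are plain numbers.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of a softmax over `logits` against one target class."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]


def binary_cross_entropy(logit, label):
    """Binary cross-entropy for one match/non-match logit (label is 0 or 1)."""
    # softplus(z) = log(1 + exp(z)), computed in a numerically stable way
    softplus = max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))
    return softplus - label * logit


def dual_objective_loss(match_logit, match_label,
                        left_logits, left_entity,
                        right_logits, right_entity):
    """Binary matching loss plus one entity-ID loss per description.

    ASSUMPTION: the three terms are summed with equal weight; the source
    does not specify the weighting.
    """
    return (binary_cross_entropy(match_logit, match_label)
            + softmax_cross_entropy(left_logits, left_entity)
            + softmax_cross_entropy(right_logits, right_entity))
```

With uninformative logits, for example `dual_objective_loss(0.0, 1, [0.0, 0.0], 0, [0.0, 0.0], 1)`, each of the three terms equals log 2, which makes the contribution of each objective easy to inspect.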


“…Identifying offers for the same product is one of the central challenges for ecommerce applications such as price comparison portals and electronic market places. Training Transformer-based matchers using offers from different e-shops that share the same product identifier has proven to be a successful solution for product matching reaching F1 scores above 0.9 in many cases [4,5,3]. The bottleneck of this approach is that it requires a decent amount of pairs of offers for the products to be matched as training data.…”

Section: Introduction (mentioning)

confidence: 99%

“…The bottleneck of this approach is that it requires a decent amount of pairs of offers for the products to be matched as training data. For widely-used languages such as English, the required training data can be extracted from large web crawls by relying on schema.org annotations which identify product titles, product descriptions, and product identifiers such as GTIN or MPN numbers within web pages [5]. For less widely used languages and less commonly sold products, it can be hard to find enough offers in the respective target language on the Web.…”

Section: Introduction (mentioning)

confidence: 99%

Cross-Language Learning for Entity Matching

Peeters

Bizer

2021

Preprint

Self Cite


Transformer-based matching methods have significantly moved the state-of-the-art for less-structured matching tasks involving textual entity descriptions. In order to excel on these tasks, Transformer-based matching methods require a decent amount of training pairs. Providing enough training data can be challenging, especially if a matcher for non-English entity descriptions should be learned. This paper explores along the use case of matching product offers from different e-shops to which extent it is possible to improve the performance of Transformer-based entity matchers by complementing a small set of training pairs in the target language, German in our case, with a larger set of English-language training pairs. Our experiments using different Transformers show that extending the German set with English pairs is always beneficial. The impact of adding the English pairs is especially high in low-resource settings in which only a rather small number of non-English pairs is available. As it is often possible to automatically gather English training pairs from the Web by using schema.org annotations, our results could prove relevant for many product matching scenarios targeting low-resource languages.
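As a rough illustration of how training pairs might be gathered from offers that carry shared product identifiers, the sketch below groups offers by identifier and labels pairs within a group as matches and pairs across groups as non-matches. This is a simplification under stated assumptions, not the pipeline used by the authors: the function name, the `(offer_id, identifier)` input shape, and the exhaustive generation of negatives are all hypothetical, and a real pipeline would likely sample or filter negatives.

```python
from collections import defaultdict
from itertools import combinations


def build_training_pairs(offers):
    """Turn (offer_id, identifier) tuples into labeled (a, b, label) pairs.

    label 1: both offers share the same product identifier (e.g. a GTIN).
    label 0: the offers carry different identifiers.
    ASSUMPTION: every cross-group pair is emitted as a negative.
    """
    clusters = defaultdict(list)
    for offer_id, identifier in offers:
        clusters[identifier].append(offer_id)

    pairs = []
    # Positive pairs: all combinations of offers within one identifier group.
    for members in clusters.values():
        pairs.extend((a, b, 1) for a, b in combinations(members, 2))
    # Negative pairs: offers drawn from two different identifier groups.
    for (_, first), (_, second) in combinations(clusters.items(), 2):
        pairs.extend((a, b, 0) for a in first for b in second)
    return pairs
```

Called on three hypothetical offers, two of which share an identifier, the function yields one positive pair and two negative pairs.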
