Companion Proceedings of the 2019 World Wide Web Conference 2019
DOI: 10.1145/3308560.3316609

The WDC Training Dataset and Gold Standard for Large-Scale Product Matching

Cited by 56 publications (36 citation statements)
References 14 publications
“…We compare the performance of JointBERT to the performance of BERT, RoBERTa, Ditto, Deepmatcher, Magellan, and a word cooccurrence baseline using five entity matching benchmark datasets. Two of these datasets, WDC LSPC [27] and DI2KG monitors [8], model multi-source settings and fulfil the requirement from the problem statement in Section 3 that multiple entity descriptions should be available for many of the described entities. The other three datasets, abt-buy, dblp-scholar and company [23], do not fulfill this requirement and are included in order to evaluate the performance of JointBERT in traditional two-source settings.…”
Section: Methods (mentioning)
confidence: 99%
“…Below, we describe the datasets. [27] for the evaluation. The datasets were built by extracting product offers from the Common Crawl.…”
Section: Datasets (mentioning)
confidence: 99%
“…Identifying offers for the same product is one of the central challenges for ecommerce applications such as price comparison portals and electronic market places. Training Transformer-based matchers using offers from different e-shops that share the same product identifier has proven to be a successful solution for product matching reaching F1 scores above 0.9 in many cases [4,5,3]. The bottleneck of this approach is that it requires a decent amount of pairs of offers for the products to be matched as training data.…”
Section: Introduction (mentioning)
confidence: 99%
“…The bottleneck of this approach is that it requires a decent amount of pairs of offers for the products to be matched as training data. For widely-used languages such as English, the required training data can be extracted from large web crawls by relying on schema.org annotations which identify product titles, product descriptions, and product identifiers such as GTIN or MPN numbers within web pages [5]. For less widely used languages and less commonly sold products, it can be hard to find enough offers in the respective target language on the Web.…”
Section: Introduction (mentioning)
confidence: 99%
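
The citation statements above describe deriving matcher training data from offers that share a product identifier such as a GTIN or MPN. The following is a minimal sketch of that idea, not the procedure used by the cited papers or by the WDC dataset itself. The `build_pairs` function and the `identifier` and `title` field names are hypothetical, and the negative-pair sampling (one pair per two clusters) is a simplifying assumption.

```python
from collections import defaultdict
from itertools import combinations


def build_pairs(offers):
    """Group offers by product identifier and emit labeled title pairs.

    `offers` is a list of dicts with hypothetical keys "identifier"
    (e.g. a GTIN or MPN string, or None) and "title".
    """
    clusters = defaultdict(list)
    for offer in offers:
        # Offers without an identifier cannot be assigned to a cluster.
        if offer.get("identifier"):
            clusters[offer["identifier"]].append(offer["title"])

    # Positive pairs (label 1): two offers sharing the same identifier.
    positives = [
        (a, b, 1)
        for titles in clusters.values()
        for a, b in combinations(titles, 2)
    ]
    # Negative pairs (label 0): the first offers of two different clusters.
    # Real pipelines would sample negatives more carefully (assumption).
    negatives = [
        (clusters[x][0], clusters[y][0], 0)
        for x, y in combinations(sorted(clusters), 2)
    ]
    return positives + negatives
```

Offers from different shops that carry the same identifier yield positive pairs without manual labeling, which is why the statements above name the availability of such offers as the bottleneck for less common languages and products.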