Matching product titles using web-based enrichment

Gopalakrishnan, Vishrawas; Iyengar, Suresh; Madaan, Amit; Rastogi, Rajeev; Sengamedu, Srinivasan H.

doi:10.1145/2396761.2396839

Cited by 35 publications

(40 citation statements)

References 27 publications

“…However, these basic similarity functions cannot consistently perform well across the remaining three categories. Variations of the basic similarity approach, such as "Term Frequency × Inverse Document Frequency" (TF-IDF) [22] that has a global scope and stable ranking mechanism, can handle Category 3 fairly well, but fail in Category 4 for the reasons presented in [16].…”

Section: Unmatched Titles With High Degree Of Token Overlap (mentioning)

confidence: 99%

“…There have been some recent works such as [8] and [16] (EN+IMP) that try to overcome the above limitations of these basic approaches. Although these fare better than most standard approaches -see Category 4, their efficacy is limited.…”

Section: Unmatched Titles With High Degree Of Token Overlap (mentioning)

confidence: 99%

“…Although these fare better than most standard approaches -see Category 4, their efficacy is limited. This is largely due to their dependency on reducing the comparison to exact matches [8] or using a similarity measure that makes use of weights that are computed in a way which is agnostic to the compared title [16]. Furthermore, both these methods, as well as TF-IDF, fail in understanding the semantic relationships -see TF-IDF and EN+IMP for Category 2 in Table 1.…”

Section: Unmatched Titles With High Degree Of Token Overlap (mentioning)

confidence: 99%

“…The essence of this problem has now blended itself in a variety of modern settings in the Internet-age, including aligning multilingual texts for machine translation, IP aliasing in networks [25], reconciling photographs with correct tags [21], or matching product titles across different online merchants [8,16,18]. The diversity in the target domains has led to a varied set of solutions tailored for individual domains.…”

Section: Introduction (mentioning)

confidence: 99%

“…For example the product "d200 10.2 megapixel digital slr camera body with lens kit -18mm -135mm (2.5"lcd -7.5x optical zoom -3872×2592 image)" from PriceGrabber.com is represented simply as "nikon d200" at CNET.com [16]. This problem is very common, in part due to the general lack of standardisation in how a product is published, curated and managed [17].…”

Section: Introduction (mentioning)

confidence: 99%

Matching titles with cross title web-search enrichment and community detection

et al. 2014

Self Cite

Title matching refers roughly to the following problem. We are given two strings of text obtained from different data sources. The texts refer to some underlying physical entities and the problem is to report whether the two strings refer to the same physical entity or not. There are manifestations of this problem in a variety of domains, such as product or bibliography matching, and location or person disambiguation. We propose a new approach to solving this problem, consisting of two main components. The first component uses Web searches to "enrich" the given pair of titles: making titles that refer to the same physical entity more similar, and those which do not, much less similar. A notion of similarity is then measured using the second component, where the tokens from the two titles are modelled as vertices of a "social" network graph. A "strength of ties" style of clustering algorithm is then applied on this to see whether they form one cohesive "community" (matching titles), or separately clustered communities (mismatching titles). Experimental results confirm the effectiveness of our approach over existing title matching methods across several input domains.
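The two components described in the abstract can be illustrated with a rough sketch. This is not the authors' algorithm. It assumes the enrichment step has already produced a list of text snippets (here passed in as plain strings instead of coming from a real Web search), it weights token pairs by snippet co-occurrence, and it replaces the "strength of ties" clustering with a much simpler stand-in: connected components after weak edges are dropped. The function names, the `min_weight` threshold, and the snippet data are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification)."""
    return text.lower().split()


def build_graph(title_a, title_b, snippets):
    """Tokens of both titles are vertices; an edge weight counts the
    snippets in which both of its tokens appear."""
    vertices = set(tokenize(title_a)) | set(tokenize(title_b))
    weights = defaultdict(int)
    for snippet in snippets:
        present = sorted(vertices & set(tokenize(snippet)))
        for u, v in combinations(present, 2):
            weights[(u, v)] += 1
    return vertices, weights


def communities(vertices, weights, min_weight=2):
    """Stand-in for community detection: connected components of the
    graph restricted to edges with weight >= min_weight."""
    adjacency = {v: set() for v in vertices}
    for (u, v), w in weights.items():
        if w >= min_weight:
            adjacency[u].add(v)
            adjacency[v].add(u)
    seen, groups = set(), []
    for start in vertices:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(group)
    return groups


def titles_match(title_a, title_b, snippets, min_weight=2):
    """Report a match when all tokens fall into one community."""
    vertices, weights = build_graph(title_a, title_b, snippets)
    return len(communities(vertices, weights, min_weight)) == 1
```

With made-up snippets such as `["nikon d200 digital slr camera review", "nikon d200 digital slr camera body"]`, the titles "nikon d200" and "d200 digital slr camera" end up in a single community, whereas "nikon d200" and "canon eos" stay in separate communities when no snippet mentions both products.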
