Set similarity joins on MapReduce

Fier, Fabian; Augsten, Nikolaus; Bouros, Panagiotis; Leser, Ulf; Freytag, Johann-Christoph

doi:10.14778/3231751.3231760

Cited by 42 publications

(28 citation statements)

References 32 publications


“…These parallel algorithms for set similarity joins are compared in a recent experimental study [46]. VernicaJoin achieved the best performance in most experiments, followed by MRGroupJoin, FS-Join and MGJoin.…”

Section: Distributed Algorithms (mentioning)

confidence: 99%

“…Still, as presented in Section 5, relatively few of the existing similarity join techniques are designed for these cases. Moreover, as shown in the experimental survey of [46], scalability remains an open challenge for string and set similarity joins. Finally, there is a need for extensible, open-source ER tools that incorporate the majority of established Blocking and Filtering methods and apply seamlessly to structured, semi-structured and unstructured data [52].…”

Section: Future Directions (mentioning)

confidence: 99%

“…Recent surveys on string and set similarity joins also exist but have a much more limited scope. They focus exclusively on either centralized [60,84,167] or distributed approaches [46], with the purpose of experimental comparison, and without covering approximate techniques or methods that allow for more relaxed matching criteria. Also, none of these surveys considers similarity joins in the broader context of ER.…”

Section: Introduction (mentioning)

confidence: 99%


Blocking and Filtering Techniques for Entity Resolution

et al. 2020


Entity Resolution (ER), a core task of Data Integration, detects different entity profiles that correspond to the same real-world object. Due to its inherently quadratic complexity, a series of techniques accelerate it so that it scales to voluminous data. In this survey, we review a large number of relevant works under two different but related frameworks: Blocking and Filtering. The former restricts comparisons to entity pairs that are more likely to match, while the latter identifies quickly entity pairs that are likely to satisfy predetermined similarity thresholds. We also elaborate on hybrid approaches that combine different characteristics. For each framework we provide a comprehensive list of the relevant works, discussing them in the greater context. We conclude with the most promising directions for future work in the field.


“…For the scalability experiments, we generate datasets with scale factors: 1×, 2×, 3×, 4× and 5× the number of rows in the original dataset. We follow the methodology similar to that used in Vernica et al [39] and Fier et al [19], which preserves the original set of distinct tokens, their distribution and record lengths; but increases the number of records by replacing a token with a neighboring token in the sorted token frequency order.…”

Section: Datasets (mentioning)

confidence: 99%
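The scaling methodology quoted above can be illustrated with a minimal sketch. It assumes records are lists of distinct tokens and that "neighboring token" means the token a fixed number of positions further along the frequency-sorted token list, wrapping around at the end. The function name `scale_dataset`, the tie-breaking rule, and the wraparound are assumptions made for illustration; the exact procedure used by Vernica et al. and Fier et al. is not given in the quoted text.

```python
from collections import Counter


def scale_dataset(records, factor):
    """Return the original records plus (factor - 1) token-shifted copies.

    Hypothetical sketch: each copy replaces every token with the token
    `shift` positions later in the frequency-sorted token order.
    """
    # Count how often each token occurs across all records.
    freq = Counter(tok for rec in records for tok in rec)
    # Sort tokens by ascending frequency; ties are broken by the token
    # itself (an assumption, so the order is deterministic).
    order = sorted(freq, key=lambda t: (freq[t], t))
    rank = {tok: i for i, tok in enumerate(order)}
    n = len(order)

    scaled = [list(rec) for rec in records]
    for shift in range(1, factor):
        for rec in records:
            # Shifting by a constant modulo n is a bijection on the token
            # set, so record lengths and the distinct-token set are kept.
            scaled.append([order[(rank[tok] + shift) % n] for tok in rec])
    return scaled


records = [["a", "b"], ["a", "c"], ["a"]]
scaled = scale_dataset(records, 2)
```

With `factor=2` the output has twice as many records as the input, every copied record has the same length as its source record, and no token outside the original vocabulary is introduced. The token frequency distribution is only approximately preserved, because a frequent token may be mapped to a rarer neighbor.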

“…Therefore, scale-out approaches including [18,39,8,33,27,31,13,16,17,30] have been developed on MapReduce [15] engines such as Hadoop. A recent experimental study by Fier et al [19] compared several scale-out techniques. Based on these results, the state-of-the-art scale-out fuzzy join techniques are lacking in two fundamental ways.…”

Section: Introduction (mentioning)

confidence: 99%

Customizable and scalable fuzzy join for big data

et al. 2019


Fuzzy join is an important primitive for data cleaning. The ability to customize fuzzy join is crucial to allow applications to address domain-specific data quality issues such as synonyms and abbreviations. While efficient indexing techniques exist for single-node implementations of customizable fuzzy join, the state-of-the-art scale-out techniques do not support customization, and exhibit poor performance and scalability characteristics. We describe the design of a scale-out fuzzy join operator that supports customization. We use a locality-sensitive-hashing (LSH) based signature scheme, and introduce optimizations that result in significant speed up with negligible impact on recall. We evaluate our implementation on the Azure Databricks version of Spark using several real-world and synthetic data sets. We observe speedups exceeding 50X compared to the best-known prior scale-out technique, and close to linear scalability with data size and number of nodes.


Query Driven Entity Resolution in Data Lakes

Alexiou, Papastefanatos 2020

Communications in Computer and Information Science
