2018
DOI: 10.14778/3231751.3231760

Set similarity joins on MapReduce

Abstract: Set similarity joins, which compute pairs of similar sets, constitute an important operator primitive in a variety of applications, including applications that must process large amounts of data. To handle these data volumes, several distributed set similarity join algorithms have been proposed. Unfortunately, little is known about the relative performance, strengths and weaknesses of these techniques. Previous comparisons are limited to a small subset of relevant algorithms, and the large differences in the v…
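To make the operator concrete, here is a minimal single-machine sketch of a set similarity join. It assumes Jaccard similarity and a fixed threshold, neither of which the abstract specifies, and it compares every pair naively; the function names and sample records are hypothetical. It is not one of the distributed algorithms the paper evaluates.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (assumed measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def naive_set_similarity_join(records, threshold):
    """Return all id pairs whose sets reach the similarity threshold.

    Compares every pair, so the cost is quadratic in the number of records.
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(records.items(), 2)
        if jaccard(a, b) >= threshold
    ]


# Hypothetical sample input: record id -> set of tokens.
records = {
    "r1": {"a", "b", "c", "d"},
    "r2": {"a", "b", "c", "e"},
    "r3": {"x", "y"},
}
pairs = naive_set_similarity_join(records, 0.5)
```

The quadratic pairwise comparison is what becomes infeasible on large inputs and motivates distributing the join.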

Cited by 42 publications (28 citation statements)
References 32 publications
“…These parallel algorithms for set similarity joins are compared in a recent experimental study [46]. VernicaJoin achieved the best performance in most experiments, followed by MRGroupJoin, FS-Join and MGJoin.…”
Section: Distributed Algorithms (mentioning)
confidence: 99%
“…Still, as presented in Section 5, relatively few of the existing similarity join techniques are designed for these cases. Moreover, as shown in the experimental survey of [46], scalability remains an open challenge for string and set similarity joins. Finally, there is a need for extensible, open-source ER tools that incorporate the majority of established Blocking and Filtering methods and apply seamlessly to structured, semi-structured and unstructured data [52].…”
Section: Future Directions (mentioning)
confidence: 99%
“…For the scalability experiments, we generate datasets with scale factors: 1×, 2×, 3×, 4× and 5× the number of rows in the original dataset. We follow a methodology similar to that used in Vernica et al. [39] and Fier et al. [19], which preserves the original set of distinct tokens, their distribution and record lengths, but increases the number of records by replacing a token with a neighboring token in the sorted token frequency order.…”
Section: Datasets (mentioning)
confidence: 99%
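The scaling procedure quoted above can be sketched roughly as follows. This is one possible reading, not the cited authors' code: it assumes every token of a record is shifted by the same offset in the frequency order, that the order wraps around at the end of the vocabulary, and that ties in frequency are broken by token value. The function name and sample records are hypothetical.

```python
from collections import Counter


def scale_dataset(records, factor):
    """Return `factor` times as many records as the input.

    Assumption: copy number k replaces each token with the token k
    positions further along the frequency-sorted vocabulary, wrapping
    around. The shift is a bijection on the vocabulary, so the set of
    distinct tokens and each record's length are preserved.
    """
    freq = Counter(tok for rec in records for tok in rec)
    # Ascending frequency; ties broken by token value for determinism.
    order = sorted(freq, key=lambda t: (freq[t], t))
    pos = {tok: i for i, tok in enumerate(order)}
    n = len(order)

    scaled = [set(rec) for rec in records]  # 1x: the original records
    for shift in range(1, factor):
        for rec in records:
            scaled.append({order[(pos[t] + shift) % n] for t in rec})
    return scaled


# Hypothetical sample input: three records over the tokens a, b, c.
original = [{"a", "b"}, {"b", "c"}, {"c"}]
scaled = scale_dataset(original, 3)
```

Because neighboring tokens in the frequency order have similar frequencies, a shift by a small offset keeps the token distribution close to the original, which is the property the quoted methodology relies on.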
“…Therefore, scale-out approaches including [18,39,8,33,27,31,13,16,17,30] have been developed on MapReduce [15] engines such as Hadoop. A recent experimental study by Fier et al. [19] compared several scale-out techniques. Based on these results, the state-of-the-art scale-out fuzzy join techniques are lacking in two fundamental ways.…”
Section: Introduction (mentioning)
confidence: 99%