2012 IEEE 28th International Conference on Data Engineering 2012
DOI: 10.1109/icde.2012.66
Fuzzy Joins Using MapReduce

Abstract: Fuzzy/similarity joins have been widely studied in the research community and extensively used in real-world applications. This paper proposes and evaluates several algorithms for finding all pairs of elements from an input set that meet a similarity threshold. The computation model is a single MapReduce job. Because we allow only one MapReduce round, the Reduce function must be designed so a given output pair is produced by only one task; for many algorithms, satisfying this condition is one of the b…

Cited by 82 publications (82 citation statements)
References: 20 publications
“…That is, the replication rate is 1, independent of the limit q on reducer size. 3 Since the replication rate is identically 1, there is no tradeoff at all between q and replication rate; i.e., the word-count problem is embarrassingly parallel, as we knew all along.…”
Section: Mapping Schemas (mentioning)
confidence: 99%
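
The claim in the statement above can be checked on a toy input. The following is a minimal sketch, assuming that the inputs are individual word occurrences and that the replication rate is the average number of key-value pairs the mappers emit per input; the function names `map_word_count` and `replication_rate` are hypothetical and do not come from the cited papers.

```python
def map_word_count(occurrences):
    """Emit one key-value pair per word occurrence, keyed by the word.

    Each occurrence is therefore sent to exactly one reducer
    (the reducer responsible for that word).
    """
    for index, word in enumerate(occurrences):
        yield word, index


def replication_rate(num_inputs, assignments):
    """Average number of key-value pairs emitted per input (assumed definition)."""
    return len(assignments) / num_inputs


occurrences = "to be or not to be".split()
assignments = list(map_word_count(occurrences))
rate = replication_rate(len(occurrences), assignments)
print(rate)
```

Because every occurrence is emitted exactly once, the computed rate does not depend on how many occurrences a single reducer is allowed to receive, which is the sense in which there is no tradeoff with q.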
“…But Theorem 3.2 says that r must be at least $b/\log_2(2^b) = 1$. In [3], there is an algorithm called Splitting that, for the case of Hamming distance 1, uses $2^{1+b/2}$ reducers, for some even b. Half of these reducers, or $2^{b/2}$ reducers, correspond to the $2^{b/2}$ possible bit strings that may be the first half of an input string.…”
Section: Upper Bound For Hamming Distance (mentioning)
confidence: 99%
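
The Splitting idea quoted above can be simulated in a few lines. This is a sketch under stated assumptions, not the implementation from [3]: it assumes that the other half of the reducers correspond to the possible second halves of an input string (the quoted passage is cut off before saying so), that the input strings are distinct, and that b is even. The names `map_splitting` and `splitting_join` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations, product


def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def map_splitting(s):
    """Send a string to two reducers: one per half (assumed scheme)."""
    half = len(s) // 2
    yield ("first", s[:half]), s
    yield ("second", s[half:]), s


def splitting_join(strings):
    """Simulate one MapReduce round; return the output pairs and reducer groups."""
    reducers = defaultdict(list)
    for s in strings:
        for key, value in map_splitting(s):
            reducers[key].append(value)
    pairs = []
    for group in reducers.values():
        # Two strings at distance 1 agree on exactly one half, so only
        # the reducer for that shared half sees both and emits the pair.
        for u, v in combinations(sorted(group), 2):
            if hamming(u, v) == 1:
                pairs.append((u, v))
    return pairs, reducers


b = 4  # string length, assumed even
all_strings = ["".join(bits) for bits in product("01", repeat=b)]
pairs, reducers = splitting_join(all_strings)
print(len(reducers), len(pairs))
```

With all 2^b strings present, each string is emitted twice, so the replication rate in this sketch is 2, and each reducer receives the 2^{b/2} strings that share its half.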
“…The similarity self-join problem has been studied extensively. It was introduced by Chaudhuri et al [10], and many works have followed up [4,5,7,10,27,28], including approaches that use parallel computation [1,2,6,13,24]. Discussing all these related papers is out of the scope of this work.…”
Section: Related Work (mentioning)
confidence: 99%
“…In [4] fuzzy joins are studied. Various algorithms that rely on a single MapReduce job are proposed, which compute all pairs of records with similarity above a certain user-specified threshold, and produce the exact result.…”
Section: Similarity Join (mentioning)
confidence: 99%