Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010
DOI: 10.1145/1807167.1807222

Efficient parallel set-similarity joins using MapReduce

Cited by 390 publications (327 citation statements)
References 18 publications
“…Exact set similarity join has been extensively studied in the literature [18,19,20,21,28,31,32,33,34,35,40,44,45]. As we have introduced in Section 3, existing solutions all follow the filtering-verification framework and can be divided into two categories based on the filtering mechanism, namely, prefix-filter based algorithms and partition-filter based algorithms.…”
Section: Related Work (mentioning)
confidence: 99%
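The filtering-verification framework mentioned in the statement above can be illustrated with a small single-machine sketch of a prefix-filter join under Jaccard similarity. This is a generic illustration of the prefix-filter idea, not the MapReduce algorithm of the cited paper, and the names `prefix_filter_join`, `jaccard`, `records`, and `threshold` are hypothetical. It assumes string tokens and a threshold in (0, 1].

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def prefix_filter_join(records, threshold):
    """Self-join: return (id1, id2, similarity) for pairs with Jaccard >= threshold.

    records: dict mapping a record id to a set of string tokens (assumed layout).
    """
    # Global token order: rare tokens first, so prefixes hold selective tokens.
    freq = Counter(tok for toks in records.values() for tok in toks)
    index = defaultdict(list)  # prefix token -> ids of records already indexed
    results = []
    for rid in sorted(records):
        tokens = sorted(records[rid], key=lambda t: (freq[t], t))
        n = len(tokens)
        if n == 0:
            continue
        # Two sets with Jaccard >= threshold must share a token within their
        # first n - ceil(threshold * n) + 1 tokens under the global order.
        # The small epsilon guards against floating-point overshoot in ceil.
        prefix_len = n - math.ceil(threshold * n - 1e-9) + 1
        # Filtering step: collect earlier records sharing a prefix token.
        candidates = set()
        for tok in tokens[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(rid)
        # Verification step: compute the exact similarity for each candidate.
        for other in sorted(candidates):
            sim = jaccard(records[rid], records[other])
            if sim >= threshold:
                results.append((other, rid, sim))
    return results
```

Calling `prefix_filter_join({1: {"a", "b", "c", "d"}, 2: {"a", "b", "c", "e"}, 3: {"x", "y", "z"}}, 0.5)` verifies only the pair (1, 2), since record 3 shares no prefix token with the others and is never compared.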
“…Other work focus on processing exact set similarity join in a distributed environment using MapReduce [33,35,40], which is not the focus of this paper.…”
Section: Related Work (mentioning)
confidence: 99%
“…This work did not consider the specifics of MR and focused on parallel matching while blocking was not performed in parallel. In addition to our own approaches utilized in Dedoop, there are a few further proposals to employ MR for ER (e.g., [24,25]). These approaches do not support advanced features such as load balancing or redundancy-free multi-pass blocking.…”
Section: Related Work (mentioning)
confidence: 99%
“…We therefore vary the cluster size for entity resolution from 1 up to 100 nodes. We use the CiteseerX dataset from [24] containing 1,385,532 publication records. To test Dedoop's load balancing, we apply both Sorted Neighborhood (SN) in combination with the RepSN algorithm [14] as well as Standard Blocking (SB) in combination with the PairRange algorithm [13].…”
Section: Scalability (mentioning)
confidence: 99%
“…Many real-world tasks are expressible in this model. Programmers find the system easy to use and hundreds of MapReduce programs have been implemented; and around one thousand MapReduce jobs are executed on Google clusters every day [3,10,23]. For better understanding and simplicity, we tried to keep the MapReduce explanation figures as simple as possible.…”
Section: Introduction (mentioning)
confidence: 99%