2011
DOI: 10.1007/s00450-011-0177-x

Multi-pass sorted neighborhood blocking with MapReduce

Abstract: Cloud infrastructures enable the efficient parallel execution of data-intensive tasks such as entity resolution on large datasets. We investigate challenges and possible solutions of using the MapReduce programming model for parallel entity resolution using Sorted Neighborhood blocking (SN). We propose and evaluate two efficient MapReduce-based implementations for single- and multi-pass SN that either use multiple MapReduce jobs or apply a tailored data replication. We also propose an automatic data partitionin…

Cited by 76 publications (55 citation statements)
References 15 publications
“…Using the most promising approaches from PPRL blocking, massive parallel or GPU implementations should be easy by comparison, since the results of blocks are in many applications independent. Although, parallel implementations of sorted neighborhood (Kolb et al 2012) and canopy clustering (https://mahout.apache.org) have been published, practical parallel implementations of PPRL blocking for secure environments are currently not readily available.…”
Section: Future Research (mentioning)
confidence: 99%
“…It, thus, only compares entities within a window of a predetermined size w. A MR-based implementation of SN must ensure that reduce tasks can evaluate the entities in sort order and that the windows of neighboring entities are available despite the need to distribute entities among different reduce tasks. Dedoop uses the RepSN algorithm of [14] for this purpose. The map function determines the blocking key for each input entity and applies a specific range partitioning function to redistribute entities among reduce tasks so that the sort order according to the blocking key is preserved.…”
Section: MapReduce Jobs for Entity Resolution with Dedoop (mentioning)
confidence: 99%
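
The idea in the statement above (range partitioning by blocking key, plus making the tail of one partition visible to the next so that windows can cross partition boundaries) can be illustrated with a small single-process Python sketch. This is not the RepSN implementation from the paper or from Dedoop: the function names `sn_pairs` and `partitioned_sn`, the `boundaries` argument, and the sequential loop that stands in for map and reduce tasks are all assumptions made for illustration.

```python
from bisect import bisect_right


def sn_pairs(sorted_entities, w):
    """Sorted neighborhood over (key, entity) tuples already in key order.

    Each entity is paired with the next w - 1 entities in the list.
    """
    pairs = set()
    for i, (_, a) in enumerate(sorted_entities):
        for _, b in sorted_entities[i + 1 : i + w]:
            pairs.add((a, b))
    return pairs


def partitioned_sn(entities, blocking_key, boundaries, w):
    """Hypothetical simulation of partitioned SN with boundary replication.

    `boundaries` is a sorted list of split keys defining the key ranges
    (an assumption of this sketch; how the ranges are chosen is not
    specified here).
    """
    # "Map" step: compute the blocking key and range-partition by it,
    # so that partition i only holds keys smaller than partition i + 1.
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for entity in entities:
        key = blocking_key(entity)
        partitions[bisect_right(boundaries, key)].append((key, entity))

    # "Reduce" step: sort each partition and slide the window over it.
    # `carry` holds the last w - 1 entities seen so far, which stands in
    # for the replication of boundary entities to the next reduce task.
    pairs = set()
    carry = []
    for part in partitions:
        part.sort(key=lambda ke: ke[0])
        window_input = carry + part
        pairs |= sn_pairs(window_input, w)
        carry = window_input[-(w - 1):] if w > 1 else []
    return pairs
```

With a window of size 3 and split keys `["c", "e"]`, calling `partitioned_sn(names, lambda s: s[0], ["c", "e"], 3)` on a list of names yields the same candidate pairs as sorting all names once and sliding the window over the full list, including pairs that straddle a partition boundary.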
“…All approaches utilize the BDM information determined by the analysis job. We only sketch the main ideas of the approaches in the following and refer to [14,13] for more detailed descriptions. Figure 4 illustrates the BlockSplit load balancing approach for the example from Figure 1.…”
Section: Dedoop's Load Balancing Mechanism (mentioning)
confidence: 99%