27th International Conference on Distributed Computing Systems (ICDCS '07) 2007
DOI: 10.1109/icdcs.2007.96

D-Swoosh: A Family of Algorithms for Generic, Distributed Entity Resolution

Abstract: Entity Resolution (ER) matches and merges records that refer to the same real-world entities, and is typically a compute-intensive process due to complex matching functions and high data volumes. We present a family of algorithms, D-Swoosh, for distributing the ER workload across multiple processors. The algorithms use generic match and merge functions, and ensure that new merged records are distributed to processors that may have matching records. We perform a detailed performance evaluation on a testbed of 1…

Cited by 26 publications (23 citation statements)
References 22 publications (28 reference statements)
“…Though our server had multiple processors, we did not exploit parallelism. This is a topic we are addressing in a separate paper [7].…”
Section: Experimental Setting (mentioning)
confidence: 96%
“…It is also possible to distribute the ER computations across multiple processors, in order to handle larger data sets. In [7] we study various strategies for distributing the work done by R-Swoosh.…”
Section: Scalability (mentioning)
confidence: 99%
“…However, for every tuple pair, computing the satisfied set of predicates is independent of each other. In our implementation we use the Grid Scheme strategy, a standard approach to scale in entity resolution [4]. We partition the data into B blocks, and define each task as a comparison of tuples from two blocks.…”
Section: Proof (mentioning)
confidence: 99%
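The block-pair tasking described in the statement above can be sketched as follows. This is a minimal illustration, not the citing paper's implementation: the round-robin partitioning rule, the inclusion of same-block tasks, and the names `partition`, `grid_tasks`, and `pairs_for_task` are all assumptions made for the example.

```python
from itertools import combinations, combinations_with_replacement, product


def partition(tuples, num_blocks):
    """Split tuples into num_blocks blocks round-robin (assumed partitioning rule)."""
    return [tuples[i::num_blocks] for i in range(num_blocks)]


def grid_tasks(num_blocks):
    """Enumerate tasks as block pairs (i, j) with i <= j.

    Including i == j (an assumption) lets tuples inside one block be compared too,
    so every unordered tuple pair falls into exactly one task.
    """
    return list(combinations_with_replacement(range(num_blocks), 2))


def pairs_for_task(blocks, task):
    """Return the tuple pairs that a single task has to compare."""
    i, j = task
    if i == j:
        # Same block: each unordered pair once.
        return list(combinations(blocks[i], 2))
    # Two different blocks: full cross product.
    return list(product(blocks[i], blocks[j]))


if __name__ == "__main__":
    blocks = partition(list(range(7)), 3)
    for task in grid_tasks(3):
        print(task, pairs_for_task(blocks, task))
```

Because the tasks share no state, each one could be handed to a separate worker, which is what makes the per-pair independence noted in the quoted text useful.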
“…The model and the algorithms are extended in [32] for handling approximate results as records with confidences. [1] adapts the algorithms to a distributed environment. Our generic ER algorithms for relational databases were published in [35].…”
Section: Related Work (mentioning)
confidence: 99%
“…If we use multiple features, the unified feature-based match function is defined as a disjunction. For some feature set f 1 The third aspect of a feature is indexability. To ensure the efficient computation of feature-based matches, we expect features and their match functions to be indexable.…”
Section: Features (mentioning)
confidence: 99%
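A unified match function defined as a disjunction of per-feature matches, together with a simple index, might look like the sketch below. It assumes equality-based features on dictionary records; the feature names (`email`, `phone`) and the helpers `same_email`, `same_phone`, `unified_match`, and `build_index` are hypothetical and do not come from the quoted paper.

```python
from collections import defaultdict


def same_email(r1, r2):
    """Hypothetical feature match: both records carry the same email."""
    return r1.get("email") is not None and r1.get("email") == r2.get("email")


def same_phone(r1, r2):
    """Hypothetical feature match: both records carry the same phone number."""
    return r1.get("phone") is not None and r1.get("phone") == r2.get("phone")


def unified_match(r1, r2, feature_matches=(same_email, same_phone)):
    """Disjunction: the records match if any single feature match holds."""
    return any(match(r1, r2) for match in feature_matches)


def build_index(records, key):
    """Map each feature value to the positions of the records holding it.

    With an equality-based feature, candidates for a record can be looked up
    by value instead of comparing the record against every other record.
    """
    index = defaultdict(list)
    for pos, rec in enumerate(records):
        value = rec.get(key)
        if value is not None:
            index[value].append(pos)
    return index


if __name__ == "__main__":
    records = [
        {"email": "a@x.org", "phone": "1"},
        {"email": "b@x.org", "phone": "1"},
        {"email": "c@x.org", "phone": "2"},
    ]
    print(unified_match(records[0], records[1]))
    print(dict(build_index(records, "phone")))
```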