2018
DOI: 10.1007/s11227-018-2328-3

Exploring hybrid parallel systems for probabilistic record linkage

Abstract: Record linkage is a technique widely used to gather data stored in disparate data sources that presumably pertain to the same real world entity. This integration can be done deterministically or probabilistically, depending on the existence of common key attributes among all data sources involved. The probabilistic approach is very time consuming due to the amount of records that must be compared, specifically in big data scenarios. In this paper, we propose and evaluate a methodology that simultaneously explo…

Cited by 6 publications (4 citation statements)
References 14 publications (13 reference statements)

“…To circumvent scalability challenges over big data sets, different approaches have been used in the literature, such as parallelism/distribution and blocking (or indexing) strategies, as well as their combinations (Christen, 2008; Pita et al, 2018). Other initiatives have also proposed the use of cluster-based platforms, multi-processors or graphics processing units (GPUs) (Boratto et al, 2018; Pita et al, 2018). Blocking and indexing step generates pairs of candidate records pertaining to the same comparison blocks (Christen, 2012).…”
Section: Data Linkage (mentioning)
confidence: 99%
“…AtyImo, in comparison to previous linkage tools freely available, has reasonably better accuracy and shorter execution time with a major advantage to scale upward to huge databases (Pita et al, 2018). The current version of AtyImo based on the NVIDIA’s CUDA library is able to probabilistically link databases of up to 80 million records in around 60 s over multiple GPU architectures (Boratto et al, 2018).…”
Section: Record Linkage Tools Developed And/or Used In Brazil (mentioning)
confidence: 99%
“…Bloom filters [30,31], which transform bigrams from the linkage key attributes into a binary vector, are used for similarity calculation (matching). AtyImo has proven to be quite effective, providing 93% to 97% of accuracy (true positive rate) depending on the databases being linked [32]. CIDACS-RL, another linkage tool designed over Apache Lucene, uses a novel approach based on an indexing search and sorting algorithm to perform information retrieval.…”
Section: Data Linkage (mentioning)
confidence: 99%
“…Despite its potential for significant improvements in runtime performance, there has not been any further work published on P4Join using larger data sets or on clusters of GPU nodes. More recently, Boratto et al [25] evaluated a hybrid algorithm using both GPUs and central processing units (CPUs) with much larger data sets. Although restricted to single (highly specified) machines, these evaluations show promise provided that the approach can be applied within a compute cluster.…”
Section: Introduction (mentioning)
confidence: 99%