19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07) 2007
DOI: 10.1109/sbac-pad.2007.32

A Scalable Parallel Deduplication Algorithm

Cited by 14 publications (4 citation statements)
References 7 publications

“…The group has also developed a parallel deduplication algorithm, called FER-APARDA, using probabilistic record linkage, as well as PAREIA (Santos et al, 2007; dos Santos Filho, 2008). PAREIA’s crucial contributions are two-fold: 1) the proposed blocking scheme uses predicates from fields or portions of them, making a junction of disjunctions to prevent input errors to separate true matches from the right blocks.…”
Section: Data Linkage Initiatives In Brazil (mentioning)
confidence: 99%
“…This reduces the latency of the hash lookup process during duplicate detection. Scalable dedupe [10] partitions the incoming data stream based on the k-least significant bits of their corresponding fingerprints. These partitioned blocks and the fingerprints are mapped to the respective nodes using a DHT.…”
Section: Related Work (mentioning)
confidence: 99%
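
The partitioning described in the statement above can be illustrated with a minimal sketch. It is not the cited paper's implementation: the SHA-1 fingerprint, the function names (`fingerprint`, `partition_id`, `node_for`, `route`), and the modulo lookup that stands in for a real DHT are all assumptions made for illustration.

```python
import hashlib
from typing import Dict, Iterable, Sequence, Set, Tuple


def fingerprint(chunk: bytes) -> int:
    """Return the SHA-1 digest of a chunk as an integer (hash choice is an assumption)."""
    return int.from_bytes(hashlib.sha1(chunk).digest(), "big")


def partition_id(fp: int, k: int) -> int:
    """Keep only the k least-significant bits of the fingerprint."""
    return fp & ((1 << k) - 1)


def node_for(fp: int, k: int, nodes: Sequence[str]) -> str:
    """Hypothetical stand-in for a DHT lookup: map the partition id onto a node list."""
    return nodes[partition_id(fp, k) % len(nodes)]


def route(chunks: Iterable[bytes], k: int, nodes: Sequence[str]) -> Tuple[Dict[str, Set[int]], int]:
    """Send each chunk's fingerprint to its node and count duplicates seen there."""
    placement: Dict[str, Set[int]] = {node: set() for node in nodes}
    duplicates = 0
    for chunk in chunks:
        fp = fingerprint(chunk)
        node = node_for(fp, k, nodes)
        if fp in placement[node]:
            duplicates += 1  # same fingerprint already stored on this node
        else:
            placement[node].add(fp)
    return placement, duplicates
```

Because identical chunks always yield the same fingerprint, they always land on the same node, so each node can detect duplicates against its own local index only.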
“…There are many tools/software available for record linkage (16)(17)(18)(19). We tested some of these alternatives, but due to the large volume of data, we had trouble with computer performance.…”
Section: Software and Hardware (mentioning)
confidence: 99%