2019
DOI: 10.1016/j.is.2019.03.006

Scaling entity resolution: A loosely schema-aware approach

citations
Cited by 18 publications
(19 citation statements)
references
References 41 publications
“…We have not been able to successfully execute PMB on some of the larger sets (VAR50, VAR107, VAR530); it fails with out-of-memory errors and we have been unable to get it to complete. We ran into similar issues when running BLAST [26] on our huge datasets, which we expected given that they broadcast hash maps of record ID → blocking keys to every node. For our large datasets, this single broadcast map would be multiple TBs of memory.…”
Section: Comparing Hashed Dynamic Blocking To Other Methods (mentioning)
confidence: 99%
“…BLAST [26] is a schema-aware meta-blocking approach. One innovation of BLAST is making records that share high entropy attributes like name more likely to be pairwise-compared than low entropy attributes like year of birth.…”
Section: Meta-blocking Based Approaches (mentioning)
confidence: 99%
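To make the entropy remark in the statement above concrete, here is a minimal sketch that computes the Shannon entropy of each attribute's value distribution. The records and the helper name `shannon_entropy` are hypothetical, and this is not BLAST's actual weighting scheme; it only illustrates why an attribute such as name tends to score higher than year of birth.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical toy records, not taken from any cited dataset.
records = [
    {"name": "Ada Lovelace", "birth_year": 1815},
    {"name": "Alan Turing", "birth_year": 1912},
    {"name": "Grace Hopper", "birth_year": 1906},
    {"name": "Kurt Godel", "birth_year": 1906},
]

# Entropy per attribute: more distinct, evenly spread values give higher entropy.
entropy = {
    attr: shannon_entropy([r[attr] for r in records])
    for attr in ("name", "birth_year")
}

for attr, bits in entropy.items():
    print(f"{attr}: {bits:.2f} bits")
```

In this toy data every name is unique while two records share a birth year, so `name` receives the higher entropy. A schema-aware method could use such a score to weight co-occurrences in blocks derived from that attribute, but how BLAST does so is not specified in the statement quoted here.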
“…Both strategies rely on the load balancing algorithm MaxBlock [55] to avoid the underutilization of the available resources. BLAST is parallelized in [162], exploiting the broadcast join of Apache Spark for very high efficiency. Table 2 presents an overview of the Block Processing methods discussed above.…”
Section: Comparison Cleaning (mentioning)
confidence: 99%
“…For Comparison Cleaning, we adopt the approach described in [60]. First, the Entity Index I_E is created in the form of an RDD, mapping each profile id to the ids of the blocks that contain it.…”
Section: Parallel Execution (mentioning)
confidence: 99%
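The Entity Index described in the last statement can be sketched as an inverted mapping from profile ids to block ids. The version below is plain Python rather than a Spark RDD, and the block contents and the function names `build_entity_index` and `shared_blocks` are assumptions made for illustration, not the cited implementation.

```python
from collections import defaultdict


def build_entity_index(blocks):
    """Invert a block collection: profile id -> set of ids of blocks containing it."""
    index = defaultdict(set)
    for block_id, profile_ids in blocks.items():
        for pid in profile_ids:
            index[pid].add(block_id)
    return dict(index)


def shared_blocks(index, a, b):
    """Ids of the blocks that contain both profiles (empty set if none)."""
    return index.get(a, set()) & index.get(b, set())


# Hypothetical block collection: block id -> profile ids placed in that block.
blocks = {
    "b1": ["p1", "p2"],
    "b2": ["p2", "p3"],
    "b3": ["p1", "p2", "p3"],
}

entity_index = build_entity_index(blocks)
print(entity_index["p2"])
print(shared_blocks(entity_index, "p1", "p3"))
```

With such an index, the number of blocks two profiles share can be looked up without rescanning the block collection, which is the kind of co-occurrence information comparison cleaning relies on. In a distributed setting the same mapping would be held as a keyed dataset instead of a local dictionary.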