Scaling entity resolution: A loosely schema-aware approach

Simonini, Giovanni; Gagliardelli, Luca; Bergamaschi, Sonia; Jagadish, H. V.

doi:10.1016/j.is.2019.03.006

Cited by 18 publications

(19 citation statements)

References 41 publications


“…We have not been able to successfully execute PMB on some of the larger sets (VAR50, VAR107, VAR530); it fails with out-of-memory errors and we have been unable to get it to complete. We ran into similar issues when running BLAST [26] on our huge datasets, which we expected given that they broadcast hash maps of record ID → blocking keys to every node. For our large datasets, this single broadcast map would be multiple TBs of memory.…”

Section: Comparing Hashed Dynamic Blocking To Other Methods (mentioning)

confidence: 99%


Scalable Blocking for Very Large Databases

Borthwick, Ash, Pang, et al. 2020

Preprint


In the field of database deduplication, the goal is to find approximately matching records within a database. Blocking is a typical stage in this process that involves cheaply finding candidate pairs of records that are potential matches for further processing. We present here Hashed Dynamic Blocking, a new approach to blocking designed to address datasets larger than those studied in most prior work. Hashed Dynamic Blocking (HDB) extends Dynamic Blocking, which leverages the insight that rare matching values and rare intersections of values are predictive of a matching relationship. We also present a novel use of Locality Sensitive Hashing (LSH) to build blocking key values for huge databases with a convenient configuration to control the trade-off between precision and recall. HDB achieves massive scale by minimizing data movement, using compact block representation, and greedily pruning ineffective candidate blocks using a Count-min Sketch approximate counting data structure. We benchmark the algorithm by focusing on real-world datasets in excess of one million rows, demonstrating that the algorithm displays linear time complexity scaling in this range. Furthermore, we execute HDB on a 530 million row industrial dataset, detecting 68 billion candidate pairs in less than three hours at a cost of $307 on a major cloud service.
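
The Count-min Sketch mentioned in the abstract is a standard approximate counting structure. The following is a minimal, generic sketch of that data structure in Python, not the HDB implementation; the class name, the `width` and `depth` defaults, and the use of SHA-256 as the row hash are all assumptions made for illustration. It shows how block sizes could be estimated in fixed memory, with estimates that never fall below the true count.

```python
import hashlib


class CountMinSketch:
    """Generic Count-min Sketch (illustrative; parameters are assumptions)."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        # One row of counters per hash function.
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # Derive a per-row hash by prefixing the key with the row number.
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._index(key, row)] += count

    def estimate(self, key):
        # Collisions can only inflate a counter, so the minimum over rows
        # is an upper bound on the true count that is as tight as possible.
        return min(self.table[row][self._index(key, row)]
                   for row in range(self.depth))


# Hypothetical blocking keys: count how many records fall under each key.
sketch = CountMinSketch()
for blocking_key in ["smith", "smith", "smith", "jones", "lee", "lee"]:
    sketch.add(blocking_key)

smith_estimate = sketch.estimate("smith")
```

A pruning step could then skip any candidate block whose estimated size exceeds a chosen threshold; the threshold itself would be a tuning choice that the abstract does not specify.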



“…BLAST [26] is a schema-aware meta-blocking approach. One innovation of BLAST is making records that share high entropy attributes like name more likely to be pairwise-compared than low entropy attributes like year of birth.…”

Section: Meta-blocking Based Approaches (mentioning)

confidence: 99%
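
The entropy contrast in the statement above can be made concrete with a small calculation. This is a minimal sketch that computes the Shannon entropy of each attribute's value distribution over a toy set of records; the records, attribute names, and the `shannon_entropy` helper are hypothetical, and the code does not reproduce BLAST's actual weighting scheme.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy records (invented for illustration): names are all distinct,
# while birth years repeat.
records = [
    {"name": "Ann Lee", "birth_year": 1980},
    {"name": "Bob Roy", "birth_year": 1980},
    {"name": "Cy Dunn", "birth_year": 1981},
    {"name": "Di Park", "birth_year": 1981},
]

attribute_entropy = {
    attr: shannon_entropy([r[attr] for r in records])
    for attr in ("name", "birth_year")
}
# name has 4 equally likely values (2 bits); birth_year has 2 (1 bit).
```

Under this view, a shared value in the higher-entropy attribute (`name`) is stronger evidence of a match than a shared value in the lower-entropy one (`birth_year`), which is the intuition the citing paper attributes to BLAST.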

Scalable Blocking for Very Large Databases

Borthwick, Ash, Pang, et al. 2020

Preprint


“…Both strategies rely on the load balancing algorithm MaxBlock [55] to avoid the underutilization of the available resources. BLAST is parallelized in [162], exploiting the broadcast join of Apache Spark for very high efficiency. Table 2 presents an overview of the Block Processing methods discussed above.…”

Section: Comparison Cleaning (mentioning)

confidence: 99%

An Overview of End-to-End Entity Resolution for Big Data

Efthymiou et al. 2020


One of the most critical tasks for improving data quality and increasing the reliability of data analytics is Entity Resolution (ER), which aims to identify different descriptions that refer to the same real-world entity. Despite several decades of research, ER remains a challenging problem. In this survey, we highlight the novel aspects of resolving Big Data entities when we should satisfy more than one of the Big Data characteristics simultaneously (i.e., Volume and Velocity with Variety). We present the basic concepts, processing steps, and execution strategies that have been proposed by database, semantic Web, and machine learning communities in order to cope with the loose structuredness, extreme diversity, high speed, and large scale of entity descriptions used by real-world applications. We provide an end-to-end view of ER workflows for Big Data, critically review the pros and cons of existing methods, and conclude with the main open research directions.


“…For Comparison Cleaning, we adopt the approach described in [60]. First, the Entity Index I_E is created in the form of an RDD, mapping each profile id to the ids of the blocks that contain it.…”

Section: Parallel Execution (mentioning)

confidence: 99%
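
The Entity Index described in the statement above is an inverted view of the block collection. The following is a minimal single-machine sketch in plain Python rather than a Spark RDD; the function name `build_entity_index` and the toy block collection are assumptions for illustration, not JedAI code.

```python
from collections import defaultdict


def build_entity_index(blocks):
    """Map each profile id to the sorted ids of the blocks that contain it."""
    index = defaultdict(list)
    for block_id, profile_ids in blocks.items():
        for profile_id in profile_ids:
            index[profile_id].append(block_id)
    return {pid: sorted(block_ids) for pid, block_ids in index.items()}


# Toy block collection (invented): block id -> profile ids in that block.
blocks = {
    0: [1, 2],
    1: [2, 3],
    2: [1, 2, 3],
}

entity_index = build_entity_index(blocks)

# Blocks shared by two profiles, a quantity that comparison-cleaning
# methods commonly use when weighting a candidate pair.
shared_blocks = set(entity_index[1]) & set(entity_index[2])
```

In a distributed setting the same mapping would be built with a group-by-key over (profile id, block id) pairs; how the cited system partitions or broadcasts it is not stated in the quoted text.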

Three-dimensional Entity Resolution with JedAI

Papadakis, Mandilaras, Gagliardelli, et al. 2020

Information Systems

Self Cite


Entity Resolution (ER) is the task of detecting different entity profiles that describe the same real-world objects. To facilitate its execution, we have developed JedAI, an open-source system that puts together a series of state-of-the-art ER techniques that have been proposed and examined independently, targeting parts of the ER end-to-end pipeline. This is a unique approach, as no other ER tool brings together so many established techniques. Instead, most ER tools merely convey a few techniques, those primarily developed by their creators. In addition to democratizing ER techniques, JedAI goes beyond the other ER tools by offering a series of unique characteristics: (i) It allows for building and benchmarking millions of ER pipelines. (ii) It is the only ER system that applies seamlessly to any combination of structured and/or semi-structured data. (iii) It constitutes the only ER system that runs seamlessly both on stand-alone computers and clusters of computers, through the parallel implementation of all algorithms in Apache Spark. (iv) It supports two different end-to-end workflows for carrying out batch ER (i.e., budget-agnostic), a schema-agnostic one based on blocks, and a schema-based one relying on similarity joins. (v) It adapts both end-to-end workflows to budget-aware (i.e., progressive) ER. We present in detail all features of JedAI, stressing the core characteristics that enhance its usability, and boost its versatility and effectiveness. We also compare it to the state-of-the-art in the field, qualitatively and quantitatively, demonstrating its state-of-the-art performance over a variety of large-scale datasets from different domains.
