2022
DOI: 10.14778/3523210.3523226
Entity resolution on-demand

Abstract: Entity Resolution (ER) aims to identify and merge records that refer to the same real-world entity. ER is typically employed as an expensive cleaning step on the entire data before consuming it. Yet, determining which entities are useful once cleaned depends solely on the user's application, which may need only a fraction of them. For instance, when dealing with Web data, we would like to be able to filter the entities of interest gathered from multiple sources without cleaning the entire, continuously-growing…

Cited by 9 publications (5 citation statements)
References 30 publications
“…We showed that four new weighting schemes give rise to feature sets that outperform the existing ones [25], while a very small, balanced training set with just 50 labelled instances suffices for high effectiveness, high time efficiency and high scalability. In the future, we will apply our approaches to Progressive ER [29,[33][34][35].…”
Section: Discussion
confidence: 99%
“…The latter requirement is significantly challenging, since it demands to correctly sort the entities even before they are generated using data fusion, only relying on the original records that can produce them. In this section, we give an overview of how BrewER overcomes these challenges, while the detailed description of the algorithm is provided in the research paper [13].…”
Section: An Overview of BrewER
confidence: 99%
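The quoted overview describes ordering entities before they are materialized by data fusion, using only the dirty records that would produce them. A minimal sketch of that idea, assuming a priority queue keyed on an upper bound of each candidate entity's sort attribute (cluster assignments, the averaging fusion function, and all names here are illustrative assumptions, not BrewER's actual implementation, which is detailed in the cited research paper):

```python
import heapq

# Toy dirty records: (candidate_entity_id, price). For simplicity the
# candidate clusters are assumed precomputed; in a real on-demand
# pipeline, matching itself would also be deferred.
records = [
    ("e1", 100), ("e1", 120),
    ("e2", 90),
    ("e3", 200), ("e3", 180),
]

def on_demand_er(records, limit=2):
    """Emit fused entities in descending price order, fusing lazily.

    Each cluster enters a max-heap under an upper bound (the max of
    its dirty values). A cluster is fused (here: averaged) only when
    it reaches the top; the fused value is re-inserted and emitted
    once it dominates every remaining bound, so its rank is final.
    Clusters that are never popped are never cleaned at all.
    """
    clusters = {}
    for eid, price in records:
        clusters.setdefault(eid, []).append(price)
    # Heap entries: (-key, entity_id, fused_value_or_None).
    heap = [(-max(vals), eid, None) for eid, vals in clusters.items()]
    heapq.heapify(heap)
    out = []
    while heap and len(out) < limit:
        _, eid, fused = heapq.heappop(heap)
        if fused is None:
            vals = clusters[eid]
            fused = sum(vals) / len(vals)  # data fusion: average
            heapq.heappush(heap, (-fused, eid, fused))
        else:
            out.append((eid, fused))
    return out
```

With `limit=2`, only the two top-ranked clusters are ever fused; the third stays dirty, which is the point of cleaning on demand.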
“…We will provide users with a set of dirty datasets, composed of the reference datasets used in the research paper [13] (i.e., cameras, USB sticks, and organizations) plus several additional ones (e.g., an extended version of cameras and further datasets of commercial products from the Alaska benchmark [4] and multiple datasets from the Magellan Data Repository 2 ). These datasets cover different domains and are highly heterogeneous in terms of cleanliness, number of attributes, and number of records, ranging from the 1K records of the smallest subset of USB sticks to the 29K records of the full camera dataset, on which the batch approach would take several hours to perform the entire cleaning process [13]. Each dataset comes with its ground truth, so the users will be able to assess the efficacy (precision/recall) of each step in the ER pipeline and the correctness of the results of the given queries.…”
Section: Demonstration Scenarios
confidence: 99%
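The quoted scenario lets users score each pipeline step against a ground truth. A minimal sketch of that pair-level precision/recall computation, assuming both the predicted matches and the ground truth are sets of record-id pairs (the pair encoding and names are illustrative, not the demonstration's actual code):

```python
def match_quality(predicted, truth):
    """Precision and recall of predicted matching pairs vs. ground truth.

    Both inputs are sets of frozensets, so pair order is irrelevant:
    frozenset({"r1", "r2"}) == frozenset({"r2", "r1"}).
    """
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

pred = {frozenset(p) for p in [("r1", "r2"), ("r3", "r4")]}
gt = {frozenset(p) for p in [("r1", "r2"), ("r5", "r6")]}
```

Here one of two predicted pairs is correct and one of two true pairs is found, so both precision and recall are 0.5.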