A Semi-Supervised Framework of Clustering Selection for De-Duplication

Kushagra, Shrinu; Saxena, Hemant; Ilyas, Ihab F.; Ben-David, Shai

doi:10.1109/icde.2019.00027

Cited by 12 publications

(10 citation statements)

References 21 publications

“…It is important to optimize the execution cost of FoRWaRD, particularly in the case of batch updates with missing information (nulls). It will also be important to understand the performance of FoRWaRD (in comparison to alternatives) on machine-learning tasks other than column prediction: record linking [13,29], entity resolution [10,21], data imputation [22,41], data cleaning [1,36,41] and so on.…”

Section: Discussion (mentioning)

confidence: 99%

Dynamic Database Embeddings with FoRWaRD

Toenshoff, Friedman, Grohe et al. 2021

Preprint

We study the problem of computing an embedding of the tuples of a relational database in a manner that is extensible to dynamic changes of the database. Importantly, the embedding of existing tuples should not change due to the embedding of newly inserted tuples (as database applications might rely on existing embeddings), while the embedding of all tuples, old and new, should retain high quality. This task is challenging since state-of-the-art embedding techniques for structured data, such as (adaptations of) embeddings on graphs, have inherent inter-dependencies among the embeddings of different entities. We present the FoRWaRD algorithm (Foreign Key Random Walk Embeddings for Relational Databases) that draws from embedding techniques for general graphs and knowledge graphs, and is inherently utilizing the schema and its key and foreign-key constraints. We compare FoRWaRD to an alternative approach that we devise by adapting node embeddings for graphs (Node2Vec) to dynamic databases. We show that FoRWaRD is comparable and sometimes superior to state-of-the-art embeddings in the static (traditional) setting, using a collection of downstream tasks of column prediction over geographical and biological domains. More importantly, in the dynamic setting FoRWaRD outperforms the alternatives consistently and often considerably, and features only a mild reduction of quality even when the database consists of mostly newly inserted tuples.

“…Correlation Clustering [12] solves an optimization task, where the goal is to maximize the sum of the intra-cluster edges, while minimizing the sum of the inter-cluster ones. This is an NP-hard problem that is typically solved through approximations, such as Clustering Aggregation [73] and Restricted Correlation Clustering [109]. The latter is a semi-supervised approach that leverages a small labeled dataset, which is carefully selected via an efficient sampling procedure based on LSH.…”

Section: Clustering Methods (mentioning)

confidence: 99%

An Overview of End-to-End Entity Resolution for Big Data

Efthymiou²,

et al. 2020

One of the most critical tasks for improving data quality and increasing the reliability of data analytics is Entity Resolution (ER), which aims to identify different descriptions that refer to the same real-world entity. Despite several decades of research, ER remains a challenging problem. In this survey, we highlight the novel aspects of resolving Big Data entities when we should satisfy more than one of the Big Data characteristics simultaneously (i.e., Volume and Velocity with Variety). We present the basic concepts, processing steps, and execution strategies that have been proposed by database, semantic Web, and machine learning communities in order to cope with the loose structuredness, extreme diversity, high speed, and large scale of entity descriptions used by real-world applications. We provide an end-to-end view of ER workflows for Big Data, critically review the pros and cons of existing methods, and conclude with the main open research directions.
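The citing statement for this entry describes Correlation Clustering as maximizing the sum of intra-cluster edges while minimizing the sum of inter-cluster ones. As a rough illustration only, the sketch below computes those two sums for a given assignment on a toy similarity graph. The function name `edge_sums`, the toy weights, and the choice to compare candidates by `intra - inter` are assumptions made for this example; they are not taken from the cited papers, and the sketch does not implement Restricted Correlation Clustering or its LSH-based sampling.

```python
def edge_sums(weights, labels):
    """Return (intra, inter): total edge weight inside clusters and across clusters.

    weights: dict mapping a pair (u, v) to a similarity weight (hypothetical toy input)
    labels:  dict mapping each record to its cluster id
    """
    intra = 0.0
    inter = 0.0
    for (u, v), w in weights.items():
        if labels[u] == labels[v]:
            intra += w  # both endpoints fall in the same cluster
        else:
            inter += w  # the edge is cut by the clustering
    return intra, inter


def score(weights, labels):
    """One simple way to combine the two sums: higher is better (an assumption)."""
    intra, inter = edge_sums(weights, labels)
    return intra - inter


# Toy similarity graph over four records; high weight suggests duplicates.
weights = {("a", "b"): 0.9, ("a", "c"): 0.1, ("b", "c"): 0.2, ("c", "d"): 0.8}

grouped_by_similarity = {"a": 0, "b": 0, "c": 1, "d": 1}
grouped_arbitrarily = {"a": 0, "b": 1, "c": 0, "d": 1}

print(score(weights, grouped_by_similarity) > score(weights, grouped_arbitrarily))
```

Evaluating a fixed assignment is cheap; the NP-hardness mentioned in the citing statement concerns searching over all possible assignments for the best one.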

“…The latter can undeniably cause jumping. That brings us to the conclusion that Stadion should not be used, with this formulation, for applications with large K w.r.t. N, for instance deduplication [54], and that further investigations are needed in this context.…”

Section: B1 Importance Study With Fanova (mentioning)

confidence: 98%

Selecting the Number of Clusters $K$ with a Stability Trade-off: an Internal Validation Criterion

Mourer, Forest, Lebbah et al. 2020

Preprint

Model selection is a major challenge in non-parametric clustering. There is no universally admitted way to evaluate clustering results for the obvious reason that there is no ground truth against which results could be tested, as in supervised learning. The difficulty to find a universal evaluation criterion is a direct consequence of the fundamentally ill-defined objective of clustering. In this perspective, clustering stability has emerged as a natural and model-agnostic principle: an algorithm should find stable structures in the data. If data sets are repeatedly sampled from the same underlying distribution, an algorithm should find similar partitions. However, it turns out that stability alone is not a well-suited tool to determine the number of clusters. For instance, it is unable to detect if the number of clusters is too small. We propose a new principle for clustering validation: a good clustering should be stable, and within each cluster, there should exist no stable partition. This principle leads to a novel internal clustering validity criterion based on between-cluster and within-cluster stability, overcoming limitations of previous stability-based methods. We empirically show the superior ability of additive noise to discover structures, compared with sampling-based perturbation. We demonstrate the effectiveness of our method for selecting the number of clusters through a large number of experiments and compare it with existing evaluation methods.
