Semi-supervised clustering for de-duplication

Kushagra, Shrinu; Ben-David, Shai; Ilyas, Ihab F.

doi:10.48550/arxiv.1810.04361

Cited by 2 publications

(3 citation statements)

References 0 publications

“…Gamlath, Huang and Svensson extended the above result when approximation is allowed [15]. Ailon et al. [2] studied correlation clustering with same-cluster queries and showed that there exists a (1 + ε)-approximation for correlation clustering where the number of queries is a (large) polynomial in k. Our algorithms are different from those in [2] in that our guarantees are parameterized by C_OPT rather than by k. Kushagra et al. [19] study a restricted version of correlation clustering where the valid clusterings are provided by a set of hierarchical trees and provide an algorithm using same-cluster queries for a related setting, giving guarantees in terms of the size of the input instance (or the VC dimension of the input instance) rather than C_OPT. [20] studied, among other clustering problems, a random instance of correlation clustering under same-cluster queries.…”

Section: Related Work (mentioning)

confidence: 91%

Correlation Clustering with Same-Cluster Queries Bounded by Optimal Cost

Saha, Subramanian

2019

Preprint

Several clustering frameworks with interactive (semi-supervised) queries have been studied in the past. Recently, clustering with same-cluster queries has become popular. An algorithm in this setting has access to an oracle with full knowledge of an optimal clustering, and the algorithm can ask the oracle queries of the form, "Does the optimal clustering put vertices u and v in the same cluster?" Due to its simplicity, this querying model can easily be implemented in real crowd-sourcing platforms and has attracted a lot of recent work.

In this paper, we study the popular correlation clustering problem (Bansal et al., 2002) under the same-cluster querying framework. Given a complete graph G = (V, E) with positive and negative edge labels, the correlation clustering objective aims to compute a graph clustering that minimizes the total number of disagreements, that is, the negative intra-cluster edges and positive inter-cluster edges. In a recent work, Ailon et al. (2018b) provided an approximation algorithm for correlation clustering that approximates the correlation clustering objective within (1 + ε) with O(k^14 log n log k / ε^6) …

* B. Saha is partially supported by an NSF CAREER Award CCF 1652303, a Google Faculty Award and an Alfred P. Sloan fellowship.
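The disagreement objective and the same-cluster oracle described in this abstract can be illustrated with a minimal Python sketch. The function names, the edge representation (a set of positive pairs, every other pair treated as negative) and the toy graph are assumptions made for illustration; they are not taken from the cited paper.

```python
from itertools import combinations


def disagreements(labels, positive_edges):
    """Count disagreements of a clustering on a complete signed graph.

    labels: dict mapping each vertex to a cluster id.
    positive_edges: set of frozenset({u, v}) pairs labeled '+';
        every other pair is treated as labeled '-' (assumed encoding).
    """
    cost = 0
    for u, v in combinations(sorted(labels), 2):
        same_cluster = labels[u] == labels[v]
        is_positive = frozenset((u, v)) in positive_edges
        # A disagreement is a '-' edge inside a cluster
        # or a '+' edge between two clusters.
        if same_cluster != is_positive:
            cost += 1
    return cost


def make_same_cluster_oracle(optimal_labels):
    """Return an oracle answering: are u and v together in the optimal clustering?"""
    def oracle(u, v):
        return optimal_labels[u] == optimal_labels[v]
    return oracle


# Hypothetical toy instance: five vertices, three positive edges.
positive_edges = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("d", "e")]}
candidate = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
oracle = make_same_cluster_oracle(candidate)
cost = disagreements(candidate, positive_edges)
```

In this toy instance the only disagreement is the negative pair (a, c) placed inside one cluster. A query-based algorithm would call `oracle(u, v)` on selected pairs rather than reading the optimal labels directly.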

“…where k is the number of non-singleton clusters and k_1 and k_2 are known. In Section B.3, we describe a principled approach to select the right value of k based on the framework of SSC (semi-supervised clustering) introduced in [19,20] and describe our complete sampling approach.…”

Section: LSH-based Sampling (mentioning)

confidence: 99%

“…Note that each C_{k_i} is a clustering of the given dataset. We then use the SSC framework to select the best clustering from G. Owing to space constraints, we describe the details of the SSC algorithm (almost identical to the algorithm in [20]) and related proofs in the appendix section. We describe our "clustering and hashing" based sampling algorithm and then prove the main result from this section.…”

Section: Semi-supervised Clustering (mentioning)

confidence: 99%

On sampling from data with duplicate records

Heidari, Kushagra, Ilyas

2020

Preprint

Self Cite

Data deduplication is the task of detecting records in a database that correspond to the same real-world entity. Our goal is to develop a procedure that samples uniformly from the set of entities present in the database in the presence of duplicates. We accomplish this by a two-stage process. In the first step, we estimate the frequencies of all the entities in the database. In the second step, we use rejection sampling to obtain an (approximately) uniform sample from the set of entities. However, efficiently estimating the frequency of all the entities is a non-trivial task and not attainable in the general case. Hence, we consider various natural properties of the data under which such frequency estimation (and consequently uniform sampling) is possible. Under each of those assumptions, we provide sampling algorithms and give proofs of the complexity (both statistical and computational) of our approach. We complement our study by conducting extensive experiments on both real and synthetic datasets.

Preprint. Under review.
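The second stage of the two-stage process described in this abstract, rejection sampling given entity frequencies, can be sketched in Python as follows. This is a simplified illustration and not the cited paper's algorithm: it assumes exact frequencies are already available (the abstract says they are estimated), and the names `sample_entity_uniformly`, `entity_of`, and the toy records are hypothetical.

```python
import random
from collections import Counter


def sample_entity_uniformly(records, entity_of, frequency, rng):
    """Draw one entity (approximately) uniformly via rejection sampling.

    records: list of database records, possibly containing duplicates.
    entity_of: function mapping a record to its entity (assumed known here).
    frequency: dict mapping each entity to its number of records.
    rng: a random.Random instance.
    """
    while True:
        # A uniformly drawn record hits entity e with probability f_e / n.
        record = rng.choice(records)
        entity = entity_of(record)
        # Accepting with probability 1 / f_e cancels the bias toward
        # frequent entities: each entity is returned with probability 1 / n
        # per round.
        if rng.random() < 1.0 / frequency[entity]:
            return entity


# Hypothetical toy database: entity "x" has 8 duplicates, entity "y" has 2.
records = ["x"] * 8 + ["y"] * 2
frequency = Counter(records)
rng = random.Random(0)
draws = Counter(
    sample_entity_uniformly(records, lambda r: r, frequency, rng)
    for _ in range(4000)
)
```

With exact frequencies each round returns either entity with equal probability, so the two counts in `draws` should be close to each other. With estimated frequencies the sample is only approximately uniform, which matches the hedge in the abstract.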
