Data de-duplication is the task of detecting multiple records that correspond to the same real-world entity in a database. In this work, we view de-duplication as a clustering problem where the goal is to put records corresponding to the same physical entity in the same cluster and records corresponding to different physical entities into different clusters. We introduce a framework which we call promise correlation clustering. Given a complete graph G with the edges labelled 0 and 1, the goal is to find a clustering that minimizes the number of 0 edges within a cluster plus the number of 1 edges across different clusters (or correlation loss). The optimal clustering can also be viewed as a complete graph G* with edges corresponding to points in the same cluster labelled 1 and all other edges labelled 0. Under the promise that the edge difference between G and G* is "small", we prove that finding the optimal clustering (or G*) is still NP-Hard. [Ashtiani et al., 2016] introduced the framework of semi-supervised clustering, where the learning algorithm has access to an oracle that answers whether two points belong to the same or different clusters. We further prove that even with access to a same-cluster oracle, the promise version remains NP-Hard as long as the number of queries to the oracle is not too large (o(n), where n is the number of vertices). Given these negative results, we consider a restricted version of correlation clustering. As before, the goal is to find a clustering that minimizes the correlation loss; however, we restrict ourselves to a given class F of clusterings. We offer a semi-supervised algorithmic approach to solve the restricted variant with success guarantees.
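To make the objective concrete, here is a minimal Python sketch (illustrative only, not from the paper) that computes the correlation loss of a candidate clustering against a 0/1-labelled complete graph; the matrix encoding of the graph and the function name are assumptions made for the example.

```python
# Minimal sketch of the correlation loss: count 0-labelled edges placed inside
# a cluster plus 1-labelled edges placed across clusters. Names and the matrix
# encoding of the labelled graph are illustrative, not from the paper.
import itertools
import numpy as np

def correlation_loss(edge_labels, clustering):
    """edge_labels[i, j] is the 0/1 label of edge (i, j); clustering[i] is the
    cluster id of vertex i. Returns the number of edge disagreements."""
    loss = 0
    for i, j in itertools.combinations(range(len(clustering)), 2):
        same_cluster = clustering[i] == clustering[j]
        if same_cluster and edge_labels[i, j] == 0:
            loss += 1  # a 0 edge kept inside a cluster
        elif not same_cluster and edge_labels[i, j] == 1:
            loss += 1  # a 1 edge cut across clusters
    return loss

labels = np.array([[0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 0]])  # records 0 and 1 are labelled as the same entity
print(correlation_loss(labels, [0, 0, 1]))  # 0: clustering agrees with every edge
print(correlation_loss(labels, [0, 1, 2]))  # 1: the 1-edge (0, 1) is cut
```

In this encoding, the promise version asks for the minimizer of this loss under the guarantee that some clustering achieves a small value, and a same-cluster oracle simply reveals, for a queried pair, whether the two vertices share a cluster in the (unknown) optimal clustering.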
PGMax is an open-source Python package for easy specification of discrete Probabilistic Graphical Models (PGMs) as factor graphs, and automatic derivation of efficient and scalable loopy belief propagation (LBP) implementations in JAX. It supports general factor graphs, and can effectively leverage modern accelerators like GPUs for inference. Compared with existing alternatives, PGMax obtains higher-quality inference results with orders-of-magnitude inference speedups. PGMax additionally interacts seamlessly with the rapidly growing JAX ecosystem, opening up exciting new possibilities. Our source code, examples and documentation are available at https://github.com/vicariousinc/PGMax.
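PGMax's actual interface is documented at the repository above; purely as an illustration of the kind of computation it specifies and runs, the following self-contained JAX snippet (not using the PGMax API) performs sum-product message passing on a tiny three-variable chain, where belief propagation is exact, with arbitrary demo potentials.

```python
# Hand-rolled sum-product message passing in plain JAX on a 3-variable binary
# chain x0 - x1 - x2. This is NOT PGMax's API; it only illustrates the kind of
# factor-graph inference that PGMax specifies declaratively and scales on GPUs.
import jax.numpy as jnp
from jax.scipy.special import logsumexp

# Unary (evidence) and pairwise log-potentials, arbitrary values for the demo.
unary = jnp.log(jnp.array([[0.7, 0.3],
                           [0.5, 0.5],
                           [0.2, 0.8]]))      # (3 variables, 2 states)
pairwise = jnp.log(jnp.array([[0.9, 0.1],
                              [0.1, 0.9]]))   # favours equal neighbouring states

# Forward and backward messages along the chain, all in log space.
fwd01 = logsumexp(unary[0][:, None] + pairwise, axis=0)            # x0 -> x1
fwd12 = logsumexp((unary[1] + fwd01)[:, None] + pairwise, axis=0)  # x1 -> x2
bwd21 = logsumexp(unary[2][None, :] + pairwise, axis=1)            # x2 -> x1
bwd10 = logsumexp((unary[1] + bwd21)[None, :] + pairwise, axis=1)  # x1 -> x0

# Beliefs combine local evidence with incoming messages; normalise to marginals.
log_beliefs = jnp.stack([unary[0] + bwd10,
                         unary[1] + fwd01 + bwd21,
                         unary[2] + fwd12])
marginals = jnp.exp(log_beliefs - logsumexp(log_beliefs, axis=1, keepdims=True))
print(marginals)  # one row of state probabilities per variable
```

On a chain this message passing is exact; on graphs with cycles the same updates are iterated (typically with damping) until convergence, which is the loopy regime that PGMax targets.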
Data deduplication is the task of detecting records in a database that correspond to the same real-world entity. Our goal is to develop a procedure that samples uniformly from the set of entities present in the database in the presence of duplicates. We accomplish this by a two-stage process. In the first step, we estimate the frequencies of all the entities in the database. In the second step, we use rejection sampling to obtain an (approximately) uniform sample from the set of entities. However, efficiently estimating the frequency of all the entities is a non-trivial task and not attainable in the general case. Hence, we consider various natural properties of the data under which such frequency estimation (and consequently uniform sampling) is possible. Under each of those assumptions, we provide sampling algorithms and give proofs of the complexity (both statistical and computational) of our approach. We complement our study by conducting extensive experiments on both real and synthetic datasets.
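As a rough sketch of the two-stage idea (not the paper's estimators), the snippet below uses a toy frequency step that counts exact duplicates and then rejection-samples: a record drawn uniformly is accepted with probability inversely proportional to its entity's estimated frequency, so every entity is returned with (approximately) equal probability. The helper names `estimate_frequency` and `sample_entity` are hypothetical.

```python
# Toy illustration of frequency estimation followed by rejection sampling.
# Here duplicates are exact string copies; the paper's setting is harder and
# only assumes (approximate) frequency estimates are available.
import random
from collections import Counter

def estimate_frequency(records):
    """Stage 1 (toy version): treat identical records as the same entity."""
    return Counter(records)

def sample_entity(records, freq, rng):
    """Stage 2: a record of an entity with frequency f is drawn with
    probability f/n and accepted with probability 1/f, so each entity is
    returned with the same overall probability."""
    while True:
        r = rng.choice(records)
        if rng.random() < 1.0 / freq[r]:
            return r

records = ["alice", "alice", "alice", "bob", "carol", "carol"]
freq = estimate_frequency(records)
rng = random.Random(0)
draws = Counter(sample_entity(records, freq, rng) for _ in range(3000))
print(draws)  # roughly equal counts for alice, bob and carol
```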
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations: citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.