Ruoxi Jia scite author profile

Given a data set D containing millions of data points and a data consumer who is willing to pay for $X to train a machine learning (ML) model over D, how should we distribute this $X to each data point to reflect its "value"? In this paper, we define the "relative value of data" via the Shapley value, as it uniquely possesses properties with appealing real-world interpretations, such as fairness, rationality and decentralizability. For general, bounded utility functions, the Shapley value is known to be challenging to compute: to get Shapley values for all N data points, it requires O(2 N ) model evaluations for exact computation and O(N log N) for ( , δ)-approximation.In this paper, we focus on one popular family of ML models relying on K-nearest neighbors (KNN). The most surprising result is that for unweighted KNN classifiers and regressors, the Shapley value of all N data points can be computed, exactly, in O(N log N) time -an exponential improvement on computational complexity! Moreover, for ( , δ)-approximation, we are able to develop an algorithm based on Locality Sensitive Hashing (LSH) with only sublinear complexity O(N h( ,K) log N) when is not too small and K is not too large. We empirically evaluate our algorithms on up to 10 million data points and even our exact algorithm is up to three orders of magnitude faster than the baseline approximation algorithm. The LSH-based approximation algorithm can accelerate the value calculation process even further.We then extend our algorithms to other scenarios such as (1) weighed KNN classifiers, (2) different data points are clustered by different data curators, and (3) there are data analysts providing computation who also requires proper valuation. Some of these extensions, although also being improved exponentially, are less practical for exact computation (e.g., O(N K ) complexity for weighted KNN). We thus propose a Monte Carlo approximation algorithm, which is O(N(log N) 2 /(log K) 2 ) times more efficient than the baseline approximation algorithm.

show abstract

A Principled Approach to Data Valuation for Federated Learning

Wang

Rausch

Zhang

et al. 2020

View full text Add to dashboard Cite

Federated Learning (FL), a distributed learning paradigm that scales on-device learning collaboratively, has emerged as a promising approach for decentralized AI applications. Local optimization methods such as Federated Averaging (FedAvg) are the most prominent methods for FL applications. Despite their simplicity and popularity, the theoretical understanding of local optimization methods is far from clear. This dissertation aims to advance the theoretical foundation of local methods in the following three directions.First, we establish sharp bounds for FedAvg, the most popular algorithm in Federated Learning. We demonstrate how FedAvg may suffer from a notion we call iterate bias, and how an additional third-order smoothness assumption may mitigate this effect and lead to better convergence rates. We explain this phenomenon from a Stochastic Differential Equation (SDE) perspective.Second, we propose Federated Accelerated Stochastic Gradient Descent (FedAc), the first principled acceleration of FedAvg, which provably improves the convergence rate and communication efficiency.Our technique uses on a potential-based perturbed iterate analysis, a novel stability analysis of generalized accelerated SGD, and a strategic tradeoff between acceleration and stability.Third, we study the Federated Composite Optimization problem, which extends the classic smooth setting by incorporating a shared non-smooth regularizer. We show that direct extensions of FedAvg may suffer from the "curse of primal averaging," resulting in slow convergence. As a solution, we propose a new primal-dual algorithm, Federated Dual Averaging, which overcomes the curse of primal averaging by employing a novel inter-client dual averaging procedure.1. Establishing sharp understanding of the existing FL algorithms.

show abstract

REFIT: A Unified Watermark Removal Framework For Deep Learning Systems With Limited Data

Wang

Bender

Ding

et al. 2021

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Ruoxi Jia

The Secret Revealer: Generative Model-Inversion Attacks Against Deep Neural Networks

Efficient task-specific data valuation for nearest neighbor algorithms

A Principled Approach to Data Valuation for Federated Learning

REFIT: A Unified Watermark Removal Framework For Deep Learning Systems With Limited Data

Contact Info

Product

Resources

About