2021
DOI: 10.1007/978-3-030-86520-7_41
Finding High-Value Training Data Subset Through Differentiable Convex Programming

Cited by 5 publications (5 citation statements)
References 5 publications
“…Therefore, our framework shares the same spirit as the traditional label-efficiency research. Data valuation: in the literature, other than active learning, there exist many techniques to quantify the importance of individual samples, e.g., the influence function [Koh and Liang 2017] and its variants [Wu, Weimer, and Davidson 2021], Glister [Killamsetty et al 2021], HOST-CP [Das et al 2021], TracIn [Pruthi et al 2020], DVRL [Yoon, Arik, and Pfister 2020] and the Data Shapley value [Ghorbani and Zou 2019]. However, among these methods, the Data Shapley value [Ghorbani and Zou 2019] is very computationally expensive, while the others rely on the assumption that a set of "clean" validation samples (or meta samples) is given, which makes them unsuitable for our framework (we discuss the Data Shapley value and its extensions in more detail in Appendix "Appendix: more related work").…”
Section: Related Work
confidence: 99%
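The computational expense attributed to the Data Shapley value above comes from repeated retraining: its standard Monte Carlo estimator averages marginal utility gains over random permutations, costing one model-training ("utility") call per point per permutation. A minimal sketch of that estimator, where the function name and the toy additive utility are illustrative assumptions rather than anything from the cited papers:

```python
import numpy as np

def data_shapley_mc(utility, n_points, n_perms=200, rng=None):
    """Monte Carlo estimate of Data Shapley values.

    utility: callable mapping a list of point indices to a scalar score
             (in practice, validation accuracy of a model retrained on
             that subset -- the expensive part).
    """
    rng = np.random.default_rng(rng)
    values = np.zeros(n_points)
    for _ in range(n_perms):
        perm = rng.permutation(n_points)
        prev_u = utility([])
        subset = []
        for i in perm:
            subset.append(i)
            u = utility(subset)          # one "retraining" per point per pass
            values[i] += u - prev_u      # marginal contribution of point i
            prev_u = u
    return values / n_perms

# Toy additive utility: fraction of "clean" points in the subset.
# A mislabeled point (index 2) never improves utility, so its value is 0.
clean = np.array([1.0, 1.0, 0.0, 1.0])
util = lambda idx: clean[list(idx)].sum() / len(clean) if idx else 0.0
vals = data_shapley_mc(util, n_points=4, n_perms=50, rng=0)
```

Each estimate costs `n_perms * n_points` utility evaluations, i.e., model retrainings in the realistic setting, which is why the quoted passage calls the method very computationally expensive.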
“…It finds application in subset selection, providing explanations for predictions, diagnosing mislabelled examples, and so on. There have been several works in the data-valuation literature encompassing influence functions [10], Shapley values [7], reinforcement learning [23], differentiable convex programming [5], tracking training trajectories [14] and more. However, apart from [10] and [14], all of the above-mentioned techniques are expensive to scale to large datasets and models, since they merge the training and scoring datapoints in a combined framework.…”
Section: Related Work
confidence: 99%
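Of the cheaper methods singled out above, TracIn [Pruthi et al 2020] (the trajectory-tracking approach, [14]) scores a training point by summing, over saved checkpoints, the learning rate times the dot product of its loss gradient with the test point's loss gradient. A hedged numpy sketch for a linear model with squared loss; the function name and toy setup are my own, not from the cited work:

```python
import numpy as np

def tracin_score(checkpoints, lrs, x_train, y_train, x_test, y_test):
    """TracIn-style influence of one training point on one test point:
    sum over checkpoints of lr * <grad_train, grad_test>, for a linear
    model w with squared loss, whose gradient is 2*(w@x - y)*x."""
    score = 0.0
    for w, lr in zip(checkpoints, lrs):
        g_tr = 2.0 * (w @ x_train - y_train) * x_train
        g_te = 2.0 * (w @ x_test - y_test) * x_test
        score += lr * float(g_tr @ g_te)
    return score

# A training point that matches the test point (same features, same label)
# acts as a proponent; the same features with a flipped label, an opponent.
ckpts, lrs = [np.zeros(2)], [0.1]
x = np.array([1.0, 0.0])
proponent = tracin_score(ckpts, lrs, x, 1.0, x, 1.0)
opponent = tracin_score(ckpts, lrs, x, -1.0, x, 1.0)
```

Because the score only needs per-example gradients at already-saved checkpoints, no joint retraining over subsets is required, which is the scalability advantage the quoted passage attributes to [14].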
“…Finding "influential" datapoints in a training dataset, also known as data valuation [10, 14] and data-subset selection [5], has emerged as an important sub-problem for many modern deep-learning application domains, e.g., Data-centric AI [7], explainability in trusted AI [15], [2], debugging the training process [10], and scalable supervised deep learning [9].…”
Section: Introduction
confidence: 99%