Gintare Karolina Dziugaite scite author profile

Understanding generalization is arguably one of the most important questions in deep learning. Deep learning has been successfully applied to a large number of problems ranging from pattern recognition to complex decision making, but researchers have recently raised many concerns about it, the most important of which is generalization. Despite numerous attempts, conventional statistical learning approaches have not yet been able to provide a satisfactory explanation of why deep learning works. A recent line of work aims to address the problem by trying to predict generalization performance through complexity measures. In this competition, we invite the community to propose complexity measures that can accurately predict the generalization of models. A robust and general complexity measure could lead to a better understanding of deep learning's underlying mechanism and the behavior of deep models on unseen data, or shed light on better generalization bounds. All these outcomes will be important for making deep learning more robust and reliable. * Lead organizer: Yiding Jiang; Scott Yak and Pierre Foret helped implement a large portion of the infrastructure, and the remaining organizers' order is randomized.

Information-Theoretic Generalization Bounds for Stochastic Gradient Descent

Neu¹, Dziugaite², Haghifam³ et al. 2021. Preprint.

Requiem for the max rule?

Shen, Dziugaite et al. 2015. Vision Research.

In tasks such as visual search and change detection, a key question is how observers integrate noisy measurements from multiple locations to make a decision. Decision rules proposed to model this process have fallen into two categories: Bayes-optimal (ideal observer) rules and ad-hoc rules. Among the latter, the maximum-of-outputs (max) rule has been most prominent. Reviewing recent work and performing new model comparisons across a range of paradigms, we find that in all cases except for one, the optimal rule describes human data as well as or better than every max rule either previously proposed or newly introduced here. This casts doubt on the utility of the max rule for understanding perceptual decision-making.
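To make the term concrete, here is a minimal sketch of the simplest form a max rule could take in a detection task: the observer reports "target present" when the largest of the noisy local measurements exceeds a criterion. The function names, the Gaussian noise model, and the `signal` and `noise_sd` parameters are illustrative assumptions, not taken from the paper, which compares several max-rule variants against the Bayes-optimal rule.

```python
import numpy as np


def max_rule_decision(measurements, criterion):
    """Report 'target present' when the largest local measurement exceeds the criterion."""
    return bool(np.max(measurements) > criterion)


def simulate_trial(rng, n_locations, target_present, signal=2.0, noise_sd=1.0):
    """Draw one noisy measurement per location (hypothetical Gaussian noise model).

    On target-present trials, a fixed signal is added at one randomly chosen location.
    """
    x = rng.normal(0.0, noise_sd, size=n_locations)
    if target_present:
        x[rng.integers(n_locations)] += signal
    return x


# Usage: estimate hit and false-alarm rates for one criterion.
rng = np.random.default_rng(0)
present = [max_rule_decision(simulate_trial(rng, 4, True), 1.5) for _ in range(1000)]
absent = [max_rule_decision(simulate_trial(rng, 4, False), 1.5) for _ in range(1000)]
hit_rate, false_alarm_rate = np.mean(present), np.mean(absent)
```

A model comparison of the kind described would fit such a rule and an ideal-observer rule to human responses; that fitting step is outside this sketch.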

Linear Mode Connectivity and the Lottery Ticket Hypothesis

Frankle¹, Dziugaite², Roy³ et al. 2019. Preprint.

Information-Theoretic Generalization Bounds for SGLD via Data-Dependent Estimates

Negrea¹, Haghifam², Dziugaite³ et al. 2019. Preprint.

In this work, we improve upon the stepwise analysis of noisy iterative learning algorithms initiated by Pensia, Jog, and Loh (2018) and recently extended by Bu, Zou, and Veeravalli (2019). Our main contributions are significantly improved mutual information bounds for Stochastic Gradient Langevin Dynamics via data-dependent estimates. Our approach is based on the variational characterization of mutual information and the use of data-dependent priors that forecast the minibatch gradient based on a subset of the training samples. Our approach is broadly applicable within the information-theoretic framework of Russo and Zou (2015) and Xu and Raginsky (2017). Our bound can be tied to a measure of flatness of the empirical risk surface. As compared with other bounds that depend on the squared norms of gradients, empirical investigations show that the terms in our bounds are orders of magnitude smaller.

Deep Learning on a Data Diet: Finding Important Examples Early in Training

Paul¹, Ganguli², Dziugaite³ 2021. Preprint.

The recent success of deep learning has partially been driven by training increasingly overparametrized networks on ever larger datasets. It is therefore natural to ask: how much of the data is superfluous, which examples are important for generalization, and how do we find them? In this work, we make the striking observation that, on standard vision benchmarks, the initial loss gradient norm of individual training examples, averaged over several weight initializations, can be used to identify a smaller set of training data that is important for generalization. Furthermore, after only a few epochs of training, the information in gradient norms is reflected in the normed error (the L2 distance between the predicted probabilities and one-hot labels), which can be used to prune a significant fraction of the dataset without sacrificing test accuracy. Based on this, we propose data pruning methods which use only local information early in training, and connect them to recent work that prunes data by discarding examples that are rarely forgotten over the course of training. Our methods also shed light on how the underlying data distribution shapes the training dynamics: they rank examples based on their importance for generalization, detect noisy examples and identify subspaces of the model's data representation that are relatively stable over training.

Recently, deep learning has made remarkable progress driven, in part, by training overparameterized models on ever larger datasets. This trend creates new challenges: the large computational resources required pose a roadblock to the democratization of AI. Memory and resource constrained settings, such as on-device computing, require smaller models and datasets. Identifying important training data plays a role in online and active learning. Finally, it is of theoretical interest to understand how individual examples and sub-populations of training examples influence learning.

To address these challenges, we propose a scoring method that can be used to identify important and difficult examples early in training, and prune the training dataset without large sacrifices in test accuracy. We also investigate how different sub-populations of the training data identified by our score affect the loss surface and training dynamics of the model.

Recent work on pruning data [1,2] can be placed in the broader context of identifying coresets that allow training to approximately the same accuracy as would be possible with the original data [3][4][5][6][7]. These works attempt to identify examples that provably guarantee a small gap in training error on the full dataset. However, due to the nonconvex nature of deep learning, these techniques make conservative estimates that lead to weak theoretical guarantees and are less effective in practice.
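The error-based score described in the abstract (the L2 distance between predicted probabilities and one-hot labels) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the logits are assumed to come from a model trained for a few epochs, and the pruning helper assumes that the highest-scoring examples are the ones to keep.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def error_l2_scores(logits, labels):
    """Per-example L2 distance between predicted probabilities and one-hot labels.

    logits: array of shape (n_examples, n_classes)
    labels: integer class indices of shape (n_examples,)
    """
    probs = softmax(logits)
    one_hot = np.eye(logits.shape[1])[labels]
    return np.linalg.norm(probs - one_hot, axis=1)


def keep_highest_scoring(scores, keep_fraction):
    """Return sorted indices of the examples to keep.

    Assumption: examples with the highest scores are retained and the rest are pruned.
    """
    n_keep = max(1, int(round(keep_fraction * len(scores))))
    order = np.argsort(scores)[::-1]  # indices from highest to lowest score
    return np.sort(order[:n_keep])


# Usage with made-up logits for three examples that all have true class 0.
logits = np.array([[10.0, 0.0, 0.0],   # confidently correct
                   [0.0, 0.0, 0.0],    # uncertain
                   [0.0, 10.0, 0.0]])  # confidently wrong
labels = np.array([0, 0, 0])
scores = error_l2_scores(logits, labels)
kept = keep_highest_scoring(scores, keep_fraction=2 / 3)
```

In this toy input the confidently correct example receives the lowest score and is the one pruned. Averaging the scores over several independently initialized runs before ranking, in the spirit of the averaging the abstract describes for gradient norms, would be a natural extension.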
