Neural Kernels Without Tangents
Preprint, 2020
DOI: 10.48550/arxiv.2003.02237

Cited by 8 publications (24 citation statements)
References 0 publications

“…We use batch size 125, train for 140 epochs, and decay the learning rate thrice at epochs 80, 100 and 120 each by a factor 0.2. We use standard random crop, random flip, normalization, and cutout augmentation [77] for the training data.…”
Section: A Experiments Details (mentioning)
confidence: 99%
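
The schedule quoted above (batch size 125, 140 epochs, learning-rate decay by a factor of 0.2 at epochs 80, 100, and 120, with random crop, flip, normalization, and cutout) maps onto a standard training loop. Below is a minimal sketch, assuming a CIFAR-10 setup in PyTorch; the architecture, the SGD optimizer and its base learning rate, and the use of torchvision's RandomErasing as a cutout-like stand-in are illustrative assumptions, not details taken from the citing paper.

# Sketch only: hyperparameters other than those quoted above are assumed.
import torch
import torchvision
import torchvision.transforms as T

train_tf = T.Compose([
    T.RandomCrop(32, padding=4),        # standard random crop
    T.RandomHorizontalFlip(),           # random flip
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
    T.RandomErasing(p=1.0, scale=(0.25, 0.25), ratio=(1.0, 1.0), value=0),  # cutout-like
])

train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True, transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=125, shuffle=True, num_workers=2)

model = torchvision.models.resnet18(num_classes=10)   # assumed architecture
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# Decay the learning rate three times, by 0.2, at epochs 80, 100, and 120.
sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[80, 100, 120], gamma=0.2)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(140):
    for x, y in loader:
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    sched.step()
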
“…We use batch size 50, train for 200 epochs, and decay the learning rate twice at epochs 140 and 170 each by a factor 0.2. We use ZCA data preprocessing, which has been reported to be very helpful for improving neural kernel methods' performance together with cutout augmentation [77].…”
Section: A Experiments Details (mentioning)
confidence: 99%
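
The ZCA preprocessing mentioned in this quote can be written in a few lines of NumPy. The sketch below assumes flattened images as rows of a matrix and a small regularization constant eps; both the constant and the convention of fitting the transform on the training set only are assumptions for illustration.

import numpy as np

def fit_zca(X, eps=1e-2):
    # X: (n_samples, n_features) flattened images. Returns the mean and ZCA matrix.
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigendecomposition of the covariance
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mean, W

def apply_zca(X, mean, W):
    return (X - mean) @ W

# Usage: fit on the training images, then apply the same transform to test images.
# mean, W = fit_zca(train_images.reshape(len(train_images), -1))
# test_white = apply_zca(test_images.reshape(len(test_images), -1), mean, W)
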
“…[5] defined a family of "arc-cosine" kernels to imitate the computations performed by infinitely wide networks in expectation. [4] proposed kernels that are equivalent to expectations of finite-width random networks. [7] presented exact computations of some kernels, using which the kernel regression models can be shown to be the limit (in width and training time) of fully-trainable, infinitely wide fully-connected networks trained with gradient descent.…”
Section: Related Work, A. Connecting Neural Network With Kernel Methods (mentioning)
confidence: 99%
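
The "arc-cosine" kernels referenced in this statement have a simple closed form. For the order-1 case (in the sense of Cho and Saul, the expected inner product of ReLU features under random Gaussian weights), the kernel depends only on the input norms and the angle between inputs. A minimal NumPy sketch, for illustration only and assuming nonzero inputs:

import numpy as np

def arccos_kernel_order1(X, Y):
    # k(x, y) = (1/pi) * ||x|| * ||y|| * (sin(theta) + (pi - theta) * cos(theta)),
    # where theta is the angle between x and y.
    nx = np.linalg.norm(X, axis=1)            # shape (n,)
    ny = np.linalg.norm(Y, axis=1)            # shape (m,)
    cos = (X @ Y.T) / np.outer(nx, ny)        # assumes no zero-norm rows
    cos = np.clip(cos, -1.0, 1.0)             # guard against round-off
    theta = np.arccos(cos)
    return np.outer(nx, ny) * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi

# The resulting Gram matrix can be plugged into kernel ridge regression or an SVM,
# e.g. K = arccos_kernel_order1(X_train, X_train).
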
“…In fact, if we ensure that there is at least one example from each class in the training data, the modular approach needed as few as 10 randomly chosen examples to achieve 94.88% accuracy, that is, a single randomly chosen example per class. These observations suggest that our modular training method can almost completely rely on weak pairwise labels, which suggests new paradigms for obtaining labeled data that can potentially be less costly than the existing ones.…”
Section: Label Efficiency Of Modular Deep Learning (mentioning)
confidence: 99%
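
The constraint described in this quote (at least one randomly chosen example per class) amounts to a stratified draw over labels. A short sketch, with all names chosen for illustration:

import numpy as np

def one_example_per_class(labels, num_classes, seed=0):
    # Return the index of one randomly chosen example for each class.
    rng = np.random.default_rng(seed)
    picks = []
    for c in range(num_classes):
        idx = np.flatnonzero(labels == c)
        picks.append(rng.choice(idx))
    return np.array(picks)

# Usage: indices = one_example_per_class(train_labels, num_classes=10)
# gives 10 labeled examples, one per class, as in the quoted experiment.
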
“…Despite decades of intense mathematical progress, the rigorous analysis of the generalization of kernel methods remains a very active and challenging area of research. In recent years, many new kernels have been introduced for both regression and classification tasks; notably, a large number of kernels have been discovered in the context of deep learning, in particular through the so-called Scattering Transform [20], and in close connection with deep neural networks [7,15], yielding ever-improving performance for various practical tasks [1,10,16,25]. Currently, theoretical tools to select the relevant kernel for a given task, i.e.…”
Section: Introduction (mentioning)
confidence: 99%