Greg Yang scite author profile

Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation

Several recent trends in machine learning theory and practice, from the design of state-of-the-art Gaussian Processes to the convergence analysis of deep neural nets (DNNs) under stochastic gradient descent (SGD), have found it fruitful to study wide random neural networks. Central to these approaches are certain scaling limits of such networks. We unify these results by introducing a notion of a straightline tensor program that can express most neural network computations, and we characterize its scaling limit when its tensors are large and randomized. From our framework follows (1) the convergence of random neural networks to Gaussian processes for architectures such as recurrent neural networks, convolutional neural networks, residual networks, attention, and any combination thereof, with or without batch normalization; (2) conditions under which the gradient independence assumption (that weights in backpropagation can be assumed to be independent from weights in the forward pass) leads to correct computation of gradient dynamics, and corrections when it does not; (3) the convergence of the Neural Tangent Kernel, a recently proposed kernel used to predict training dynamics of neural networks under gradient descent, at initialization for all architectures in (1) without batch normalization. Mathematically, our framework is general enough to rederive classical random matrix results such as the semicircle and the Marchenko-Pastur laws, as well as recent results in neural network Jacobian singular values. We hope our work opens a way toward design of even stronger Gaussian Processes, initialization schemes to avoid gradient explosion/vanishing, and deeper understanding of SGD dynamics in modern architectures.
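The Gaussian process convergence mentioned in the abstract can be illustrated numerically. The following is a minimal sketch, not the paper's tensor program framework: it assumes a one-hidden-layer tanh network with i.i.d. N(0, 1) weights and a 1/sqrt(width) readout scaling, and all function names and default values are hypothetical choices made for this illustration. Under these assumptions, the output at a fixed input, sampled over many random networks, should have near-zero excess kurtosis and a variance close to E[tanh(w x)^2].

```python
import math
import random
import statistics


def random_net_output(x, width, rng):
    """Output of a one-hidden-layer tanh network with fresh i.i.d. N(0, 1) weights.

    The 1/sqrt(width) readout factor is an assumption of this sketch; it keeps
    the output variance of order one as the width grows.
    """
    total = 0.0
    for _ in range(width):
        w = rng.gauss(0.0, 1.0)  # input-to-hidden weight
        v = rng.gauss(0.0, 1.0)  # hidden-to-output weight
        total += v * math.tanh(w * x)
    return total / math.sqrt(width)


def sample_outputs(x=1.0, width=256, n_networks=2000, seed=0):
    """Evaluate n_networks independently initialized networks at the input x."""
    rng = random.Random(seed)
    return [random_net_output(x, width, rng) for _ in range(n_networks)]


def excess_kurtosis(samples):
    """Fourth standardized moment minus 3; zero for an exact Gaussian."""
    mean = statistics.fmean(samples)
    var = statistics.pvariance(samples, mu=mean)
    fourth = statistics.fmean((s - mean) ** 4 for s in samples)
    return fourth / var ** 2 - 3.0


def predicted_variance(x=1.0, n_draws=200_000, seed=1):
    """Monte Carlo estimate of E[tanh(w x)^2] for w ~ N(0, 1)."""
    rng = random.Random(seed)
    return statistics.fmean(
        math.tanh(rng.gauss(0.0, 1.0) * x) ** 2 for _ in range(n_draws)
    )


if __name__ == "__main__":
    outs = sample_outputs()
    print("empirical mean:     ", statistics.fmean(outs))
    print("empirical variance: ", statistics.pvariance(outs))
    print("predicted variance: ", predicted_variance())
    print("excess kurtosis:    ", excess_kurtosis(outs))
```

Increasing `width` should shrink the excess kurtosis further, while the remaining fluctuation in the printed numbers comes from the finite number of sampled networks.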

Feature Learning in Infinite-Width Neural Networks

Yang

2020

Preprint

As its width tends to infinity, a deep neural network's behavior under gradient descent can become simplified and predictable (e.g. given by the Neural Tangent Kernel (NTK)), if it is parametrized appropriately (e.g. the NTK parametrization). However, we show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for pretraining and transfer learning such as with BERT. We propose simple modifications to the standard parametrization to allow for feature learning in the limit. Using the Tensor Programs technique, we derive explicit formulas for such limits. On Word2Vec and few-shot learning on Omniglot via MAML, two canonical tasks that rely crucially on feature learning, we compute these limits exactly. We find that they outperform both NTK baselines and finite-width networks, with the latter approaching the infinite-width feature learning performance as width increases. More generally, we classify a natural space of neural network parametrizations that generalizes standard, NTK, and Mean Field parametrizations. We show 1) any parametrization in this space either admits feature learning or has an infinite-width training dynamics given by kernel gradient descent, but not both; 2) any such infinite-width limit can be computed using the Tensor Programs technique.

Figure 1: PCA of Word2Vec embeddings of top US cities and states, for NTK, width-64, and width-∞ feature learning networks (Definition 5.1). NTK embeddings are essentially random, while cities and states get naturally separated in embedding space as width increases in the feature learning regime. (Panels: NTK; Width 64; Width ∞ (Feature Learning). Point types: state, city.)

Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

Yang, Babuschkin, Sidor et al.

2022

Preprint

Hyperparameter (HP) tuning in deep learning is an expensive process, prohibitively so for neural networks (NNs) with billions of parameters. We show that, in the recently discovered Maximal Update Parametrization (µP), many optimal HPs remain stable even as model size changes. This leads to a new HP tuning paradigm we call µTransfer: parametrize the target model in µP, tune the HPs indirectly on a smaller model, and zero-shot transfer them to the full-sized model, i.e., without directly tuning the latter at all. We verify µTransfer on Transformer and ResNet. For example, 1) by transferring pretraining HPs from a model of 13M parameters, we outperform published numbers of BERT-large (350M parameters), with a total tuning cost equivalent to pretraining BERT-large once; 2) by transferring from 40M parameters, we outperform published numbers of the 6.7B GPT-3 model, with tuning cost only 7% of total pretraining cost. A PyTorch implementation of our technique is available at github.com/microsoft/mup and can be installed via pip install mup.

Recently, [57] showed that different neural network parametrizations induce different infinite-width limits and proposed the Maximal Update Parametrization (abbreviated µP) (summarized in Table 3) that enables "maximal" feature learning in the limit. Intuitively, it ensures that each layer is updated on the same order during training regardless of width. In contrast, while the standard parametrization (SP) ensures activations are of unit order at initialization, it actually causes them to blow up in wide models during training [57] essentially due to an imbalance of per-layer ...

† Work done partly during Microsoft AI Residency Program.

3DB: A Framework for Debugging Computer Vision Models

Leclerc, Salman, Ilyas et al.

2021

Preprint

We introduce 3DB: an extendable, unified framework for testing and debugging vision models using photorealistic simulation. We demonstrate, through a wide range of use cases, that 3DB allows users to discover vulnerabilities in computer vision systems and gain insights into how models make decisions. 3DB captures and generalizes many robustness analyses from prior work, and enables one to study their interplay. Finally, we find that the insights generated by the system transfer to the physical world. We are releasing 3DB as a library alongside a set of example analyses, guides, and documentation.

NAIL: A General Interactive Fiction Agent

Hausknecht, Loynd, Yang et al.

2019

Preprint

Interactive Fiction (IF) games are complex textual decision making problems. This paper introduces NAIL, an autonomous agent for general parser-based IF games. NAIL won the 2018 Text Adventure AI Competition, where it was evaluated on twenty unseen games. This paper describes the architecture, development, and insights underpinning NAIL's performance.

* Equal contribution. † Work done while author was at Microsoft Research.

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Part of the Research Solutions Family.

Greg Yang

Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation

Feature Learning in Infinite-Width Neural Networks

Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

3DB: A Framework for Debugging Computer Vision Models

NAIL: A General Interactive Fiction Agent
