Vivek Ramanujan scite author profile

Unsupervised image-to-image translation techniques are able to map local texture between two domains, but they are typically unsuccessful when the domains require larger shape change. Inspired by semantic segmentation, we introduce a discriminator with dilated convolutions that is able to use information from across the entire image to train a more context-aware generator. This is coupled with a multi-scale perceptual loss that is better able to represent error in the underlying shape of objects. We demonstrate that this design is more capable of representing shape deformation in a challenging toy dataset, plus in complex mappings with significant dataset variation between humans, dolls, and anime faces, and between cats and dogs.
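
As a rough illustration of why dilated convolutions let a discriminator draw on more of the image, the sketch below computes the receptive field of a stack of stride-1 convolutions using the standard formula 1 + sum((k - 1) * d). The kernel size and dilation rates are illustrative assumptions, not values taken from the paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in pixels, along one axis) of stacked stride-1 convolutions.

    Each layer with kernel size k and dilation d widens the field by (k - 1) * d.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical configurations: four 3x3 layers, without and with dilation.
plain = receptive_field(3, [1, 1, 1, 1])
dilated = receptive_field(3, [1, 2, 4, 8])
print(plain, dilated)
```

With the same number of layers and parameters, the dilated stack covers a much wider span of the input, which is the property the abstract relies on when it describes a more context-aware discriminator.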

What's Hidden in a Randomly Weighted Neural Network?

Ramanujan, Wortsman, Kembhavi et al. 2019. Preprint

Training a neural network is synonymous with learning the values of the weights. In contrast, we demonstrate that randomly weighted neural networks contain subnetworks which achieve impressive performance without ever training the weight values. Hidden in a randomly weighted Wide ResNet-50 [28] we show that there is a subnetwork (with random weights) that is smaller than, but matches the performance of a ResNet-34 [8] trained on ImageNet [3]. Not only do these "untrained subnetworks" exist, but we provide an algorithm to effectively find them. We empirically show that as randomly weighted neural networks with fixed weights grow wider and deeper, an "untrained subnetwork" approaches a network with learned weights in accuracy.
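
To make the idea of a "subnetwork with random weights" concrete, here is a minimal sketch in which a fixed random weight matrix is combined with a binary mask. The abstract does not describe the search algorithm, so the per-weight scores below are random placeholders and the names `top_k_mask` and `masked_linear` are hypothetical; in an actual search procedure the scores, not the weights, would be what gets optimized.

```python
import random


def top_k_mask(scores, keep_fraction):
    """Binary mask that keeps the highest-scoring fraction of weights."""
    rows, cols = len(scores), len(scores[0])
    positions = [(r, c) for r in range(rows) for c in range(cols)]
    positions.sort(key=lambda rc: scores[rc[0]][rc[1]], reverse=True)
    keep = max(1, round(keep_fraction * rows * cols))
    mask = [[0] * cols for _ in range(rows)]
    for r, c in positions[:keep]:
        mask[r][c] = 1
    return mask


def masked_linear(weights, mask, x):
    """Linear layer y = (W * M) x; the weights themselves are never changed."""
    return [
        sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
        for w_row, m_row in zip(weights, mask)
    ]


rng = random.Random(0)
# Fixed random weights: these are never trained.
weights = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
# Placeholder scores (assumption): stand-ins for whatever a search would learn.
scores = [[rng.random() for _ in range(4)] for _ in range(3)]
mask = top_k_mask(scores, keep_fraction=0.5)
output = masked_linear(weights, mask, [1.0, 2.0, 3.0, 4.0])
print(sum(map(sum, mask)), "of 12 weights kept;", len(output), "outputs")
```

The point of the sketch is only that selecting which weights participate is a separate degree of freedom from the weight values themselves.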

Supermasks in Superposition

Wortsman, Ramanujan, Liu et al. 2020. Preprint

We present the Supermasks in Superposition (SupSup) model, capable of sequentially learning thousands of tasks without catastrophic forgetting. Our approach uses a randomly initialized, fixed base network and for each task finds a subnetwork (supermask) that achieves good performance. If task identity is given at test time, the correct subnetwork can be retrieved with minimal memory usage. If it is not provided, SupSup can infer the task using gradient-based optimization to find a linear superposition of learned supermasks which minimizes the output entropy. In practice we find that a single gradient step is often sufficient to identify the correct mask, even among 2500 tasks. We also showcase two promising extensions. First, SupSup models can be trained entirely without task identity information, as they may detect when they are uncertain about new data and allocate an additional supermask for the new training distribution. Finally, the entire growing set of supermasks can be stored in a constant-sized reservoir by implicitly storing them as attractors in a fixed-sized Hopfield network.
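
The task-inference step can be sketched on a toy linear layer. This is one plausible reading of the abstract, not the authors' implementation: masks are mixed with coefficients `alphas`, the entropy of the softmax output is measured, and the task whose coefficient has the most negative entropy gradient is selected. Central finite differences stand in for backpropagation, and all weights, masks, and function names are made-up assumptions.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def superposed_entropy(alphas, weights, masks, x):
    """Output entropy when the masks are mixed with coefficients alphas."""
    logits = []
    for r, w_row in enumerate(weights):
        total = 0.0
        for c, w in enumerate(w_row):
            mix = sum(a * m[r][c] for a, m in zip(alphas, masks))
            total += mix * w * x[c]
        logits.append(total)
    return entropy(softmax(logits))


def infer_task(weights, masks, x, eps=1e-5):
    """Evaluate one gradient at uniform alphas; pick the steepest entropy decrease."""
    n = len(masks)
    alphas = [1.0 / n] * n
    grads = []
    for i in range(n):
        up = alphas[:i] + [alphas[i] + eps] + alphas[i + 1:]
        down = alphas[:i] + [alphas[i] - eps] + alphas[i + 1:]
        grads.append((superposed_entropy(up, weights, masks, x)
                      - superposed_entropy(down, weights, masks, x)) / (2 * eps))
    return min(range(n), key=lambda i: grads[i])


weights = [[4.0, 1.0], [0.0, 1.0]]   # fixed base weights (toy values)
masks = [[[1, 0], [0, 0]],           # mask 0: yields a confident output on x
         [[0, 1], [0, 1]]]           # mask 1: yields equal logits (uncertain)
print(infer_task(weights, masks, x=[1.0, 1.0]))
```

In this toy setup, increasing the coefficient of mask 0 widens the gap between the two logits and so lowers the entropy, while mask 1 adds the same amount to both logits and leaves the entropy unchanged, so mask 0 is selected.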

Forward Compatible Training for Large-Scale Embedding Retrieval Systems

Ramanujan, Vasu, Farhadi et al. 2022

Improving Shape Deformation in Unsupervised Image-to-Image Translation

Gokaslan, Ramanujan, Ritchie et al. 2018. Preprint

Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent

Merrill, Ramanujan, Goldberg et al. 2021

The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically a variant of gradient descent (GD). To better understand this bias, we study the tendency for transformer parameters to grow in magnitude (ℓ2 norm) during training, and its implications for the emergent representations within self-attention layers. Empirically, we document norm growth in the training of transformer language models, including T5 during its pretraining. As the parameters grow in magnitude, we prove that the network approximates a discretized network with saturated activation functions. Such "saturated" networks are known to have a reduced capacity compared to the full network family that can be described in terms of formal languages and automata. Our results suggest saturation is a new characterization of an inductive bias implicit in GD of particular interest for NLP. We leverage the emergent discrete structure in a saturated transformer to analyze the role of different attention heads, finding that some focus locally on a small number of positions, while other heads compute global averages, allowing counting. We believe understanding the interplay between these two capabilities may shed further light on the structure of computation within large transformers.
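
A small numerical sketch may help show what saturation looks like for attention. Multiplying the attention scores by a growing scalar is used here as a simplified stand-in for parameter norm growth (an assumption made for illustration); the scores themselves are toy values.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_attention(scores, scale):
    """Attention weights after multiplying the raw scores by a norm-like scale."""
    return softmax([scale * s for s in scores])


scores = [1.0, 3.0, 3.0, 0.0]  # toy attention scores with a tie for the maximum
for scale in (1, 10, 100):
    print(scale, [round(p, 3) for p in scaled_attention(scores, scale)])
```

As the scale grows, the weights concentrate on the maximal positions and split evenly among them, so the head effectively computes an average over the tied positions. That is the discrete behavior the abstract associates with heads that attend locally or compute global averages.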

Matryoshka Representations for Adaptive Deployment

Kusupati, Bhatt, Wallingford et al. 2022. Preprint

Learned representations are a central component in modern ML systems, serving a multitude of downstream tasks. When training such representations, it is often the case that computational and statistical constraints for each downstream task are unknown. In this context, rigid fixed-capacity representations can be either over- or under-accommodating to the task at hand. This leads us to ask: can we design a flexible representation that can adapt to multiple downstream tasks with varying computational resources? Our main contribution is Matryoshka Representation Learning (MRL), which encodes information at different granularities and allows a single embedding to adapt to the computational constraints of downstream tasks. MRL minimally modifies existing representation learning pipelines and imposes no additional cost during inference and deployment. MRL learns coarse-to-fine representations that are at least as accurate and rich as independently trained low-dimensional representations. The flexibility within the learned Matryoshka Representations offers: (a) up to 14× smaller embedding size for ImageNet-1K classification at the same level of accuracy; (b) up to 14× real-world speed-ups for large-scale retrieval on ImageNet-1K and 4K; and (c) up to 2% accuracy improvements for long-tail few-shot classification, all while being as robust as the original representations. Finally, we show that MRL extends seamlessly to web-scale datasets (ImageNet, JFT) across various modalities: vision (ViT, ResNet), vision + language (ALIGN), and language (BERT). MRL code and pretrained models are open-sourced at https://github.com/RAIVNLab/MRL. * Equal contribution: AK led the project with extensive support from GB and AR for experimentation.
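
One way a single embedding could adapt to different compute budgets is by using only a prefix of its coordinates. The sketch below assumes that reading of "coarse-to-fine" (the abstract does not spell out the mechanism) and uses made-up vectors rather than outputs of a trained model; `truncate_and_normalize` and `cosine_top1` are hypothetical helper names.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` coordinates and rescale to unit length."""
    prefix = embedding[:dim]
    norm = math.sqrt(sum(v * v for v in prefix))
    return [v / norm for v in prefix] if norm > 0 else prefix


def cosine_top1(query, database, dim):
    """Index of the database vector most similar to the query at a given size."""
    q = truncate_and_normalize(query, dim)
    sims = [
        sum(a * b for a, b in zip(q, truncate_and_normalize(item, dim)))
        for item in database
    ]
    return max(range(len(sims)), key=lambda i: sims[i])


# Toy 8-dimensional embeddings (made-up values, not from a trained model).
database = [
    [0.9, 0.1, 0.0, 0.2, 0.1, 0.0, 0.3, 0.1],
    [0.1, 0.8, 0.3, 0.0, 0.2, 0.1, 0.0, 0.4],
]
query = [0.8, 0.2, 0.1, 0.1, 0.0, 0.1, 0.2, 0.0]
for dim in (2, 4, 8):
    print(dim, cosine_top1(query, database, dim))
```

A smaller prefix means fewer multiplications per comparison, which is where retrieval speed-ups of the kind the abstract reports would come from, provided the short prefix still ranks items correctly.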
