We introduce new methods for 1) accelerating and 2) stabilizing training for large language-vision models. 1) Towards accelerating training, we introduce Switch-Back, a linear layer for int8 quantized training which provides a speed-up of 13-25% while matching the performance of bfloat16 training within 0.1 percentage points for the 1B parameter CLIP ViT-Huge-the largest int8 training to date. Our main focus is int8 as GPU support for float8 is rare, though we also analyze float8 training through simulation. While SwitchBack proves effective for float8, we show that standard techniques are also successful if the network is trained and initialized so that large feature magnitudes are discouraged, which we accomplish via layer-scale initialized with zeros. 2) Towards stable training, we analyze loss spikes and find they consistently occur 1-8 iterations after the squared gradients become underestimated by their AdamW second moment estimator. As a result, we recommend an AdamW-Adafactor hybrid, which we refer to as StableAdamW because it avoids loss spikes when training a CLIP ViT-Huge model and outperforms gradient clipping.
The Discrete Fourier Transform (DFT) is a fundamental component of numerous computational techniques in signal processing and scientific computing. The most popular means of computing the DFT is the Fast Fourier Transform (FFT). However, with the emergence of big data problems, in which the size of the processed data sets can easily exceed terabytes, the "Fast" in Fast Fourier Transform is often no longer fast enough. In addition, in many big data applications it is hard to acquire a sufficient amount of data in order to compute the desired Fourier transform in the first place. The Sparse Fourier Transform (SFT) addresses the big data setting by computing a compressed Fourier transform using only a subset of the input data, in time sub-linear in the data set size. The goal of this article is to survey these recent developments, to explain the basic techniques with examples and applications in big data, to demonstrate trade-offs in empirical performance of the algorithms, and to discuss the connection between the SFT and other techniques for massive data analysis such as streaming algorithms and compressive sensing.
We design a new, fast algorithm for agnostically learning univariate probability distributions whose densities are well-approximated by piecewise polynomial functions. Let f be the density function of an arbitrary univariate distribution, and suppose that f is OPT-close in L 1 -distance to an unknown piecewise polynomial function with t interval pieces and degree d. For any γ > 0, ourruns in time O(n), and with probability at least 9/10 outputs an O γ (t)-piecewise degree-d hypothesis h that is (3 + γ) · OPT + close to f. Our approximation factor almost matches the best known information-theoretic (but computationally inefficient) upper bound of 3. Our general algorithm yields (nearly) sampleoptimal and nearly-linear time estimators for a wide range of structured distribution families over both continuous and discrete domains in a unified way. For most of our applications, these are the first sample-optimal and nearly-linear time estimators in the literature. As a consequence, our work resolves the sample and computational complexities of a broad class of inference tasks via a single "meta-algorithm". Moreover, we demonstrate that our algorithm performs very well in experiments.Our algorithm consists of three levels: (i) At the top level, we employ an iterative greedy algorithm for finding a good partition of the real line into the pieces of a piecewise polynomial. (ii) For each piece, we show that the sub-problem of finding a good polynomial fit on the *
We show through theory and experiment that gradient-based explanations of a model quickly reveal the model itself. Our results speak to a tension between the desire to keep a proprietary model secret and the ability to offer model explanations.On the theoretical side, we give an algorithm that provably learns a two-layer ReLU network in a setting where the algorithm may query the gradient of the model with respect to chosen inputs. The number of queries is independent of the dimension and nearly optimal in its dependence on the model size. Of interest not only from a learning-theoretic perspective, this result highlights the power of gradients rather than labels as a learning primitive.Complementing our theory, we give effective heuristics for reconstructing models from gradient explanations that are orders of magnitude more query-efficient than reconstruction attacks relying on prediction interfaces. arXiv:1807.05185v1 [stat.ML]
Compressive Sensing (CS) states that a sparse signal can be recovered from a small number of linear measurements, and that this recovery can be performed efficiently in polynomial time. The framework of model-based compressive sensing (model-CS) leverages additional structure in the signal and provides new recovery schemes that can reduce the number of measurements even further. This idea has led to measurement-efficient recovery schemes for a variety of signal models. However, for any given model, model-CS requires an algorithm that solves the model-projection problem: given a query signal, report the signal in the model that is closest to the query signal. Often, this optimization problem can be computationally very expensive. Moreover, an approximation algorithm is not sufficient to provably recover the signal. As a result, the model-projection problem poses a fundamental obstacle for extending model-CS to many interesting classes of models.In this paper, we introduce a new framework that we call approximation-tolerant modelbased compressive sensing. This framework includes a range of algorithms for sparse recovery that require only approximate solutions for the model-projection problem. In essence, our work removes the aforementioned obstacle to model-based compressive sensing, thereby extending model-CS to a much wider class of signal models. Interestingly, all our algorithms involve both the minimization and maximization variants of the model-projection problem.We instantiate our new framework for a new signal model that we call the Constrained Earth Mover Distance (CEMD) model. This model is particularly useful for signal ensembles where the positions of the nonzero coefficients do not change significantly as a function of spatial (or temporal) location. We develop novel approximation algorithms for both the maximization and the minimization versions of the model-projection problem via graph optimization techniques. Leveraging these algorithms and our framework results in a nearly sample-optimal sparse recovery scheme for the CEMD model.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.