No abstract
Recent advances in deep learning have relied heavily on the use of large Transformers due to their ability to learn at scale. However, the core building block of Transformers, the attention operator, exhibits quadratic cost in sequence length, limiting the amount of context accessible. Existing subquadratic methods based on low-rank and sparse approximations need to be combined with dense attention layers to match Transformers, indicating a gap in capability. In this work, we propose Hyena, a subquadratic drop-in replacement for attention constructed by interleaving implicitly parametrized long convolutions and data-controlled gating. In recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens, Hyena improves accuracy by more than 50 points over operators relying on statespaces and other implicit and explicit methods, matching attention-based models. We set a new state-ofthe-art for dense-attention-free architectures on language modeling in standard datasets (WikiText103 and The Pile), reaching Transformer quality with a 20% reduction in training compute required at sequence length 2K. Hyena operators are twice as fast as highly optimized attention at sequence length 8K, and 100× faster at sequence length 64K.
Continuous-depth learning has recently emerged as a novel perspective on deep learning, improving performance in tasks related to dynamical systems and density estimation. Core to these approaches is the neural differential equation, whose forward passes are the solutions of an initial value problem parametrized by a neural network. Unlocking the full potential of continuous-depth models requires a different set of software tools, due to peculiar differences compared to standard discrete neural networks, e.g inference must be carried out via numerical solvers. We introduce TorchDyn, a PyTorch library dedicated to continuous-depth learning, designed to elevate neural differential equations to be as accessible as regular plug-and-play deep learning primitives. This objective is achieved by identifying and subdividing different variants into common essential components, which can be combined and freely repurposed to obtain complex compositional architectures. TorchDyn further offers step-by-step tutorials and benchmarks designed to guide researchers and contributors.
Neural networks are discrete entities: subdivided into discrete layers and parametrized by weights which are iteratively optimized via difference equations. Recent work proposes networks with layer outputs which are no longer quantized but are solutions of an ordinary differential equation (ODE); however, these networks are still optimized via discrete methods (e.g. gradient descent). In this paper, we explore a different direction: namely, we propose a novel framework for learning in which the parameters themselves are solutions of ODEs. By viewing the optimization process as the evolution of a port-Hamiltonian system, we can ensure convergence to a minimum of the objective function. Numerical experiments have been performed to show the validity and effectiveness of the proposed methods. These authors contributed equally to the work.© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. arXiv:1909.02702v1 [cs.NE] 6 Sep 2019 PH system beHence, the loss function might be redefined adding a term J kin (ϑ, ω) equivalent to a pseudo kinetic energy:.Note that J (û,ŷ, ϑ) represents the potential energy of the fictitious mechanical system.As for general p degrees-of-freedom mechanical system in PH form, the choice of J and R is:
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.