We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and to investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence, and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing the effectiveness of the novel attention-learning paradigm leveraged by Performers.
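For readers who want the mechanism behind the claim: FAVOR+ replaces the softmax kernel exp(q·k) with an inner product of strictly positive random features, which keeps the estimator unbiased and well behaved where the true softmax values are small. Below is a minimal NumPy sketch under our own naming (orthogonal_gaussian and positive_features are illustrative names, not the authors' API):

```python
import numpy as np

def orthogonal_gaussian(m, d, rng):
    """Draw m x d projections whose rows are exactly orthogonal within
    each d-row block, rescaled so row lengths match i.i.d. Gaussians."""
    blocks = []
    for _ in range(int(np.ceil(m / d))):
        q, _ = np.linalg.qr(rng.standard_normal((d, d)))
        blocks.append(q)
    w = np.concatenate(blocks, axis=0)[:m]
    row_norms = np.sqrt(rng.chisquare(d, size=(m, 1)))
    return w * row_norms

def positive_features(x, w):
    """phi(x) = exp(x @ w.T - |x|^2 / 2) / sqrt(m): strictly positive
    features with E[phi(q) . phi(k)] = exp(q . k), the softmax kernel."""
    m = w.shape[0]
    sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(x @ w.T - sq_norm) / np.sqrt(m)
```

Rescaling queries and keys by d^(-1/4) before applying the feature map folds the usual 1/sqrt(d) softmax temperature into the kernel, so phi_q @ phi_k.T estimates the familiar exp(Q K^T / sqrt(d)) entrywise.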
Learning adaptable policies is crucial for robots to operate autonomously in our complex and quickly changing world. In this work, we present a new meta-learning method that allows robots to quickly adapt to changes in dynamics. In contrast to gradient-based meta-learning algorithms that rely on second-order gradient estimation, we introduce a more noise-tolerant Batch Hill-Climbing adaptation operator and combine it with meta-learning based on evolutionary strategies. Our method significantly improves adaptation to changes in dynamics in high noise settings, which are common in robotics applications. We validate our approach on a quadruped robot that learns to walk while subject to changes in dynamics. We observe that our method significantly outperforms prior gradient-based approaches, enabling the robot to adapt its policy to changes based on less than 3 minutes of real data.
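The abstract names the ingredients (an evolutionary-strategies meta-learner plus a noise-tolerant Batch Hill-Climbing adaptation operator) without pseudocode; the sketch below is our guess at the shape of such an operator, not the paper's exact algorithm. episode_return, the batch size, and the re-scored incumbent are all illustrative assumptions:

```python
import numpy as np

def batch_hill_climb(theta, episode_return, rounds=20, batch=16, sigma=0.05, seed=0):
    """Each round, score a batch of Gaussian perturbations of the current
    policy parameters together with the incumbent (re-evaluated, so one
    lucky noisy score cannot lock in a bad policy) and keep the best."""
    rng = np.random.default_rng(seed)
    for _ in range(rounds):
        candidates = [theta] + [theta + sigma * rng.standard_normal(theta.shape)
                                for _ in range(batch)]
        scores = [episode_return(c) for c in candidates]  # noisy rollout returns
        theta = candidates[int(np.argmax(scores))]
    return theta
```

Because only the argmax of each batch matters, a single unlucky noisy return perturbs one candidate rather than an averaged gradient estimate, which is one plausible reading of the claimed noise tolerance over second-order gradient estimation.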
Transformer models have achieved state-of-the-art results across a diverse range of domains. However, concern over the cost of training the attention mechanism to learn complex dependencies between distant inputs continues to grow. In response, solutions that exploit the structure and sparsity of the learned attention matrix have blossomed. However, real-world applications that involve long sequences, such as biological sequence analysis, may fall short of meeting these assumptions, precluding exploration of these models. To address this challenge, we present a new Transformer architecture, Performer, based on Fast Attention Via Orthogonal Random features (FAVOR). Our mechanism scales linearly rather than quadratically in the number of tokens in the sequence, is characterized by sub-quadratic space complexity and does not incorporate any sparsity pattern priors. Furthermore, it provides strong theoretical guarantees: unbiased estimation of the attention matrix and uniform convergence. It is also backwards-compatible with pre-trained regular Transformers. We demonstrate its effectiveness on the challenging task of protein sequence modeling and provide detailed theoretical analysis.
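The linear scaling follows from the kernel view: once attention weights are (approximately) inner products of feature maps, associativity of matrix multiplication lets keys and values be aggregated before queries are applied, so the n x n attention matrix is never materialized. A minimal NumPy sketch for the bidirectional (non-causal) case, assuming feature-mapped queries and keys phi_q, phi_k such as those produced by the map sketched above:

```python
import numpy as np

def linear_attention(phi_q, phi_k, v):
    """phi_q: (n, m), phi_k: (n, m), v: (n, d_v). Computes
    D^{-1} (phi_q phi_k^T) v without forming the (n, n) matrix."""
    kv = phi_k.T @ v                          # (m, d_v): aggregate keys/values once
    normalizer = phi_q @ phi_k.sum(axis=0)    # (n,): implicit attention row sums
    return (phi_q @ kv) / normalizer[:, None]
```

Causal attention needs a running prefix sum of the per-position outer products phi(k_j) v_j^T in place of the single kv summary, which preserves the linear cost.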
The reaction of [Cp′₂YMe]₂ (Cp′ = C₅H₅, C₅H₄SiMe₃) with B(C₆F₅)₃ affords the complexes Cp′₂Y{MeB(C₆F₅)₃}. The anion is coordinated in a chelating fashion via one ortho-fluorine atom and agostic interactions to two of the methyl hydrogens; the complexes are highly fluxional in solution. They act as initiators for the carbocationic polymerization of isobutene.
Antimony(III) and bismuth(III) complexes of sterically demanding arenechalcogenolato ligands, M(EC₆H₂R′₃-2,4,6)₃ (E = S or Se; M = Sb or Bi; R′ = Me, Prⁱ or Buᵗ), have been prepared either by protolysis of the amides M[N(SiMe₃)₂]₃ with arenechalcogenols, or from MCl₃ by halide exchange (M = Bi or Sb). The complexes are monomeric in the solid state and sublime readily. The crystal structure of Sb(SC₆H₂Prⁱ₃-2,4,6)₃ has been determined by X-ray diffraction. The compound possesses a trigonal-pyramidal geometry, with Sb–S distances of 2.418(2)–2.438(2) Å and S–Sb–S angles of 94.69(7)–98.29(8)°. Preliminary X-ray results on Bi(SeC₆H₂Prⁱ₃-2,4,6)₃ showed that the compounds of Sb and Bi are isostructural. Thermolytic decomposition of some of the compounds has been carried out in the solid state. Compounds with R′ = Me or Prⁱ undergo reductive elimination to give elemental bismuth or antimony, whereas the bulky selenolates M(SeC₆H₂Buᵗ₃-2,4,6)₃ afford M₂Se₃.