2019
DOI: 10.48550/arxiv.1906.02762
Preprint

Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View

Abstract: The Transformer architecture is widely used in natural language processing. Despite its success, the design principle of the Transformer remains elusive. In this paper, we provide a novel perspective towards understanding the architecture: we show that the Transformer can be mathematically interpreted as a numerical Ordinary Differential Equation (ODE) solver for a convection-diffusion equation in a multi-particle dynamic system. In particular, how words in a sentence are abstracted into contexts by passing th…
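
A hedged sketch of the splitting-scheme reading behind this abstract, in our own notation (F for the particle-interaction term realized by self-attention, G for the per-particle convection term realized by the feed-forward layer, and x_i^ℓ for the representation of word i at layer ℓ; none of these symbols are quoted from the paper):

```latex
% Sketch only: notation is ours, not the paper's.
% Each word embedding x_i evolves as a particle in an interacting system:
% interaction term F (self-attention) plus per-particle convection term G (feed-forward).
\begin{align*}
  \frac{d x_i(t)}{dt} &= F\bigl(x_i(t), [x_j(t)]_{j=1}^{n}, t\bigr) + G\bigl(x_i(t), t\bigr)\\[4pt]
  % First-order (Lie--Trotter) splitting of one step ~ a standard Transformer layer:
  \tilde{x}_i &= x_i^{\ell} + \mathrm{Attn}\bigl(x_i^{\ell}, [x_j^{\ell}]_{j=1}^{n}\bigr),
  \qquad x_i^{\ell+1} = \tilde{x}_i + \mathrm{FFN}(\tilde{x}_i)\\[4pt]
  % Second-order (Strang) splitting ~ a Macaron-style layer with two half-step FFNs:
  \tilde{x}_i &= x_i^{\ell} + \tfrac{1}{2}\,\mathrm{FFN}(x_i^{\ell}),
  \qquad \hat{x}_i = \tilde{x}_i + \mathrm{Attn}\bigl(\tilde{x}_i, [\tilde{x}_j]_{j=1}^{n}\bigr),
  \qquad x_i^{\ell+1} = \hat{x}_i + \tfrac{1}{2}\,\mathrm{FFN}(\hat{x}_i)
\end{align*}
```

The higher-order Strang step is what motivates the Macaron-style "half-step FFN, attention, half-step FFN" layout that several of the citing statements below refer to.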

Cited by 60 publications (46 citation statements). References 40 publications (77 reference statements).

Citation statements:
“…Conformer [16] achieves state-of-the-art results on LibriSpeech, outperforming the previous best published Transformer Transducer by a 15% relative improvement on the test-other dataset. The core methods they propose are replacing the original feed-forward layer in the Transformer block with two half-step feed-forward layers, inspired by Macaron-Net [28], and adding a convolution module containing a gating mechanism after the multi-headed self-attention.…”
Section: Conformer
confidence: 99%
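
To make the block layout described in this statement concrete, here is a minimal, hedged PyTorch-style sketch; the module names, dimensions, and normalization placement are our assumptions, not the Conformer or Macaron-Net reference implementations.

```python
# Illustrative sketch only: names and hyperparameters are ours, not the
# official Conformer/Macaron-Net code. It shows the layout quoted above:
# half-step FFN -> self-attention -> gated convolution module -> half-step FFN.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HalfStepFFN(nn.Module):
    """Feed-forward sub-layer whose residual update is scaled by 1/2 (a 'half step')."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))

    def forward(self, x):
        return x + 0.5 * self.ff(self.norm(x))


class GatedConvModule(nn.Module):
    """Pointwise projection with a GLU gate, then a depthwise 1-D convolution."""
    def __init__(self, d_model: int, kernel_size: int = 3):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise = nn.Linear(d_model, 2 * d_model)        # doubled width for GLU
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                                       # x: (batch, seq, d_model)
        y = F.glu(self.pointwise(self.norm(x)), dim=-1)         # gating mechanism
        y = self.depthwise(y.transpose(1, 2)).transpose(1, 2)   # conv over the sequence
        return x + self.out(y)


class MacaronConvBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4, d_ff: int = 1024):
        super().__init__()
        self.ffn1 = HalfStepFFN(d_model, d_ff)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = GatedConvModule(d_model)
        self.ffn2 = HalfStepFFN(d_model, d_ff)

    def forward(self, x):
        x = self.ffn1(x)                                        # first half-step FFN
        q = self.attn_norm(x)
        attn_out, _ = self.attn(q, q, q)                        # multi-headed self-attention
        x = x + attn_out
        x = self.conv(x)                                        # convolution module with gating
        return self.ffn2(x)                                     # second half-step FFN


if __name__ == "__main__":
    block = MacaronConvBlock()
    print(block(torch.randn(2, 10, 256)).shape)                 # -> torch.Size([2, 10, 256])
```

The 0.5 scaling on each feed-forward residual is the "half step" that the Strang-splitting view above associates with Macaron-Net.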
“…We were mathematically inspired by Del Moral (2004), who studied self-interacting "Feynman-Kac models" using semigroup techniques (including contractions for nonlinear operators on measures). An interacting particle interpretation of attention is studied in Lu et al. (2019) using tools from dynamical systems theory.…”
Section: Related Work
confidence: 99%
“…Specific to Transformer structures, various modifications have been proposed. For example, human-knowledge-powered designs include DynamicConv, Macaron Network (Lu et al., 2019), Reformer (Kitaev et al., 2020) and others (Fonollosa et al., 2019; Ahmed et al., 2017; Shaw et al., 2018). As for automatic searching, neural architecture …”
Section: Related Work
confidence: 99%
“…A Transformer model is a stack of several identical blocks, and each block consists of sequentially ordered layers: the self-attention (SA), encoder-decoder attention (ED, decoder only) and feed-forward (FF) layers. Recently, various modifications have been proposed, where the focus is on replacing or inserting some components (e.g., the attention layer, layer norm, or position encoding) in the standard Transformer (Lu et al., 2019; Shaw et al., 2018; So et al., 2019; Ahmed et al., 2017). Although these Transformer alternatives have achieved improved performance, one critical element is almost neglected in current models: how to arrange the components within a Transformer network, i.e., the layer order also matters.…”
Section: Introduction
confidence: 99%
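
A small, hedged sketch of the point this last statement makes (the class and argument names are ours, not the cited paper's code): the same sub-layers, assembled in different orders, yield different blocks, which is exactly the design dimension the quoted introduction says is neglected.

```python
# Hedged illustration (ours): an encoder-side block whose sub-layer order is
# a configuration choice rather than a fixed SA -> FF sequence.
import torch
import torch.nn as nn


class OrderedBlock(nn.Module):
    LAYERS = ("sa", "ff")  # self-attention, feed-forward (encoder side only here)

    def __init__(self, order=("sa", "ff"), d_model: int = 256,
                 n_heads: int = 4, d_ff: int = 1024):
        super().__init__()
        assert all(name in self.LAYERS for name in order)
        self.order = order
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleDict({name: nn.LayerNorm(d_model) for name in self.LAYERS})

    def forward(self, x):
        for name in self.order:
            y = self.norms[name](x)
            if name == "sa":
                y, _ = self.attn(y, y, y)
            else:
                y = self.ff(y)
            x = x + y          # residual connection around each sub-layer
        return x


# The same parameters, two different component arrangements:
x = torch.randn(2, 10, 256)
standard = OrderedBlock(order=("sa", "ff"))   # SA -> FF (standard encoder block)
reordered = OrderedBlock(order=("ff", "sa"))  # FF -> SA (an alternative arrangement)
print(standard(x).shape, reordered(x).shape)
```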