2021
DOI: 10.48550/arxiv.2102.11972
Preprint

Do Transformer Modifications Transfer Across Implementations and Applications?

Cited by 25 publications (41 citation statements)
References 0 publications

“…There is likely further progress to be made by optimizing the architectures. Within supervised learning, there is ample systematic work on the effects of architectures [49,62], and similar studies could be fruitful in RL. Such large-scale studies on network architecture in RL could be as impactful as theoretical innovations, and we encourage more focus on network architecture.…”
Section: Discussion (mentioning)
confidence: 99%
“…the output of a layer is $x + W_2\,\mathrm{ReLU}(W_1\,\mathrm{norm}(x))$. Compared to the original architecture, normalization is applied before the feedforward blocks instead of after, which is now strongly favored in practice [49,68]. We use four transformer blocks for the critic, which is responsible for learning the environment rewards, and two for the actor.…”
Section: Testing Modern Network (mentioning)
confidence: 99%
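
The layer equation in this quote describes a pre-norm residual feed-forward sub-layer. A minimal sketch of that ordering, assuming PyTorch (module and parameter names here are illustrative, not taken from the cited work):

```python
import torch
import torch.nn as nn

class PreNormFeedForward(nn.Module):
    """Residual feed-forward sub-layer with pre-normalization:
    output = x + W2 * relu(W1 * norm(x))."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)        # normalization applied *before* the block
        self.w1 = nn.Linear(d_model, d_hidden)   # expansion projection (W1)
        self.w2 = nn.Linear(d_hidden, d_model)   # projection back to the model dimension (W2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize first, transform, then add the residual connection.
        return x + self.w2(torch.relu(self.w1(self.norm(x))))
```

Placing the normalization inside the residual branch (pre-norm), rather than after the addition (post-norm), is the modification the quote refers to.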
“…In addition, we apply the layer normalization (LN) before the multi-head self-attention (MHA) and the feed-forward blocks (FFN) instead of after [50]. This modification has been unanimously adopted by all current Transformer implementations because it leads to more effective optimization [40]. Especially, for FFN sub-layer, we set the dimensionality of input, output, and the inner-layer to the same dimension with d. We formally characterize the Graphormer layer as below:…”
Section: Implementation Details Of Graphormer (mentioning)
confidence: 99%
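
The formal characterization referred to in this quote is truncated in the excerpt. As a hedged sketch, the pre-LN ordering it describes corresponds to the standard pre-norm Transformer layer equations (the notation below is the generic one, not necessarily the cited paper's exact symbols):

```latex
% Pre-LN Transformer layer: LayerNorm precedes MHA and FFN,
% with a residual connection around each sub-layer.
\begin{aligned}
  h'^{(l)} &= \mathrm{MHA}\bigl(\mathrm{LN}(h^{(l-1)})\bigr) + h^{(l-1)},\\
  h^{(l)}  &= \mathrm{FFN}\bigl(\mathrm{LN}(h'^{(l)})\bigr) + h'^{(l)}.
\end{aligned}
```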
“…The power and widespread availability of large pretrained language models such as BERT [17], RoBERTa [15], GPT-2 [30], and XLNet [41] has resulted in pretrained language models dominating the field. Although architectural modifications to large pretrained language models have been successful [44], studies have shown that architectural modifications to large language models are brittle [24] and often do not transfer across implementations and applications. In addition, [22] finds that reproducing the results of modifications to BERT architectures is difficult.…”
Section: Target/Aspect Sentiment Classification Without Architectural... (mentioning)
confidence: 99%