2021
DOI: 10.48550/arxiv.2102.11972
Preprint

Do Transformer Modifications Transfer Across Implementations and Applications?

Cited by 25 publications (41 citation statements)
References 0 publications

“…There is likely further progress to be made by optimizing the architectures. Within supervised learning, there is ample systematic work on the effects of architectures [49,62], and similar studies could be fruitful in RL. Such large-scale studies on network architecture in RL could be as impactful as theoretical innovations, and we encourage more focus on network architecture.…”
Section: Discussion (mentioning)
confidence: 99%
“…the output of a layer is $x + W_2\,\mathrm{ReLU}(W_1\,\mathrm{norm}(x))$. Compared to the original architecture, normalization is applied before the feedforward blocks instead of after, which is now strongly favored in practice [49,68]. We use four transformer blocks for the critic, which is responsible for learning the environment rewards, and two for the actor.…”
Section: Testing Modern Network (mentioning)
confidence: 99%
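
The layer equation in this quote describes a pre-norm residual feed-forward sub-layer. A minimal sketch of that ordering, assuming PyTorch (module and parameter names here are illustrative, not taken from the cited work):

```python
import torch
import torch.nn as nn

class PreNormFeedForward(nn.Module):
    """Residual feed-forward sub-layer with pre-normalization:
    output = x + W2 * relu(W1 * norm(x))."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)        # normalization applied *before* the block
        self.w1 = nn.Linear(d_model, d_hidden)   # expansion projection (W1)
        self.w2 = nn.Linear(d_hidden, d_model)   # projection back to the model dimension (W2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize first, transform, then add the residual connection.
        return x + self.w2(torch.relu(self.w1(self.norm(x))))
```

Placing the normalization inside the residual branch (pre-norm), rather than after the addition (post-norm), is the modification the quote refers to.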
“…In addition, we apply the layer normalization (LN) before the multi-head self-attention (MHA) and the feed-forward blocks (FFN) instead of after [50]. This modification has been unanimously adopted by all current Transformer implementations because it leads to more effective optimization [40]. Especially, for FFN sub-layer, we set the dimensionality of input, output, and the inner-layer to the same dimension with d. We formally characterize the Graphormer layer as below:…”
Section: Implementation Details Of Graphormer (mentioning)
confidence: 99%
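
The formal characterization referred to in this quote is truncated in the excerpt. As a hedged sketch, the pre-LN ordering it describes corresponds to the standard pre-norm Transformer layer equations (the notation below is the generic one, not necessarily the cited paper's exact symbols):

```latex
% Pre-LN Transformer layer: LayerNorm precedes MHA and FFN,
% with a residual connection around each sub-layer.
\begin{aligned}
  h'^{(l)} &= \mathrm{MHA}\bigl(\mathrm{LN}(h^{(l-1)})\bigr) + h^{(l-1)},\\
  h^{(l)}  &= \mathrm{FFN}\bigl(\mathrm{LN}(h'^{(l)})\bigr) + h'^{(l)}.
\end{aligned}
```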
“…The power and widespread availability of large pretrained language models such as BERT [17], RoBERTa [15], GPT-2 [30], and XLNet [41] has resulted in pretrained language models dominating the field. Although architectural modifications to large pretrained language models have been successful [44], studies have shown that architectural modifications to large language models are brittle [24] and often do not transfer across implementations and applications. In addition, [22] finds that reproducing the results of modifications to BERT architectures is difficult.…”
Section: Target/Aspect Sentiment Classification Without Architectural... (mentioning)
confidence: 99%