Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2021.emnlp-main.191
RankNAS: Efficient Neural Architecture Search by Pairwise Ranking

Abstract: This paper addresses the efficiency challenge of Neural Architecture Search (NAS) by formulating the task as a ranking problem. Previous methods require numerous training examples to estimate the accurate performance of architectures, although the actual goal is to find the distinction between "good" and "bad" candidates. Here we do not resort to performance predictors. Instead, we propose a performance ranking method (RankNAS) via pairwise ranking. It enables efficient architecture search using much fewer tra…
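As a rough illustration of the pairwise-ranking idea in the abstract, the sketch below trains a scorer on pairs of candidate architectures so that the better architecture of each pair receives the higher score. The feature encoding, the linear scorer, and the logistic pairwise loss are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of pairwise architecture ranking, assuming (hypothetically)
# that each candidate architecture is encoded as a fixed-length feature vector
# and that a binary label says which architecture of a pair performs better.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairwiseRanker(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        # Simple linear scorer over architecture features (hypothetical choice).
        self.scorer = nn.Linear(feat_dim, 1)

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # Score difference s(a) - s(b) for each pair in the batch.
        return self.scorer(feats_a).squeeze(-1) - self.scorer(feats_b).squeeze(-1)

def pairwise_ranking_loss(score_diff: torch.Tensor, a_is_better: torch.Tensor) -> torch.Tensor:
    # Logistic pairwise loss: push s(a) > s(b) when "a" is the better
    # architecture, and s(b) > s(a) otherwise.
    sign = a_is_better.float() * 2.0 - 1.0
    return F.softplus(-sign * score_diff).mean()

# Toy usage: 8 pairs of architectures, each described by 16 features.
ranker = PairwiseRanker(feat_dim=16)
feats_a, feats_b = torch.randn(8, 16), torch.randn(8, 16)
labels = torch.rand(8) > 0.5  # True where architecture "a" is the better one
loss = pairwise_ranking_loss(ranker(feats_a, feats_b), labels)
loss.backward()
```

Because the ranker only needs to order candidates rather than predict their exact accuracy, it can be trained from far fewer evaluated architectures than a full performance predictor, which is the efficiency argument the abstract makes.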

Cited by 6 publications (18 citation statements) | References: 27 publications
“…To automate the network design, we take advantage of neural architecture search (NAS) [11,23]. It has been widely employed in searching standard Transformer architectures in Natural Language Processing [14,16,29,36,38] and Computer Vision [5,6,10,13]. These studies mainly focus on refining search space and/or improving search algorithms.…”
Section: Softmax Attention or Linear Attention
Mentioning confidence: 99%
“…However, these methods suffer from long training and large search costs because all the candidates need to be optimized, evaluated, and ranked. For the purpose of lowering these costs, we utilize RankNAS [16], a new efficient NAS framework for searching the standard Transformer [35]. It can significantly speed up the search procedure through pairwise ranking, search space pruning, and the hardware-aware constraint.…”
Section: Softmax Attention or Linear Attention
Mentioning confidence: 99%
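As a rough sketch of what the hardware-aware constraint mentioned in this excerpt can look like, the snippet below filters candidate architectures against a latency budget before any ranking is done. The helper names (measure_latency_ms, budget_ms) and the toy latency model are assumptions for illustration, not part of the RankNAS release.

```python
# Minimal sketch of a hardware-aware constraint: only candidates that fit a
# latency budget on the target device are kept for ranking.
from typing import Callable, Dict, List

def filter_by_latency(
    candidates: List[Dict],
    measure_latency_ms: Callable[[Dict], float],  # hypothetical latency probe
    budget_ms: float,
) -> List[Dict]:
    # Keep only architectures whose measured latency fits the budget, so the
    # ranker never compares candidates that are too slow to deploy.
    return [arch for arch in candidates if measure_latency_ms(arch) <= budget_ms]

# Toy usage with a fake latency model in which latency grows with layer count.
candidates = [{"layers": n, "heads": 8} for n in range(1, 13)]
fake_latency = lambda arch: 5.0 * arch["layers"]
fast_enough = filter_by_latency(candidates, fake_latency, budget_ms=30.0)
print(len(fast_enough), "of", len(candidates), "candidates fit the budget")
```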
“…Hence, we accelerate the decoding by reducing the number of decoder layers and removing the multi-head mechanism. Inspired by Hu et al. (2021), we design the lightweight Transformer student model with one decoder layer. We further remove the multi-head mechanism in the decoder's attention modules.…”
Section: Lightweight Transformer Student Models
Mentioning confidence: 99%
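For a concrete picture of the lightweight student decoder described in this excerpt, the sketch below builds a one-layer decoder with single-head attention from stock PyTorch modules. The model dimensions are illustrative assumptions, and this is a stand-in, not the authors' actual student model.

```python
# Minimal sketch of a lightweight student decoder: one decoder layer with
# single-head (nhead=1) attention instead of multi-head attention.
import torch
import torch.nn as nn

d_model = 512  # illustrative hidden size
decoder_layer = nn.TransformerDecoderLayer(
    d_model=d_model,
    nhead=1,                 # single-head attention in place of multi-head
    dim_feedforward=2048,
    batch_first=True,
)
student_decoder = nn.TransformerDecoder(decoder_layer, num_layers=1)  # one decoder layer

# Toy forward pass: batch of 2, target length 7, source (memory) length 11.
tgt = torch.randn(2, 7, d_model)
memory = torch.randn(2, 11, d_model)
out = student_decoder(tgt, memory)
print(out.shape)  # torch.Size([2, 7, 512])
```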
“…Large and deep Transformer models have dominated machine translation (MT) tasks in recent years (Vaswani et al., 2017; Edunov et al., 2018; Raffel et al., 2020). Despite their high accuracy, these models are inefficient and difficult to deploy (Wang et al., 2020a; Hu et al., 2021). Many efforts have been made to improve the translation efficiency, including efficient architectures (Li et al., 2021a,b), quantization (Bhandare et al., 2019), and knowledge distillation (Lin et al., 2021a).…”
Section: Introduction
Mentioning confidence: 99%