2021
DOI: 10.48550/arxiv.2112.13890
Preprint

SPViT: Enabling Faster Vision Transformers via Soft Token Pruning

Cited by 5 publications (7 citation statements). References 53 publications (62 reference statements).
“…[53,15,30,69] propose different heuristics based on the attention weights to halt or aggregate tokens. [25] combines both token selection and aggregation. [79] proposes a slow-fast token update that applies token-wise transformations on the halted tokens and attention-based transformations on those that are not halted.…”
Section: Dynamic Transformer (mentioning, confidence: 99%)
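The halting and aggregation heuristics described in the statement above typically rank patch tokens by how strongly the class token attends to them and keep only the top-ranked tokens for later layers. Below is a minimal, hypothetical sketch of that idea; the function name, tensor shapes, and keep ratio are illustrative assumptions, not the exact procedure of any cited method.

```python
# Hypothetical sketch: prune patch tokens using class-token attention as importance.
import torch

def prune_by_cls_attention(tokens: torch.Tensor, attn: torch.Tensor, keep_ratio: float = 0.5):
    """tokens: (B, 1 + N, D) with the class token first; attn: (B, heads, 1 + N, 1 + N)."""
    B, L, D = tokens.shape
    n_patches = L - 1
    # Importance of each patch token = class-token attention to it, averaged over heads.
    cls_attn = attn[:, :, 0, 1:].mean(dim=1)              # (B, N)
    k = max(1, int(n_patches * keep_ratio))
    keep_idx = cls_attn.topk(k, dim=1).indices            # (B, k)
    keep_idx, _ = keep_idx.sort(dim=1)                    # preserve spatial order
    patch_tokens = tokens[:, 1:, :]
    kept = torch.gather(patch_tokens, 1,
                        keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return torch.cat([tokens[:, :1, :], kept], dim=1)     # (B, 1 + k, D)

# Example: 196 patch tokens reduced to 98 before the next transformer block.
x = torch.randn(2, 197, 384)
a = torch.softmax(torch.randn(2, 6, 197, 197), dim=-1)
print(prune_by_cls_attention(x, a).shape)  # torch.Size([2, 99, 384])
```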
“…Hard pruning methods filter out some unimportant tokens according to a predefined scoring mechanism. DynamicViT [31], SPViT [21], and AdaViT [29] introduce additional prediction networks to score the tokens. Evo-ViT [45], ATS [16], and EViT [24] utilize the values of class tokens to evaluate the importance of tokens.…”
Section: Related Work (mentioning, confidence: 99%)
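The "additional prediction networks" mentioned in the statement above are lightweight modules that score each token and gate it in a differentiable way during training. The sketch below shows one such scorer under assumed layer sizes and a simple soft-mask formulation; it is not the exact design of DynamicViT, SPViT, or AdaViT.

```python
# Hypothetical sketch: a small MLP predicts a per-token keep probability.
import torch
import torch.nn as nn

class TokenScorer(nn.Module):
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.LayerNorm(dim),
                                 nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, 1))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, D) -> keep probability per token: (B, N)
        return torch.sigmoid(self.mlp(patch_tokens)).squeeze(-1)

scorer = TokenScorer(dim=384)
x = torch.randn(2, 196, 384)
keep_prob = scorer(x)                      # (2, 196), values in (0, 1)
soft_pruned = x * keep_prob.unsqueeze(-1)  # soft masking keeps training differentiable
print(keep_prob.shape, soft_pruned.shape)
```

Soft masking of this kind keeps the pruning decision differentiable during training; at inference the probabilities can be binarized so low-scoring tokens are actually dropped and compute is saved.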
“…ToMe [2] merges similar tokens to reduce the length of the input sequence. The other approaches [10,12] prune tokens into a single token to reduce the length of an input sequence. However, our method processes multiple inputs at the same time, naturally reducing the computational cost.…”
Section: Related Work (mentioning, confidence: 99%)
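Token merging, as attributed to ToMe in the statement above, shortens the sequence by combining similar tokens instead of discarding them. The following is a simplified, hypothetical illustration that merges the single most similar pair by averaging; ToMe itself uses a more efficient bipartite soft matching that merges many pairs per layer.

```python
# Hypothetical sketch: merge the most similar pair of tokens (cosine similarity).
import torch
import torch.nn.functional as F

def merge_most_similar_pair(tokens: torch.Tensor) -> torch.Tensor:
    """tokens: (N, D) -> (N - 1, D) after merging the closest pair."""
    norm = F.normalize(tokens, dim=-1)
    sim = norm @ norm.t()                              # (N, N) cosine similarities
    sim.fill_diagonal_(float("-inf"))                  # ignore self-similarity
    i, j = divmod(int(sim.argmax()), sim.size(1))      # indices of the closest pair
    merged = (tokens[i] + tokens[j]) / 2
    keep = [t for t in range(tokens.size(0)) if t not in (i, j)]
    return torch.cat([tokens[keep], merged.unsqueeze(0)], dim=0)

x = torch.randn(197, 384)
print(merge_most_similar_pair(x).shape)  # torch.Size([196, 384])
```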
“…These improvements, however, came at the cost of rapidly increasing computational burden, with the introduction of Transformer [4,7,20] marking a major milestone in this aspect. With the growing popularity of transformers, methods to reduce their computational costs have become a prominent research topic [1,2,10,12,17,22].…”
(mentioning, confidence: 99%)