2022
DOI: 10.48550/arxiv.2210.08164
Preprint

Linear Video Transformer with Feature Fixation

Abstract: Vision Transformers have achieved impressive performance in video classification, while suffering from the quadratic complexity caused by the Softmax attention mechanism. Some studies alleviate the computational cost by reducing the number of tokens in the attention calculation, but the complexity remains quadratic. Another promising approach is to replace Softmax attention with linear attention, which has linear complexity but shows a clear performance drop. We find that such a drop in linear attention results f…
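The quadratic-vs-linear distinction in the abstract can be sketched in a few lines. The snippet below shows generic kernelized linear attention, not the paper's feature-fixation method (the abstract is truncated before it is described): applying a feature map `phi` to queries and keys and reassociating the matrix product avoids forming the n × n attention matrix, so the cost grows linearly in sequence length n. The choice of `phi` here (shifted ReLU) is an illustrative assumption.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # Standard Softmax attention: the explicit n x n score matrix
    # makes this quadratic in sequence length n.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    # Kernelized linear attention: computing phi(Q) @ (phi(K).T @ V)
    # instead of (phi(Q) @ phi(K).T) @ V never materializes the
    # n x n matrix, giving O(n * d * d_v) cost.
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                # (d, d_v), independent of n
    Z = Qp @ Kp.sum(axis=0)      # (n,) row-wise normalization
    return (Qp @ KV) / Z[:, None]

n, d = 8, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)
```

Both variants map (n, d) inputs to (n, d) outputs; only the association order of the matrix products differs.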

Cited by 2 publications (2 citation statements)
References 65 publications
“…Transformer (Vaswani et al 2017) is advancing steadily in the areas of natural language processing (Qin et al 2023b;Devlin et al 2019;Liu et al 2019;Qin et al 2022b,a;Liu et al 2022;Zhong 2023), computer vision (Dosovitskiy et al 2020;Sun et al 2022b;Lu et al 2022;Hao et al 2024), and audio processing (Gong, Chung, and Glass 2021; Akbari et al 2021;Gulati et al 2020;Sun et al 2022a). Although it outperforms other architectures such as RNNs (Cho et al 2014;Qin, Yang, and Zhong 2023) and CNNs (Kim 2014;Hershey et al 2016;Gehring et al 2017) in many sequence modeling tasks, its lack of length extrapolation capability limits its ability to handle a wide range of sequence lengths, i.e., inference sequences need to be equal to or shorter than training sequences.…”
Section: Introduction
confidence: 99%
“…Transformer [32] is advancing steadily in the areas of natural language processing [4,8,18,27,26,19], computer vision [9,2,31,21], and audio processing [12,1,13,30]. Although it outperforms other architectures such as RNNs [7] and CNNs [16,14,11] in many sequence modeling tasks, its lack of length extrapolation capability limits its ability to handle a wide range of sequence lengths, i.e., inference sequences need to be equal to or shorter than training sequences.…”
Section: Introduction
confidence: 99%