2020
DOI: 10.1007/978-3-030-58529-7_30
RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition

Cited by 58 publications (33 citation statements) · References 26 publications
“…, TSM [3] and TDN [39], with 13% and 15% less computation, respectively. Some extremely lightweight methods, such as TRN [57] and RubiksNet [63], achieve a large reduction in GFLOPs; however, they fail to match the accuracy of other state-of-the-art methods, which indicates that current video analysis models struggle to trade off accuracy against computational cost.…”
Section: B. Comparison With State-of-the-Arts
confidence: 99%
“…The objective of the video encoder is to obtain an embedding vector of size for each video sequence in the batch. We explored two architectures for this task: RubiksNet [31] and TimeSformer [32]. Both use sequences of length ( in the benchmark and our experiments).…”
Section: Proposed Approach
confidence: 99%
“…Representative works include C3D [58], I3D [3], ResNet3D [25], X3D [13], etc. Other works focus on first extracting frame-wise features and then aggregating temporal information with specialized architectures, such as temporal averaging [63], recurrent networks [10,38,73], and temporal channel shift [12,40,48,56]. Another line of work leverages two-stream architectures to model short-term and long-term temporal relationships respectively [14][15][16]22].…”
Section: Related Work
confidence: 99%
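The temporal channel shift cited in the excerpt above (as in TSM, and generalized to learnable 3D shifts by RubiksNet) mixes information across frames at zero FLOP cost by sliding a fraction of the feature channels one step forward or backward in time. A minimal NumPy sketch of the fixed (TSM-style) variant, assuming features of shape (frames, channels) and an illustrative `shift_frac` parameter not taken from the paper:

```python
import numpy as np

def temporal_channel_shift(x, shift_frac=0.25):
    """TSM-style temporal channel shift (illustrative sketch).

    x: array of shape (T, C) — T frames, C channels per frame.
    The first `fold` channels are shifted one frame back in time,
    the next `fold` channels one frame forward, the rest are untouched.
    Vacated positions are zero-padded.
    """
    T, C = x.shape
    fold = int(C * shift_frac)
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]              # shift left: frame t sees frame t+1
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]  # shift right: frame t sees frame t-1
    out[:, 2 * fold:] = x[:, 2 * fold:]         # remaining channels pass through
    return out

# tiny demo: 4 frames, 8 channels
x = np.arange(32, dtype=float).reshape(4, 8)
y = temporal_channel_shift(x)
```

RubiksNet's contribution is to make these shift offsets continuous and learnable in all three spatiotemporal dimensions rather than fixed to ±1 frame as here.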
“…The computational cost of AdaFocusV2+ can be flexibly adjusted without additional training by simply tuning these thresholds. In our implementation, we solve problem (12) on the training set following the method proposed in [28], which we find performs on par with using cross-validation.…”
Section: Training Techniques
confidence: 99%