2020
DOI: 10.1007/978-3-030-58529-7_30
RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition

Cited by 58 publications (33 citation statements) · References 26 publications
“…, TSM [3] and TDN [39], with 13% and 15% less computation, respectively. Some extremely lightweight methods, such as TRN [57] and RubiksNet [63], achieve a large reduction in GFLOPs; however, they fail to match the accuracy of other state-of-the-art methods, which indicates that current video analysis models struggle to trade off accuracy against computational cost.…”
Section: B. Comparison With State-of-the-Arts
confidence: 99%
“…The objective of the video encoder is to obtain an embedding vector of size for each video sequence in the batch. We explored two architectures for this task: RubiksNet [31] and TimeSformer [32]. Both use sequences of length ( in the benchmark and our experiments).…”
Section: Proposed Approach
confidence: 99%
“…Representative works include C3D [58], I3D [3], ResNet3D [25], X3D [13], etc. Other works focus on first extracting frame-wise features and then aggregating temporal information with specialized architectures, such as temporal averaging [63], recurrent networks [10,38,73], and temporal channel shift [12,40,48,56]. Another line of work leverages two-stream architectures to model short-term and long-term temporal relationships respectively [14][15][16]22].…”
Section: Related Work
confidence: 99%
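The temporal channel shift cited in the excerpt above (as in TSM, and generalized to learnable 3D shifts by RubiksNet) mixes information across frames at zero FLOP cost by sliding a fraction of the feature channels one step forward or backward in time. A minimal NumPy sketch of the fixed (TSM-style) variant, assuming features of shape (frames, channels) and an illustrative `shift_frac` parameter not taken from the paper:

```python
import numpy as np

def temporal_channel_shift(x, shift_frac=0.25):
    """TSM-style temporal channel shift (illustrative sketch).

    x: array of shape (T, C) — T frames, C channels per frame.
    The first `fold` channels are shifted one frame back in time,
    the next `fold` channels one frame forward, the rest are untouched.
    Vacated positions are zero-padded.
    """
    T, C = x.shape
    fold = int(C * shift_frac)
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]              # shift left: frame t sees frame t+1
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]  # shift right: frame t sees frame t-1
    out[:, 2 * fold:] = x[:, 2 * fold:]         # remaining channels pass through
    return out

# tiny demo: 4 frames, 8 channels
x = np.arange(32, dtype=float).reshape(4, 8)
y = temporal_channel_shift(x)
```

RubiksNet's contribution is to make these shift offsets continuous and learnable in all three spatiotemporal dimensions rather than fixed to ±1 frame as here.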
“…The computational cost of AdaFocusV2+ can be flexibly adjusted without additional training by simply tuning these thresholds. In our implementation, we solve problem (12) on the training set following the method proposed in [28], which we find performs on par with using cross-validation.…”
Section: Training Techniques
confidence: 99%