2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.00470
3D CNNs with Adaptive Temporal Feature Resolutions

Cited by 19 publications (14 citation statements)
References 21 publications
“…The proposed approach is compared against the top-scoring approaches of the literature on the three employed datasets, specifically, TBN [44], BAT [16], MARS [62], Fast-S3D [38], RMS [64], CGNL [30], ATFR [72], Ada3D [17], TCPNet [45], LgNet [68], ST-VLAD [50], PivotCorrNN [53], LiteEval [57], AdaFrame [54], Listen to Look [56], SCSampler [73], AR-Net [7], SMART [59], ObjectGraphs [5], MARL [55], FrameExit [6] and AdaFocusV2 [19] (note that not all of these works report results for all the datasets used in the present work). The reported results on FCVID, MiniKinetics and ActivityNet are shown in Tables 1, 2 and 3, respectively.…”

mAP(%) results quoted alongside this statement:
AdaFrame [54]                          71.5
Listen to Look [56]                    72.3
LiteEval [57]                          72.7
SCSampler [73]                         72.9
AR-Net [7]                             73.8
FrameExit [6]                          77.3
AdaFocusV2 [19]                        79.0
AR-Net (EfficientNet backbone) [7]     79.7
MARL (ResNet backbone on Kinetics) [55] 82.9
FrameExit (X3D-S backbone) [6]         87
Section: Event Recognition Results (citation type: mentioning)
confidence: 99%
“…In [38], the above work is further extended by adding a feature gating mechanism, which is a simple self-attention operation. In [72], a differentiable similarity guided sampling module is introduced into the architecture of 3D-CNNs; it measures the similarity of temporal feature maps and adaptively adjusts the temporal resolution. In [1], an efficient architecture is proposed, consisting of a 2D-CNN to capture spatial information, two lightweight 1D-CNN-based branches to capture short- and long-term motion dynamics, respectively, and a 3D-CNN feature enhancement module to obtain more fine-grained spatial and temporal cues.…”
Section: Top-down Approaches (citation type: mentioning)
confidence: 99%
“…Differently, Wang et al [127] adopted an efficient learnable correlation operator to better learn motion information from 3D appearance features. Fayyaz et al [128] addressed the problem of dynamically adapting the temporal feature resolution within the 3D CNNs to reduce their computational cost. A Similarity Guided Sampling (SGS) module was proposed to enable 3D CNNs to dynamically adapt their computational resources by selecting the most informative and distinctive temporal features.…”
Section: 3D CNN-based Methods (citation type: mentioning)
confidence: 99%
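The similarity-guided sampling idea described in the statements above can be illustrated with a small sketch: group temporally adjacent feature maps whose cosine similarity is high, then aggregate each group into one representative, so static stretches of video consume fewer temporal slots. This is a hypothetical simplification in plain NumPy with hard bin assignments, not the differentiable SGS module of Fayyaz et al.:

```python
import numpy as np

def similarity_guided_sampling(features, num_bins):
    """Toy sketch of similarity-guided temporal sampling: reduce T
    feature maps to num_bins by binning temporally adjacent, similar
    frames together and averaging each bin. Illustrative only; the
    actual SGS module of Fayyaz et al. is differentiable."""
    T = features.shape[0]
    flat = features.reshape(T, -1)
    unit = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    # cosine similarity between consecutive time steps
    sim = np.sum(unit[1:] * unit[:-1], axis=1)
    # cumulative dissimilarity: similar neighbours accumulate little
    # "change" and therefore land in the same bin
    change = np.concatenate([[0.0], 1.0 - sim])
    cum = np.cumsum(change)
    if cum[-1] > 0:
        bins = np.minimum((cum / cum[-1] * num_bins).astype(int),
                          num_bins - 1)
    else:
        bins = np.zeros(T, dtype=int)  # all frames identical: one bin
    # average the feature maps that fall into each bin
    out = np.stack([features[bins == b].mean(axis=0)
                    if np.any(bins == b)
                    else np.zeros(features.shape[1:])
                    for b in range(num_bins)])
    return out, bins
```

With eight frames whose features change once halfway through, the sketch collapses them into two representative feature maps, which is the intuition behind adaptively lowering the temporal resolution for low-motion content.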
“…Recurrent Neural Networks (RNN) [17,36,72] usually employ 2D CNNs as feature extractors for an LSTM model. 3D CNN-based methods [20,63,64] extend 2D CNNs to 3D structures, to simultaneously model the spatial and temporal context information in videos that is crucial for action recognition.…”
Section: Recognition of Actions and Body Language (citation type: mentioning)
confidence: 99%
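The extension from 2D to 3D convolution mentioned in the last statement can be made concrete with a minimal, naive valid-mode 3D convolution over a (T, H, W) volume: the kernel gains a temporal extent, so a single filter responds to patterns spanning several frames (motion), not just spatial patterns within one frame. This is a single-channel sketch with no deep-learning framework, purely for illustration:

```python
import numpy as np

def conv3d_single(volume, kernel):
    """Naive valid-mode 3D convolution of one (T, H, W) volume with one
    (t, h, w) kernel. Illustrates how 3D CNNs extend 2D convolution
    with a temporal axis; real implementations are vectorised and
    multi-channel."""
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):          # slide over time
        for j in range(out.shape[1]):      # slide over height
            for k in range(out.shape[2]):  # slide over width
                out[i, j, k] = np.sum(
                    volume[i:i + t, j:j + h, k:k + w] * kernel)
    return out
```

Setting t = 1 recovers an ordinary per-frame 2D convolution, which is exactly the degenerate case the quoted statement contrasts 3D CNNs against.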