2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.00470
3D CNNs with Adaptive Temporal Feature Resolutions

Cited by 19 publications (14 citation statements)
References 21 publications
“…The proposed approach is compared against the top-scoring approaches of the literature on the three employed datasets, specifically, TBN [44], BAT [16], MARS [62], Fast-S3D [38], RMS [64], CGNL [30], ATFR [72], Ada3D [17], TCPNet [45], LgNet [68], ST-VLAD [50], PivotCorrNN [53], LiteEval [57], AdaFrame [54], Listen to Look [56], SCSampler [73], AR-Net [7], SMART [59], ObjectGraphs [5], MARL [55], FrameExit [6] and AdaFocusV2 [19] (note that not all of these works report results for all the datasets used in the present work). The reported results on FCVID, MiniKinetics and ActivityNet are shown in Tables 1, 2 and 3, respectively.…”

mAP(%) results quoted alongside this statement:
AdaFrame [54]                          71.5
Listen to Look [56]                    72.3
LiteEval [57]                          72.7
SCSampler [73]                         72.9
AR-Net [7]                             73.8
FrameExit [6]                          77.3
AdaFocusV2 [19]                        79.0
AR-Net (EfficientNet backbone) [7]     79.7
MARL (ResNet backbone on Kinetics) [55] 82.9
FrameExit (X3D-S backbone) [6]         87
Section: Event Recognition Results (citation type: mentioning)
confidence: 99%
“…In [38], the above work is further extended by adding a feature gating mechanism, which is a simple self-attention operation. In [72], a differentiable similarity guided sampling module is introduced into the architecture of 3D-CNNs; it measures the similarity of temporal feature maps and adaptively adjusts the temporal resolution. In [1], an efficient architecture is proposed, consisting of a 2D-CNN to capture spatial information, two lightweight 1D-CNN-based branches to capture short- and long-term motion dynamics, respectively, and a 3D-CNN feature enhancement module to obtain more fine-grained spatial and temporal cues.…”
Section: Top-down Approaches (citation type: mentioning)
confidence: 99%
“…Differently, Wang et al [127] adopted an efficient learnable correlation operator to better learn motion information from 3D appearance features. Fayyaz et al [128] addressed the problem of dynamically adapting the temporal feature resolution within the 3D CNNs to reduce their computational cost. A Similarity Guided Sampling (SGS) module was proposed to enable 3D CNNs to dynamically adapt their computational resources by selecting the most informative and distinctive temporal features.…”
Section: 3D CNN-based Methods (citation type: mentioning)
confidence: 99%
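The similarity-guided sampling idea described in the statements above can be illustrated with a small sketch: group temporally adjacent feature maps whose cosine similarity is high, then aggregate each group into one representative, so static stretches of video consume fewer temporal slots. This is a hypothetical simplification in plain NumPy with hard bin assignments, not the differentiable SGS module of Fayyaz et al.:

```python
import numpy as np

def similarity_guided_sampling(features, num_bins):
    """Toy sketch of similarity-guided temporal sampling: reduce T
    feature maps to num_bins by binning temporally adjacent, similar
    frames together and averaging each bin. Illustrative only; the
    actual SGS module of Fayyaz et al. is differentiable."""
    T = features.shape[0]
    flat = features.reshape(T, -1)
    unit = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    # cosine similarity between consecutive time steps
    sim = np.sum(unit[1:] * unit[:-1], axis=1)
    # cumulative dissimilarity: similar neighbours accumulate little
    # "change" and therefore land in the same bin
    change = np.concatenate([[0.0], 1.0 - sim])
    cum = np.cumsum(change)
    if cum[-1] > 0:
        bins = np.minimum((cum / cum[-1] * num_bins).astype(int),
                          num_bins - 1)
    else:
        bins = np.zeros(T, dtype=int)  # all frames identical: one bin
    # average the feature maps that fall into each bin
    out = np.stack([features[bins == b].mean(axis=0)
                    if np.any(bins == b)
                    else np.zeros(features.shape[1:])
                    for b in range(num_bins)])
    return out, bins
```

With eight frames whose features change once halfway through, the sketch collapses them into two representative feature maps, which is the intuition behind adaptively lowering the temporal resolution for low-motion content.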
“…Recurrent Neural Networks (RNN) [17,36,72] usually employ 2D CNNs as feature extractors for an LSTM model. 3D CNN-based methods [20,63,64] extend 2D CNNs to 3D structures, to simultaneously model the spatial and temporal context information in videos that is crucial for action recognition.…”
Section: Recognition of Actions and Body Language (citation type: mentioning)
confidence: 99%
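The extension from 2D to 3D convolution mentioned in the last statement can be made concrete with a minimal, naive valid-mode 3D convolution over a (T, H, W) volume: the kernel gains a temporal extent, so a single filter responds to patterns spanning several frames (motion), not just spatial patterns within one frame. This is a single-channel sketch with no deep-learning framework, purely for illustration:

```python
import numpy as np

def conv3d_single(volume, kernel):
    """Naive valid-mode 3D convolution of one (T, H, W) volume with one
    (t, h, w) kernel. Illustrates how 3D CNNs extend 2D convolution
    with a temporal axis; real implementations are vectorised and
    multi-channel."""
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):          # slide over time
        for j in range(out.shape[1]):      # slide over height
            for k in range(out.shape[2]):  # slide over width
                out[i, j, k] = np.sum(
                    volume[i:i + t, j:j + h, k:k + w] * kernel)
    return out
```

Setting t = 1 recovers an ordinary per-frame 2D convolution, which is exactly the degenerate case the quoted statement contrasts 3D CNNs against.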