2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.00610
Deep Analysis of CNN-based Spatio-temporal Representations for Action Recognition

Cited by 70 publications (45 citation statements) · References 54 publications
“…Using the same backbone network (ResNet-18), TCL needs only 15% and 33% labeled data on Jester and Mini-Something-V2, respectively, to reach the performance of the fully supervised approach. Likewise, we observe absolute improvements of up to 8.14% and 4.63% in activity recognition over the next best approach, FixMatch [46] (NeurIPS'20), using only 5% labeled data on the Mini-Something-V2 [9] and Kinetics-400 [31] datasets, respectively. We benchmark several baselines by extending state-of-the-art image-domain semi-supervised approaches to videos and will release their code along with that of TCL upon publication.…”
Section: Percentage of Labeled Data
confidence: 62%
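The FixMatch baseline referenced above trains on confidently pseudo-labeled, weakly augmented inputs while enforcing consistency on strongly augmented views of the same samples. Below is a minimal sketch of that objective applied to video clips, assuming a generic PyTorch classifier `model`; the augmented inputs and the 0.95 threshold are illustrative placeholders, not TCL's or the cited paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, clips_weak, clips_strong, threshold=0.95):
    """FixMatch-style consistency loss on a batch of unlabeled video clips.

    clips_weak / clips_strong: the same clips under weak and strong
    augmentation, each of shape (B, C, T, H, W).
    """
    with torch.no_grad():
        probs = F.softmax(model(clips_weak), dim=1)  # pseudo-label source
        conf, pseudo = probs.max(dim=1)              # per-clip confidence and label
        mask = (conf >= threshold).float()           # keep only confident clips
    logits_strong = model(clips_strong)
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (loss * mask).mean()                      # averaged over the whole batch
```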
“…Datasets. We evaluate our approach using four datasets, namely Mini-Something-V2 [9], Jester [36], Kinetics-400 [31] and Charades-Ego [43]. Mini-Something-V2 is a subset of Something-Something V2 dataset [23] containing 81K training videos and 12K testing videos across 87 action classes.…”
Section: Methods
confidence: 99%
“…Given the growing demand for eHealth apps, it is surprising that there is not a larger body of work on estimating the physical intensity of activities in videos. This might be because video classification research has focused mostly on activity categorization [6,8,13,15,17], while virtually all exercise intensity assessment datasets rely on wearable sensors [2,5,39] delivering, e.g., heart rate or accelerometer signals. To promote the task of visually estimating the kilocalories burned per hour by a person during the current activity, we introduce the novel Vid2Burn dataset, featuring >9K videos of 72 different activity types with caloric expenditure annotations at both category and sample level.…”
Section: Vid2Burn: A Benchmark for Estimating Caloric Expenditure in ...
confidence: 99%
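The Vid2Burn excerpt frames caloric estimation as video regression rather than categorization. The sketch below is a hypothetical illustration of that framing only; the excerpt does not describe the dataset authors' actual model, and the backbone and feature dimension here are assumptions: any spatio-temporal encoder with a scalar head trained against kcal-per-hour annotations.

```python
import torch
import torch.nn as nn

class CaloricRegressor(nn.Module):
    """Hypothetical video-to-calorie regressor: a spatio-temporal backbone
    followed by a scalar head predicting kcal/hour."""
    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone           # assumed to map clips to (B, feat_dim)
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, clips):              # clips: (B, C, T, H, W)
        return self.head(self.backbone(clips)).squeeze(-1)  # (B,) kcal/hour

# Training would minimize a regression loss against caloric annotations, e.g.:
# loss = nn.functional.l1_loss(model(clips), kcal_targets)
```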
“…CNN-based Action Recognition. Action recognition has recently been dominated by CNN-based models [17,6,15,16,7,26,47,57,28,20,46]. These models process the video as a spatio-temporal cube, extracting spatio-temporal features via their proposed temporal modeling methods.…”
Section: Related Work
confidence: 99%
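To make "processing the video as a cube" concrete, here is a minimal PyTorch example of a single spatio-temporal (3D) convolution over a clip laid out as (channels, time, height, width); the shapes are illustrative.

```python
import torch
import torch.nn as nn

# A batch of clips treated as 4D "cubes" per sample: (B, C, T, H, W).
clips = torch.randn(2, 3, 16, 112, 112)

# A 3x3x3 kernel slides jointly over time and space, so motion and
# appearance are mixed in a single operation.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
features = conv3d(clips)  # -> torch.Size([2, 64, 16, 112, 112])
```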
“…
Model                   Top-1  Top-5
TRN-Inception [57]      28.3   53.9
TAM-R50 [15]            30.8   58.2
I3D-R50 [7]             31.2   58.9
SlowFast-R50-8×8 [17]   31.2   58.7
CoST-R101 [24]          32.4   60.0
SRTG-R3D-101 [41]       33.6   58.5
AssembleNet [37]        33.9   60.9
ViViT-L [1]             38.0   64.9
SIFAR-15 ‡              38.5   67.4
SIFAR-12 ‡              39.9   69.2
…”
Section: Model
confidence: 99%