2022
DOI: 10.1109/tpami.2021.3126682

Multi-Moments in Time: Learning and Interpreting Models for Multi-Action Video Understanding

Cited by 29 publications (33 citation statements) · References 24 publications
“…Dataset. We evaluate the proposed DEAR method on three commonly used real-world video action datasets, including UCF-101 [55], HMDB-51 [31], and MiT-v2 [39]. All models are trained on UCF-101 training split.…”
Section: Methods (mentioning confidence: 99%)
“…Evaluation Protocol. To evaluate the classification per- [31] and MiT-v2 [39], respectively. For Open maF1 scores, both the mean and standard deviation of 10 random trials of unknown class selection are reported.…”
Section: Methods (mentioning confidence: 99%)
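The quote above describes an open-set evaluation protocol in which Open maF1 is reported as the mean and standard deviation over 10 random selections of unknown classes. The sketch below shows one plausible way such a protocol could be implemented; the model's rejection interface (`predict_open`) and the label handling are hypothetical assumptions for illustration, not the cited paper's code.

```python
# A minimal sketch of "Open maF1 over 10 random unknown-class splits",
# assuming a hypothetical model with a `predict_open` rejection interface.
import numpy as np
from sklearn.metrics import f1_score

def evaluate_open_maf1(model, features, labels, all_classes,
                       n_unknown, n_trials=10, seed=0):
    """Return mean and std of macro-F1 over random unknown-class selections."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_trials):
        # Randomly choose which classes play the role of "unknown" in this trial.
        unknown = set(rng.choice(all_classes, size=n_unknown, replace=False).tolist())
        unknown_label = max(all_classes) + 1
        # Ground truth: collapse every unknown class into a single rejection bin.
        y_true = np.array([unknown_label if y in unknown else y for y in labels])
        # Hypothetical model call: returns a known-class prediction, or
        # `unknown_label` when the sample is rejected as unknown.
        y_pred = np.array([model.predict_open(x, unknown_label) for x in features])
        scores.append(f1_score(y_true, y_pred, average="macro"))
    return float(np.mean(scores)), float(np.std(scores))
```

Reporting both the mean and the standard deviation, as in the quoted protocol, accounts for the variance introduced by which classes happen to be held out as unknown.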
“…More recent datasets consider videos at an atomic level, with fine-grained temporal annotations from short snippets of longer videos [25,49,84]. In particular, Multi-Moments in Time [50] provides 2M action labels for 1M short clips of 3s, classified into 313 annotated action classes. Something-Something [24] collects more than 100k videos annotated with 147 classes of daily human-object interactions.…”
Section: Related Work (mentioning confidence: 99%)