2021 IEEE Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/wacv48630.2021.00058

Only Time Can Tell: Discovering Temporal Data for Temporal Modeling

Cited by 43 publications (29 citation statements)
References 21 publications
“…We followed the same protocol: 64/12/24 classes and 13063/2210/4472 videos for meta-training, meta-validation and meta-testing respectively. Whilst Kinetics is one of the most commonly evaluated datasets, visual appearance and background encapsulate most class-related information rather than motion patterns [21]. With less need for temporal modeling and involving coarse-grained action classes, it presents a relatively easy action classification task.…”
Section: Methods
confidence: 99%
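The 64/12/24-class, 13063/2210/4472-video protocol described above partitions the selected Kinetics classes into disjoint meta-training, meta-validation, and meta-testing sets. A minimal sketch of such a class-level partition (class names here are placeholders, not the actual split used in the cited work):

```python
import random

def split_classes(classes, n_train=64, n_val=12, n_test=24, seed=0):
    """Partition a list of class names into disjoint meta-train /
    meta-val / meta-test sets, as in few-shot video protocols."""
    assert len(classes) == n_train + n_val + n_test
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = classes[:]
    rng.shuffle(shuffled)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Placeholder class names; the real protocol uses 100 Kinetics classes.
classes = [f"class_{i:03d}" for i in range(100)]
train, val, test = split_classes(classes)
```

The split is over classes, not videos: the model must generalize to action categories never seen during meta-training.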
“…UCF101 [23] and HMDB51 [17]), our proposed TRX primarily focuses on fine-grained actions where temporal information is required. Several works [14,21,15] showcased these traditional datasets to be appearance-based with a single-frame or shuffled frames sufficient to recognise the action. SSv2, in particular, has been shown to require temporal reasoning (e.g.…”
Section: Setup
confidence: 99%
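The shuffled-frames diagnostic mentioned above can be sketched as follows. This is a hedged illustration, not the cited papers' exact protocol: evaluate a clip classifier on frame-permuted inputs, and if accuracy barely drops, the model is relying on appearance rather than temporal order. `model` is a placeholder for any callable that scores a clip's correctness.

```python
import random

def shuffled_clip(frames, seed=0):
    """Return a copy of the clip with its frame order randomly permuted."""
    rng = random.Random(seed)
    permuted = frames[:]
    rng.shuffle(permuted)
    return permuted

def order_sensitivity(model, clips):
    """Accuracy gap between ordered and frame-shuffled inputs.
    `model(clip)` is assumed to return 1.0 if the clip is classified
    correctly, else 0.0 (a hypothetical correctness oracle)."""
    ordered = sum(model(c) for c in clips) / len(clips)
    shuffled = sum(model(shuffled_clip(c)) for c in clips) / len(clips)
    return ordered - shuffled
```

A gap near zero suggests the dataset (or model) does not require temporal reasoning; datasets like SSv2 are expected to show a large gap.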
“…Another work on explainability for video models is by Price et al [38], but only one type of model, and its decisions, is studied (TRN [54]). We are connected to the work of Sevilla-Lara et al [40], who discuss the risk that models with strong image modeling abilities may prioritize those cues over the temporal modeling cues. Similar to the findings of Geirhos et al [16], Sevilla-Lara et al find that inflated convolutions tend to learn classes better where motion is less important, and that generalization can be helped by training on more temporally focused data (in analogy to training on shape-based data in [16]).…”
Section: Related Work
confidence: 99%