2017
DOI: 10.48550/arxiv.1708.05038
Preprint

ConvNet Architecture Search for Spatiotemporal Feature Learning

Abstract: Learning image representations with ConvNets by pretraining on ImageNet has proven useful across many visual understanding tasks, including object detection, semantic segmentation, and image captioning. Although any image representation can be applied to video frames, a dedicated spatiotemporal representation is still vital in order to incorporate motion patterns that cannot be captured by appearance-based models alone. This paper presents an empirical ConvNet architecture search for spatiotemporal feature lear…


Cited by 99 publications (102 citation statements)
References 34 publications
“…Classical models for video action recognition [10,12,17,19,26,66,80,88,89,91] aim to predict action categories but largely ignore the order of actions, because they rely on simple frame-feature aggregation such as pooling. Our task, in contrast, requires verifying two videos that differ by both large and subtle step-level transformations.…”
Section: Methods
confidence: 99%
“…Traditional action-related tasks such as action recognition, action detection, and action segmentation have advanced greatly thanks to progress in CNNs. As a means of general video representation, deep-learning-based action recognition can broadly be grouped into stream-based methods [10,12,17,19,26,52,66,80,88,89,91,105] and skeleton-based methods [21,81,93,99]. Both kinds of methods produce a feature representation for each trimmed video, from which a video-level label over predefined action categories is predicted.…”
Section: Related Work
confidence: 99%
“…It is well known that optical flow is computed from the RGB frames, which is time-consuming and introduces a bottleneck. The second class is based on a family of 3D convolutional networks, such as C3D [16], I3D [7], T3D [38], and Res3D [39], which extend 2D networks into the spatiotemporal dimension. To reduce the computational cost of general 3D convolutional networks, Qiu et al. [40] proposed the Pseudo-3D residual network (P3D), which decomposes 3D convolutions into separate 2D spatial and 1D temporal filters.…”
Section: RGB-Based Action Recognition
confidence: 99%
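The 2D-spatial + 1D-temporal decomposition cited above can be sketched numerically. The helpers below are a minimal illustration (hypothetical function names, no deep-learning framework): a full t × d × d 3D kernel is replaced by a 1 × d × d spatial filter followed by a t × 1 × 1 temporal filter, with the intermediate width M chosen — following the parameter-matching rule used in the R(2+1)D paper — so the factorized pair has roughly the same parameter count as the full 3D kernel.

```python
def conv3d_params(c_in, c_out, t, d):
    """Parameter count of a full t x d x d 3D convolution (no bias)."""
    return c_in * c_out * t * d * d

def conv2plus1d_params(c_in, c_out, t, d):
    """Parameter count of the factorized 2D-spatial + 1D-temporal pair.

    The intermediate channel count M is chosen to approximately match
    the parameter budget of the full 3D kernel, as in R(2+1)D.
    """
    m = (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)
    spatial = c_in * m * d * d   # 1 x d x d spatial filter
    temporal = m * c_out * t     # t x 1 x 1 temporal filter
    return spatial + temporal, m

full = conv3d_params(64, 64, t=3, d=3)
factored, m = conv2plus1d_params(64, 64, t=3, d=3)
print(full, factored, m)  # → 110592 110592 144
```

With 64 input and output channels and a 3 × 3 × 3 kernel, M = 144 makes the factorized block match the full 3D kernel's 110,592 parameters exactly, while adding an extra nonlinearity between the spatial and temporal stages.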
“…C3D [30] uses 3D kernels to extract short-term information from the RGB frame input. R(2+1)D [31] adds skip connections to C3D and explores different combinations of 3D and 2D convolutions. I3D [2] inflates the 2D convolutional and pooling kernels of a 2D CNN trained on image datasets into 3D, reusing the well-trained 2D parameters.…”
Section: Action Recognition
confidence: 99%
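The I3D inflation step mentioned above can be sketched as follows. This is a minimal NumPy illustration with assumed tensor shapes, not the actual I3D code: a pretrained 2D kernel is replicated T times along a new temporal axis and divided by T, so that a video of T identical frames produces the same activations the image model did.

```python
import numpy as np

def inflate_2d_kernel(w2d, t):
    """Inflate a 2D kernel (c_out, c_in, d, d) into a 3D kernel
    (c_out, c_in, t, d, d) by replicating along time and rescaling by 1/t,
    preserving activations on a temporally constant ("boring") video."""
    return np.repeat(w2d[:, :, None, :, :], t, axis=2) / t

w2d = np.ones((8, 3, 7, 7), dtype=np.float32)
w3d = inflate_2d_kernel(w2d, t=5)
print(w3d.shape)  # → (8, 3, 5, 7, 7)
# Summing over the temporal axis recovers the original 2D kernel,
# which is what keeps the pretrained responses unchanged.
print(np.allclose(w3d.sum(axis=2), w2d))  # → True
```

The 1/t rescaling is the key detail: without it, each output of the inflated network on a static video would be t times larger than the 2D network's output, invalidating the transferred ImageNet weights.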