2018 IEEE Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/wacv.2018.00179

ActionFlowNet: Learning Motion Representation for Action Recognition

Cited by 96 publications (88 citation statements)
References 17 publications

“…Table 8 shows the results. MFNet [15] captures motion by spatially shifting CNN feature maps, then summing the results; TVNet [5] applies a convolutional optical flow method to RGB inputs; and ActionFlowNet [16] trains a CNN to jointly predict optical flow and activity classes. (Table 8 excerpt: ActionFlowNet 52.5 / 56.8; TVNet 39.4 / 57.5; RGB-OFF 55.6 / 56.9; Ours 61.1 / 65.4.) We also compare to OFF [21] using only RGB inputs.…”
Section: Flow-of-flow (mentioning)
confidence: 99%
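As a rough illustration of the joint flow-and-action prediction idea attributed to ActionFlowNet in the statement above, here is a minimal PyTorch sketch of a shared encoder with two heads; the module name, layer sizes, and loss weighting are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class JointFlowActionNet(nn.Module):
    """Toy two-head CNN: one head regresses optical flow, the other classifies actions.
    Layer sizes are illustrative only, not the ActionFlowNet architecture."""
    def __init__(self, num_classes=51):
        super().__init__()
        # shared encoder over a pair of stacked RGB frames (6 input channels)
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # flow head: 2-channel (u, v) field at 1/8 resolution
        self.flow_head = nn.Conv2d(256, 2, kernel_size=3, padding=1)
        # action head: global pooling + linear classifier
        self.action_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, num_classes)
        )

    def forward(self, frame_pair):
        feats = self.encoder(frame_pair)
        return self.flow_head(feats), self.action_head(feats)

# joint loss: flow supervision (e.g. from a classical flow method) plus action labels
model = JointFlowActionNet()
frames = torch.randn(4, 6, 224, 224)   # batch of stacked frame pairs
flow_gt = torch.randn(4, 2, 28, 28)    # pseudo ground-truth flow at 1/8 resolution
labels = torch.randint(0, 51, (4,))
pred_flow, logits = model(frames)
loss = nn.functional.mse_loss(pred_flow, flow_gt) + nn.functional.cross_entropy(logits, labels)
loss.backward()
```

Training both heads against a single shared encoder is what lets the motion supervision shape the features that the classifier sees.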
“…Finally, it is worth noting the self-supervised works that "harvest" training data from unlabeled sources for action recognition. Fernando et al. [12] and Mishra et al. [28] shuffle video frames and treat the resulting clips as positive/negative training data; Sharma et al. [34] mine labels using a similarity-based distance matrix, although for video face clustering; Wei et al. [51] divide a clip into non-overlapping 10-frame chunks and then predict their temporal order; Ng et al. [29] estimate optical flow while recognizing actions. We compare all these methods against our unsupervised, future-frame-prediction-based ConvNet training in the experimental section.…”
Section: Background and Related Work (mentioning)
confidence: 99%
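As a schematic of the frame-shuffling pretext task mentioned above, the sketch below builds order-verification samples from unlabeled clips; the function name, clip shape, and 50/50 shuffling probability are illustrative assumptions rather than the exact protocol of the cited works.

```python
import random
import torch

def make_order_pretext_sample(clip, shuffle_prob=0.5):
    """Create a (clip, label) pair for a frame-order pretext task.

    clip: tensor of shape (T, C, H, W). With probability `shuffle_prob` the
    temporal order is permuted and the sample is labeled 0 (negative);
    otherwise the original order is kept and labeled 1 (positive).
    Schematic only, not the exact protocol of the cited works.
    """
    if random.random() < shuffle_prob:
        perm = torch.randperm(clip.shape[0])  # random temporal permutation
        return clip[perm], 0
    return clip, 1

# usage: turn unlabeled clips into binary "is the order correct?" training data
clip = torch.randn(16, 3, 112, 112)
sample, label = make_order_pretext_sample(clip)
```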
“…However, the major obstacle stems from the lack of high-quality training data. To mitigate this data scarcity, some works train an optical flow model on synthesized datasets [18], while others predict video labels end-to-end to improve accuracy [26,19]. In addition, optimization ideas from traditional methods have been integrated into the design of the neural networks.…”
Section: Related Work (mentioning)
confidence: 99%
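To make concrete what integrating a traditional optimizer into a network can look like, the following is a heavily simplified, hypothetical sketch of one unrolled brightness-constancy update implemented as a differentiable PyTorch step; it is not the TVNet formulation or any model cited above.

```python
import torch
import torch.nn.functional as F

def unrolled_flow_step(flow, img1, img2, step=0.1):
    """One differentiable update of a brightness-constancy flow estimate.

    img1, img2: (B, 1, H, W) grayscale frames; flow: (B, 2, H, W) as (u, v).
    Simplified stand-in for unrolling a classical optical-flow optimizer
    into network layers; not TVNet itself.
    """
    B, _, H, W = img1.shape
    # build a sampling grid displaced by the current flow and warp img2 toward img1
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().unsqueeze(0).expand(B, -1, -1, -1)
    disp = grid + flow.permute(0, 2, 3, 1)
    # normalize coordinates to [-1, 1] for grid_sample
    disp = torch.stack((2 * disp[..., 0] / (W - 1) - 1,
                        2 * disp[..., 1] / (H - 1) - 1), dim=-1)
    warped = F.grid_sample(img2, disp, align_corners=True)
    residual = warped - img1                       # brightness-constancy violation
    # spatial image gradients of the warped frame (simple finite differences)
    gx = F.pad(warped[..., :, 1:] - warped[..., :, :-1], (0, 1))
    gy = F.pad(warped[..., 1:, :] - warped[..., :-1, :], (0, 0, 0, 1))
    # gradient-descent update on the data term
    return flow - step * torch.cat((residual * gx, residual * gy), dim=1)

# usage: unroll a fixed number of iterations, starting from zero flow
img1, img2 = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)
flow = torch.zeros(1, 2, 64, 64)
for _ in range(5):
    flow = unrolled_flow_step(flow, img1, img2)
```

Stacking a fixed number of such steps, optionally with learnable step sizes and smoothness terms, yields a flow "layer" that can be trained end-to-end together with an action classifier.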