Learning Spatio-Temporal Representation With Local and Global Diffusion

Qiu, Zhaofan; Yao, Ting; Ngo, Chong‐Wah; Tian, Xinmei; Mei, Tao

doi:10.1109/cvpr.2019.01233

Cited by 175 publications

(98 citation statements)

References 46 publications


“…Table 6 shows comparison results with conventional methods on the UCF-101 dataset. LGD-3D Two-stream and PoTion + I3D showed similar accuracies to that of the proposed method, but the accuracy of the proposed method was higher on other datasets [6, 36].…”

Section: Results (mentioning)

confidence: 86%

Enhanced Action Recognition Using Multiple Stream Deep Learning with Optical Flow and Weighted Sum

Kim

Park

et al. 2020

Sensors


Various action recognition approaches have recently been proposed with the aid of three-dimensional (3D) convolution and a multiple stream structure. However, existing methods are sensitive to background and optical flow noise, which prevents them from learning the main object in a video frame. Furthermore, they cannot reflect the accuracy of each stream in the process of combining multiple streams. In this paper, we present a novel action recognition method that improves the existing method using optical flow and a multi-stream structure. The proposed method consists of two parts: (i) an optical flow enhancement process using image segmentation and (ii) a score fusion process that applies a weighted sum of the accuracy. The enhancement process helps the network efficiently analyze the flow information of the main object in the optical flow frame, thereby improving accuracy. The proposed score fusion method lets the fused score reflect the differing accuracy of each stream. We achieved an accuracy of 98.2% on UCF-101 and 82.4% on HMDB-51. The proposed method outperformed many state-of-the-art methods without changing the network structure, and it is expected to be easily applied to other networks.
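The score fusion step in this abstract can be illustrated with a small sketch. The abstract only states that stream scores are combined by a weighted sum based on each stream's accuracy; the normalization of the weights, the function name `fuse_scores`, and the example numbers below are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def fuse_scores(stream_scores: Sequence[Sequence[float]],
                stream_accuracies: Sequence[float]) -> List[float]:
    """Accuracy-weighted sum of per-class scores from several streams.

    Assumption: weights are the stream accuracies normalized to sum to 1.
    The abstract does not specify the exact weighting formula.
    """
    if len(stream_scores) != len(stream_accuracies):
        raise ValueError("need exactly one accuracy per stream")
    total = sum(stream_accuracies)
    if total <= 0:
        raise ValueError("accuracies must sum to a positive value")
    weights = [a / total for a in stream_accuracies]

    num_classes = len(stream_scores[0])
    fused = [0.0] * num_classes
    for w, scores in zip(weights, stream_scores):
        if len(scores) != num_classes:
            raise ValueError("all streams must score the same classes")
        for k, s in enumerate(scores):
            fused[k] += w * s
    return fused


# Hypothetical example: an RGB stream and an optical-flow stream, 3 classes.
rgb_scores = [0.6, 0.3, 0.1]
flow_scores = [0.2, 0.7, 0.1]
fused = fuse_scores([rgb_scores, flow_scores], stream_accuracies=[0.9, 0.6])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

With these made-up numbers the more accurate stream gets the larger weight, but the fused prediction can still differ from that stream's own top class when the other stream is confident enough.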


“…uses a shared network of 2D CNNs over three orthogonal views of video to obtain spatial and temporal signals for action recognition. (Qiu et al., 2019) adopts a two-path network architecture that integrates global and local information of both temporal and spatial dimensions for video classification. Other research areas that investigate spatio-temporal learning include video captioning (Aafaq et al., 2019), video super-resolution (Li et al., 2019b), and video object segmentation (Xu et al., 2019).…”

Section: Related Work (mentioning)

confidence: 99%

BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues

Lê

Chen

Hoi

2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)


Video-grounded dialogues are very challenging due to (i) the complexity of videos which contain both spatial and temporal variations, and (ii) the complexity of user utterances which query different segments and/or different objects in videos over multiple dialogue turns. However, existing approaches to video-grounded dialogues often focus on superficial temporal-level visual cues, but neglect more fine-grained spatial signals from videos. To address this drawback, we propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos based on textual cues. Specifically, our approach not only exploits both spatial and temporal-level information, but also learns dynamic information diffusion between the two feature spaces through spatial-to-temporal and temporal-to-spatial reasoning. The bidirectional strategy aims to tackle the evolving semantics of user queries in the dialogue setting. The retrieved visual cues are used as contextual information to construct relevant responses to the users. Our empirical results and comprehensive qualitative analysis show that BiST achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark. We also adapt our BiST models to the Video QA setting, and substantially outperform prior approaches on the TGIF-QA benchmark.


“…$I_{n=c} \log(p_n)$, (6) where $I_{n=c}$ is an indicator function that equals 1 if $n$ is the ground-truth class label $c$ and 0 otherwise. For location regression, we employ the Smooth L1 loss ($S_{L1}$) to force the proposal $(\phi_c, \phi_w)$ to move towards its closest ground-truth proposal $(g_c, g_w)$.…”

Section: Training and Inference (mentioning)

confidence: 99%
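The two loss terms named in the training-and-inference statement above can be sketched as follows. The quoted equation (6) is truncated, so the leading minus sign and the sum over classes are assumptions based on the usual cross-entropy form. The Smooth L1 definition is the standard one with threshold 1. Reading the two proposal components as centre and width is a guess from the subscripts, and all function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def classification_loss(probs: Sequence[float], c: int) -> float:
    """Assumed form: -sum_n I[n == c] * log(p_n).

    The indicator keeps only the term for the ground-truth class c.
    """
    return -sum(math.log(p) for n, p in enumerate(probs) if n == c)


def smooth_l1(x: float) -> float:
    """Standard Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def location_loss(phi: Tuple[float, float], g: Tuple[float, float]) -> float:
    """Smooth L1 between a proposal (phi_c, phi_w) and its closest
    ground-truth proposal (g_c, g_w), summed over both components."""
    return smooth_l1(phi[0] - g[0]) + smooth_l1(phi[1] - g[1])


# Hypothetical values: 3 classes with ground truth class 1, one proposal.
cls = classification_loss([0.25, 0.5, 0.25], c=1)
loc = location_loss(phi=(0.5, 3.0), g=(0.0, 1.0))
```

How the two terms are weighted against each other in the total objective is not stated in the quoted passage, so the sketch leaves them separate.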

“…With the tremendous increase in online and personal media archives, people are generating, storing, and consuming a large collection of videos. This trend encourages the development of effective and efficient algorithms to intelligently parse video data [1,2,3,4,5,6] and discover semantic information [7,8]. One fundamental challenge underlying the success of these advances is action detection from videos in both temporal [9,10] and spatio-temporal aspects [11].…”

Section: Introduction (mentioning)

confidence: 99%

Decoupling Localization and Classification in Single Shot Temporal Action Detection

Huang

Dai

2019

2019 IEEE International Conference on Multimedia and Expo (ICME)


Video temporal action detection aims to temporally localize and recognize the action in untrimmed videos. Existing one-stage approaches mostly focus on unifying two subtasks, i.e., localization of action proposals and classification of each proposal, through a fully shared backbone. However, such a design of encapsulating all components of the two subtasks in one single network might restrict training by ignoring the specialized characteristics of each subtask. In this paper, we propose a novel Decoupled Single Shot temporal Action Detection (Decouple-SSAD) method to mitigate this problem by decoupling the localization and classification in a one-stage scheme. Particularly, two separate branches are designed in parallel to enable each component to own representations privately for accurate localization or classification. Each branch produces a set of action anchor layers by applying deconvolution to the feature maps of the main stream. High-level semantic information from deeper layers is thus incorporated to enhance the feature representations. We conduct extensive experiments on the THUMOS14 dataset and demonstrate superior performance over state-of-the-art methods. Our code is available online.
