One-Shot Action Localization by Learning Sequence Matching Network

Yang, Hongtao; He, Xuming; Porikli, Fatih

doi:10.1109/cvpr.2018.00157

Cited by 50 publications

(80 citation statements)

References 26 publications


“…Video Re-localization [13] aims to find segments in reference videos semantically corresponding to a given query video. A more specialized task, one-shot action localization [52], focuses on the temporal detection of actions in videos giving an example. The STVR task to be solved in this paper is an extension of temporal video re-localization.…”

Section: Related Work (mentioning)

confidence: 99%

Spatio-Temporal Video Re-Localization by Warp LSTM

Feng, Liu, et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)


The need for efficiently finding the video content a user wants is increasing because of the erupting of user-generated videos on the Web. Existing keyword-based or content-based video retrieval methods usually determine what occurs in a video but not when and where. In this paper, we make an answer to the question of when and where by formulating a new task, namely spatio-temporal video re-localization. Specifically, given a query video and a reference video, spatio-temporal video re-localization aims to localize tubelets in the reference video such that the tubelets semantically correspond to the query. To accurately localize the desired tubelets in the reference video, we propose a novel warp LSTM network, which propagates the spatiotemporal information for a long period and thereby captures the corresponding long-term dependencies. Another issue for spatio-temporal video re-localization is the lack of properly labeled video datasets. Therefore, we reorganize the videos in the AVA dataset to form a new dataset for spatio-temporal video re-localization research. Extensive experimental results show that the proposed model achieves superior performances over the designed baselines on the spatio-temporal video re-localization task.



“…The action localization problem has been well-studied in the computer vision literature [13], [14]. The most closely related work is that of [15], in which they propose a similar few-shot action localization problem and solve it through a meta-learning framework. However, their approach requires a specialized network architecture (a full context embedding network), whereas our approach is fully general, allowing the flexibility of choosing any network architecture.…”

Section: Related Work (mentioning)

confidence: 99%

One-Shot Learning of Multi-Step Tasks from Observation via Activity Localization in Auxiliary Video

Goo, Niekum 2019

2019 International Conference on Robotics and Automation (ICRA)


Due to burdensome data requirements, learning from demonstration often falls short of its promise to allow users to quickly and naturally program robots. Demonstrations are inherently ambiguous and incomplete, making correct generalization to unseen situations difficult without a large number of demonstrations in varying conditions. By contrast, humans are often able to learn complex tasks from a single demonstration (typically observations without action labels) by leveraging context learned over a lifetime. Inspired by this capability, our goal is to enable robots to perform one-shot learning of multi-step tasks from observation by leveraging auxiliary video data as context. Our primary contribution is a novel system that achieves this goal by: (1) using a single user-segmented demonstration to define the primitive actions that comprise a task, (2) localizing additional examples of these actions in unsegmented auxiliary videos via a meta-learning-based approach, (3) using these additional examples to learn a reward function for each action, and (4) performing reinforcement learning on top of the inferred reward functions to learn action policies that can be combined to accomplish the task. We empirically demonstrate that a robot can learn multi-step tasks more effectively when provided auxiliary video, and that performance greatly improves when localizing individual actions, compared to learning from unsegmented videos.


“…As the strength of CNN processing has improved, a recent trend in computer vision is to use a CNN to extract the features of an input image. The extracted features can be used for further location inference [37], [38]. However, a limitation of existing CNN-based visual localization methods is that they do not consider the context of the scene.…”

Section: Related Work (mentioning)

confidence: 99%

Memory Segment Matching Network Based Image Geo-Localization

et al. 2019


Humans and other animals can easily perform self-localization by means of vision. However, that remains a challenging task for computer vision algorithms with traditional image matching methods. In this paper, we propose a memory segment matching network for image geo-localization that is inspired by the discovery of the place cell in the brain by using artificial intelligence. The place cell becomes active when an animal enters a particular location, where the external sensory information in the environment matches features stored in the hippocampus. In order to emulate the operation of the place cell, we employ a convolutional neural network (CNN) and a long short-term memory (LSTM) to extract the visual features of the environment. The extracted features are stored as segmented memory bounded with a location tag. A matching network is utilized to calculate the cross firing probability of the memory segment and the current input visual data. The final prediction of the location is obtained by sending the cross firing probability to an inference engine that uses a hidden Markov model (HMM). According to the simulation results, the localization accuracy reaches up to 95% for the datasets tested, which outperforms the state-of-the-art by 17% in localization detection accuracy.

Index terms: computer vision, image matching, artificial intelligence, memory segment matching network, geo-localization, hidden Markov model (HMM).

