2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00719

Graph Convolutional Networks for Temporal Action Localization

Abstract: Most state-of-the-art action localization systems process each action proposal individually, without explicitly exploiting their relations during learning. However, the relations between proposals actually play an important role in action localization, since a meaningful action always consists of multiple proposals in a video. In this paper, we propose to exploit the proposal-proposal relations using Graph Convolutional Networks (GCNs). First, we construct an action proposal graph, where each proposal is repre…
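
To make the abstract's idea concrete, below is a minimal sketch of message passing over an action-proposal graph: each proposal becomes a node, edges link temporally overlapping proposals, and one graph-convolution layer aggregates features across neighbors. The IoU-based edge rule, the threshold, the feature sizes, and the single layer are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a graph convolution over an action-proposal graph.
# The edge rule (temporal IoU > threshold), feature sizes, and single
# layer are assumptions for illustration, not the paper's exact setup.
import numpy as np

def temporal_iou(a, b):
    """IoU of two proposals given as (start, end) in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def build_proposal_graph(proposals, iou_thresh=0.3):
    """Adjacency matrix with self-loops: edge if temporal IoU > threshold."""
    n = len(proposals)
    A = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            if temporal_iou(proposals[i], proposals[j]) > iou_thresh:
                A[i, j] = A[j, i] = 1.0
    return A

def gcn_layer(A, X, W):
    """One graph-convolution layer: symmetric normalization, then ReLU."""
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    A_hat = D_inv_sqrt @ A @ D_inv_sqrt
    return np.maximum(A_hat @ X @ W, 0.0)

# Toy usage: 4 proposals with 8-d features, one layer producing 4-d outputs.
proposals = [(0.0, 2.0), (1.5, 3.5), (3.0, 5.0), (10.0, 12.0)]
X = np.random.randn(len(proposals), 8)   # per-proposal features
W = np.random.randn(8, 4) * 0.1          # layer weights
H = gcn_layer(build_proposal_graph(proposals), X, W)
print(H.shape)  # (4, 4)
```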

Cited by 486 publications (251 citation statements) · References 35 publications

Citation statements:
“…First, DEG could be easily combined with an algorithm to track an animal's location in an environment [2], thus allowing the identification of behaviors of interest and where those behaviors occur. Also, while the use of CNNs for classification is standard practice in machine learning, recent works in temporal action detection use widely different sequence modeling approaches and loss functions [29,32,39]. Testing these different approaches in the DEG pipeline could further improve performance.…”
Section: Discussion
confidence: 99%
“…We modeled our approach after temporal action localization methods used in computer vision aimed to solve related problems [32][33][34][35] . The overall architecture of our solution included: 1. estimating motion (optic flow) from a small snippet of video frames, 2. compressing a snippet of optic flow and individual still images into a lower dimensional set of features, 3. using a sequence of the compressed features to estimate the probability of each behavior at each frame in a video (Fig.…”
Section: Modeling Approach
confidence: 99%
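
The three-step pipeline quoted above (motion estimation, feature compression, per-frame behavior probabilities from a feature sequence) can be sketched structurally as follows. Frame differencing stands in for optic flow, a random projection stands in for the learned feature compressor, and a sliding-window softmax stands in for the sequence model; all three stand-ins are assumptions for illustration only.

```python
# Structural sketch of the quoted three-step pipeline. Frame differencing,
# the random projection, and the sliding-window softmax are stand-ins
# (assumptions) for the learned components described in the excerpt.
import numpy as np

def motion_proxy(frames):
    """Step 1: crude motion estimate as absolute frame differences."""
    diffs = np.abs(np.diff(frames, axis=0)).reshape(len(frames) - 1, -1)
    return np.vstack([diffs[:1], diffs])     # pad so output has T rows

def compress(features, dim=16, seed=0):
    """Step 2: project per-frame features to a low-dimensional space."""
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((features.shape[1], dim)) / np.sqrt(dim)
    return features @ P

def behavior_probs(z, n_classes=3, win=9, seed=1):
    """Step 3: per-frame class probabilities from a temporal window."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((z.shape[1], n_classes))   # dummy classifier
    logits = z @ W
    kernel = np.ones(win) / win                         # temporal smoothing
    smoothed = np.stack([np.convolve(logits[:, c], kernel, mode="same")
                         for c in range(n_classes)], axis=1)
    e = np.exp(smoothed - smoothed.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)             # softmax per frame

frames = np.random.rand(120, 32, 32)    # stand-in grayscale video
probs = behavior_probs(compress(motion_proxy(frames)))
print(probs.shape)                      # (120, 3)
```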
“…Temporal action localization aims to detect the temporal boundaries and the categories of action instances in untrimmed videos. The supervised methods [3,27,29,37,44] mainly adopt the two-stage framework, which first produces a series of temporal action proposals, then predicts the action class and regresses their boundaries. Concretely, Shou et al [29] design three segment-based 3D ConvNet to accurately localize action instances and Zhao et al [44] apply a structured temporal pyramid to explore the context structure of actions.…”
Section: Related Work 2.1 Temporal Action Localization
confidence: 99%
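
The two-stage framework summarized in this excerpt (first generate temporal proposals, then classify each proposal and regress its boundaries) can be sketched schematically as follows. The sliding-window proposal generator, its window sizes, and the dummy classifier/regressor are assumptions for illustration, not the method of any cited paper.

```python
# Schematic sketch of a two-stage temporal action localization pipeline,
# assuming per-frame "actionness" scores and per-frame features exist.
# Window sizes, thresholds, and the dummy second stage are illustrative.
import numpy as np

def generate_proposals(actionness, win_sizes=(8, 16, 32), keep=10):
    """Stage 1: slide windows over frame-level scores, rank by mean score."""
    T = len(actionness)
    candidates = []
    for w in win_sizes:
        for start in range(0, T - w + 1, w // 2):
            score = float(actionness[start:start + w].mean())
            candidates.append((start, start + w, score))
    candidates.sort(key=lambda p: p[2], reverse=True)
    return candidates[:keep]

def classify_and_regress(proposal, features):
    """Stage 2 (placeholder): predict a class and refine the boundaries."""
    start, end, _ = proposal
    cls = int(features[start:end].mean(axis=0).argmax())   # dummy classifier
    offset = 0.1 * (end - start)                           # dummy regression
    return cls, start - offset, end + offset

T = 128
actionness = np.random.rand(T)      # stand-in frame-level actionness scores
features = np.random.rand(T, 5)     # stand-in frame features, 5 classes
for prop in generate_proposals(actionness):
    print(classify_and_regress(prop, features))
```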
“…Concretely, Shou et al [29] design three segment-based 3D ConvNet to accurately localize action instances and Zhao et al [44] apply a structured temporal pyramid to explore the context structure of actions. Recently, Chao et al [3] transfer the classical Faster-RCNN framework [26] for action localization and Zeng et al [37] exploit proposal-proposal relations using graph convolutional networks. Under the weakly-supervised setting only with video-level action labels, Wang et al [32] design the classification and selection module to reason about the temporal duration of action instances.…”
Section: Related Work 2.1 Temporal Action Localization
confidence: 99%
“…GNN integrates the advantages of classical graph models and popular neural networks with a strong relation representation and feature learning ability. GNN has been used in many tasks involving relation inference, such as human-object interaction (HOI) [7,29], scene understanding [19,24], human action localization [48] and human gaze communication [22]. GNN was also used to model the different parts of a human or other objects for action recognition [45] and object tracking [8].…”
Section: Graph Neural Network
confidence: 99%