Temporal Dynamic Graph LSTM for Action-Driven Video Object Detection

Yuan, Yuan; Liang, Xiaodan; Wang, Xiaolong; Yeung, Dit-Yan; Gupta, Abhinav

doi:10.1109/iccv.2017.200

Cited by 82 publications

(54 citation statements)

References 45 publications

“…There are on average 6.8 action labels for a video. The official Charades dataset doesn't provide object bounding box annotations and we use the annotations released by [50]. In the released annotations, 1,812 test videos are down-sampled to 1 frame per second (fps) and 17 object classes are labeled with bounding boxes on these frames.…”

Section: Methods (mentioning)

confidence: 99%

“…We report per-class average precision (AP) at intersection-over-union (IoU) of 0.5 between detection and ground truth boxes, and also mean AP (mAP) as a combined metric, following the tradition of [50]. We also report CorLoc [9], a commonly-used weakly supervised detection metric.…”

Section: Methods (mentioning)

confidence: 99%
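
The IoU criterion mentioned in the statement above can be illustrated with a minimal sketch. It assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max) and counts a detection as correct when IoU is at least 0.5. The box convention, the function names, and the inclusive threshold are assumptions made for illustration and are not taken from the cited papers.

```python
from typing import Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of a box; degenerate boxes count as zero."""
    width = max(0.0, box[2] - box[0])
    height = max(0.0, box[3] - box[1])
    return width * height


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (if any).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def is_match(detection: Box, ground_truth: Box, threshold: float = 0.5) -> bool:
    """Hypothetical helper: does the detection meet the IoU threshold?"""
    return iou(detection, ground_truth) >= threshold
```

Per-class AP is then computed from the ranked detections that pass this test, and CorLoc is commonly reported as the fraction of images in which the top-scoring box for a class passes it.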

“…Yuan et al. [50] proposed a much more efficient action-driven weakly supervised object detection setting which aims to learn the object appearance representation given only videos with clip-level action class labels. They proposed to first extract spatial features from object proposals.…”

Section: Related Work (mentioning)

confidence: 99%

“…"cup" in the action "drink from cup"). Yuan et al. [50] leveraged this prop- [figure labels: Spatial correlation between subject and object; Object appearance consistency; "hold vacuum"; "fix vacuum"]…”

Section: Introduction (mentioning)

confidence: 99%

“…We conducted comprehensive experiments over two video datasets: Charades [37], EPIC KITCHENS [7] and an image dataset: HICO-DET [5]. Our method outperforms the previous methods [3,50,41] by a large margin on all datasets. Specifically, we have achieved a 6% mAP boost on Charades compared to current state-of-the-art weakly supervised models for videos.…”

Section: Introduction (mentioning)

confidence: 99%

Activity Driven Weakly Supervised Object Detection

Yang, Mahajan, Ghadiyaram, et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Weakly supervised object detection aims at reducing the amount of supervision required to train detection models. Such models are traditionally learned from images/videos labelled only with the object class and not the object bounding box. In our work, we try to leverage not only the object class labels but also the action labels associated with the data. We show that the action depicted in the image/video can provide strong cues about the location of the associated object. We learn a spatial prior for the object dependent on the action (e.g. "ball" is closer to "leg of the person" in "kicking ball"), and incorporate this prior to simultaneously train a joint object detection and action classification model. We conducted experiments on both video datasets and image datasets to evaluate the performance of our weakly supervised object detection model. Our approach outperformed the current state-of-the-art (SOTA) method by more than 6% in mAP on the Charades video dataset.

Videos as Space-Time Region Graphs

Wang, Gupta 2018

Lecture Notes in Computer Science

663

532

Learning Human-Object Interactions by Graph Parsing Neural Networks

Yang, Jia, et al. 2018

Lecture Notes in Computer Science

445

360

This paper addresses the task of detecting and recognizing human-object interactions (HOI) in images and videos. We introduce the Graph Parsing Neural Network (GPNN), a framework that incorporates structural knowledge while being differentiable end-to-end. For a given scene, GPNN infers a parse graph that includes i) the HOI graph structure represented by an adjacency matrix, and ii) the node labels. Within a message passing inference framework, GPNN iteratively computes the adjacency matrices and node labels. We extensively evaluate our model on three HOI detection benchmarks on images and videos: HICO-DET, V-COCO, and CAD-120 datasets. Our approach significantly outperforms state-of-art methods, verifying that GPNN is scalable to large datasets and applies to spatial-temporal settings. The code is available at https://github.com/SiyuanQi/gpnn.
