2018
DOI: 10.1007/978-3-030-01228-1_25

Videos as Space-Time Region Graphs

Cited by 631 publications (513 citation statements)
References 64 publications
“…3D-Conv based Methods. 3D-Conv based methods include I3D [2], Non-local I3D [34] and the previous state-of-the-art Non-local I3D + GCN [35]. Although Non-local I3D + GCN leverages multiple techniques, including extra data (MSCOCO) and extra spatio-temporal features (I3D), its performance is still inferior to ours.…”
Section: Comparison With Different Convolutions on Something-Something
confidence: 99%
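The 3D-Conv methods named above (I3D and its variants) all rest on the same primitive: a convolution whose kernel slides over time as well as space. As a minimal illustrative sketch — not the actual I3D implementation, which stacks many such layers with learned multi-channel kernels — a single-channel "valid" 3D convolution can be written as:

```python
import numpy as np

def conv3d(video, kernel):
    """Naive valid-mode 3D cross-correlation over a single-channel video
    of shape (T, H, W) with a kernel of shape (kt, kh, kw).
    Real networks (e.g. I3D) use many learned multi-channel kernels;
    this loop version only shows the space-time sliding-window idea."""
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                # weighted sum over a space-time neighborhood
                out[t, i, j] = np.sum(video[t:t+kt, i:i+kh, j:j+kw] * kernel)
    return out

video = np.ones((4, 5, 5))           # toy clip: 4 frames of 5x5 pixels
kernel = np.ones((2, 3, 3)) / 18.0   # averaging filter over 2x3x3 window
feat = conv3d(video, kernel)
print(feat.shape)  # (3, 3, 3)
```

Because the kernel spans frames, each output value mixes information across time, which is what lets these models capture motion without an explicit temporal module.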
“…Gilmer et al. [14] later formulated the message passing module in GNNs as a learnable neural network. Recently, GNNs have been successfully applied in many fields, including molecular biology [14], computer vision [48,71,76], machine learning [62] and natural language processing [2]. Another popular trend in GNNs is to generalize the convolutional architecture over arbitrary graph-structured data [10,40,26], which is called the graph convolutional neural network (GCNN).…”
Section: Graph Neural Network
confidence: 99%
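The graph convolution mentioned above reduces, in its simplest common form, to one round of message passing: average each node's neighborhood features (including itself) and apply a learned linear transform plus nonlinearity. A minimal sketch, assuming the mean-aggregation variant with a toy 3-node graph (the weight matrix `W` here is just the identity for illustration):

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph-convolution layer: add self-loops to the adjacency,
    row-normalize by degree, aggregate neighbor features, transform by W,
    and apply ReLU. A sketch of the common mean-aggregation GCN form."""
    A_hat = A + np.eye(A.shape[0])             # self-loops keep own features
    D_inv = np.diag(1.0 / A_hat.sum(axis=1))   # inverse degree matrix
    return np.maximum(D_inv @ A_hat @ X @ W, 0.0)

# toy chain graph: node 0 -- node 1 -- node 2
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.array([[1., 0.],     # per-node 2-d input features
              [0., 1.],
              [1., 1.]])
W = np.eye(2)               # identity transform for readability

H = gcn_layer(A, X, W)
print(H)  # each row is the mean of that node's closed neighborhood
```

Stacking such layers lets information propagate over longer paths in the graph, which is the mechanism the region-graph approaches above exploit over object nodes.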
“…One limitation of this approach is that objects are detected by an object detector pre-trained on extra training data with a closed vocabulary. In contrast to [47], we build a category-agnostic relation module to detect any context that is highly related to the action in a weakly-supervised manner, without the need for object detectors or extra ... annotations. The most closely related work is [40], which treats each location in the feature map of the image as an object proxy and computes actor-object relation feature maps via an additional convolutional layer.…”
Section: Related Work
confidence: 99%
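The detector-free idea described above — treating every feature-map location as an object proxy — can be sketched as a soft attention from an actor feature over all spatial positions. This is an illustrative approximation, not the cited papers' exact module (which uses an additional learned convolutional layer rather than raw dot-product similarity):

```python
import numpy as np

def actor_context_relation(actor, fmap):
    """Weight every spatial location of a (H, W, C) feature map by its
    dot-product similarity to a C-dim actor feature, softmax over
    locations, and return the attended context vector plus the
    attention map. A hypothetical sketch of location-as-object-proxy
    relation reasoning; no object detector is involved."""
    H, W, C = fmap.shape
    flat = fmap.reshape(-1, C)
    sims = flat @ actor                     # similarity per location
    weights = np.exp(sims - sims.max())     # numerically stable softmax
    weights /= weights.sum()
    context = (weights[:, None] * flat).sum(axis=0)
    return context, weights.reshape(H, W)

# toy example: one location strongly matches the actor feature
fmap = np.zeros((2, 2, 3))
fmap[0, 0] = [10.0, 0.0, 0.0]
actor = np.array([1.0, 0.0, 0.0])
context, attn = actor_context_relation(actor, fmap)
```

Because the attention is computed over all locations, any context correlated with the action can contribute, which is what makes the module category-agnostic.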