Proceedings of the 25th ACM International Conference on Multimedia 2017
DOI: 10.1145/3123266.3123380
Video Visual Relation Detection

Cited by 115 publications (144 citation statements)
References 34 publications
“…One of the key challenges of learning relationships in videos has been the lack of relevant annotated datasets. In this context, the recent work of [29] is inspiring as it contributes manually annotated relations for the ImageNet video dataset. Our work improves upon [29] on multiple fronts: (1) Instead of assuming no temporal contingency between relationships, we introduce a gated fully-connected spatio-temporal energy graph for modeling the inherently rich structure of videos; (2) We extend the study of relation triplets from subject/predicate/object to a more general setting, such as object/verb/scene [32]; (3) We consider a new task, 'relation recognition' (apart from relation detection and tagging), which requires the model to make predictions in a fine-grained manner; (4) For various metrics and tasks, our model demonstrates improved performance.…”
Section: Related Work (mentioning, confidence: 99%)
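The subject/predicate/object triplet, extended with a temporal extent for video, is the annotation unit discussed in the passage above. As a rough illustration, a single annotated relation instance might be represented as in this minimal sketch; the class and field names are assumptions for illustration, not the actual VidVRD annotation schema.

```python
from dataclasses import dataclass

@dataclass
class RelationInstance:
    """One video relation triplet with its temporal extent.

    Names here are illustrative assumptions; the actual VidVRD
    annotation format may differ.
    """
    subject: str      # subject category, e.g. "dog"
    predicate: str    # relation, e.g. "chase"
    object: str       # object category, e.g. "frisbee"
    begin_frame: int  # first frame where the relation holds
    end_frame: int    # last frame where the relation holds

# Example: the triplet <dog, chase, frisbee> holding over frames 12..85.
rel = RelationInstance("dog", "chase", "frisbee", 12, 85)
print(f"<{rel.subject}, {rel.predicate}, {rel.object}> "
      f"frames {rel.begin_frame}-{rel.end_frame}")
```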
“…Evaluation for different methods on the ImageNet Video dataset. * denotes the re-implementation of [29] after fixing bugs in their released code (confirmed by contacting the authors). † denotes the implementation with an additional triplet loss term for language priors [20].…”
Section: Inference Message Passing and Learning (mentioning, confidence: 99%)
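The "additional triplet loss term for language priors [20]" mentioned above is typically a margin-based ranking objective over triplet scores. Below is a hedged sketch in that spirit, not the exact published loss of [20]; the function name and the toy scores are invented for illustration.

```python
import torch
import torch.nn.functional as F

def language_prior_margin_loss(pos_scores: torch.Tensor,
                               neg_scores: torch.Tensor,
                               margin: float = 1.0) -> torch.Tensor:
    """Margin ranking loss over relation-triplet scores.

    Pushes each ground-truth triplet score (pos_scores) to exceed a
    sampled incorrect triplet's score (neg_scores) by at least
    `margin`. A generic sketch, not the exact formulation of [20].
    """
    return F.relu(margin + neg_scores - pos_scores).mean()

# Toy usage: scores for four ground-truth triplets vs. four negatives.
pos = torch.tensor([2.1, 1.7, 0.9, 3.0])
neg = torch.tensor([1.5, 1.9, 0.2, 1.0])
print(language_prior_margin_loss(pos, neg))  # scalar loss tensor
```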
“…Nonetheless, all of them were curated based only on textual resources, neglecting the rich information in visual data. Thereafter, much effort has been devoted to extracting knowledge from visual data, as in NEIL [4], Visual Genome [13] and VidVRD [21]. Even though much research has targeted extracting knowledge from both textual and visual data, few works aim to extract knowledge in vertical domains such as fashion.…”
Section: Related Work (mentioning, confidence: 99%)