Video Relation Detection with Spatio-Temporal Graph

Qian, Xufeng; Zhuang, Yueting; Li, Yimeng; Xiao, Shaoning; Pu, Shiliang; Xiao, Jun

doi:10.1145/3343031.3351058

Cited by 71 publications

(71 citation statements)

References 11 publications

“…or tree-near-?. Such intuition has been empirically shown benefits in boosting SGG [62,7,28,30,29,71,20,73,58,13,44,59,45]. More specifically, these methods use a conditional random field [79] to model the joint distribution of nodes and edges, where the context is incorporated by message passing among the nodes through edges via a multi-step meanfield approximation [26]; then, the model is optimized by the sum of cross-entropy (XE) losses of nodes (e.g., objects) and edges (e.g., relationships).…”

Section: Introduction (mentioning)

confidence: 94%

Counterfactual Critic Multi-Agent Training for Scene Graph Generation

Chen, Zhang, Xiao, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Self Cite

Scene graphs (objects as nodes and visual relationships as edges) describe the whereabouts and interactions of objects in an image for comprehensive scene understanding. To generate coherent scene graphs, almost all existing methods exploit the fruitful visual context by modeling message passing among objects. For example, "person" on "bike" can help to determine the relationship "ride", which in turn contributes to the confidence of the two objects. However, we argue that the visual context is not properly learned by using the prevailing cross-entropy based supervised learning paradigm, which is not sensitive to graph inconsistency: errors at the hub or non-hub nodes should not be penalized equally. To this end, we propose a Counterfactual critic Multi-Agent Training (CMAT) approach. CMAT is a multi-agent policy gradient method that frames objects into cooperative agents, and then directly maximizes a graph-level metric as the reward. In particular, to assign the reward properly to each agent, CMAT uses a counterfactual baseline that disentangles the agent-specific reward by fixing the predictions of other agents. Extensive validations on the challenging Visual Genome benchmark show that CMAT achieves a state-of-the-art performance by significant gains under various settings and metrics.

“…The idea of multiple hypothesis is first applied to this task by [1] which generates hypothesis for each object pair when performing association. [16] built a spatio-temporal graph between adjacent video segments and used multiple layers of graph convolutional networks to pass messages between graph nodes. Besides, they proposed an online association method with a siamese network and obtained the state-of-the-art results by combining these two parts.…”

Section: Related Work (mentioning)

confidence: 99%

“…relational association, which has the greatest difference between relation detection on video and image. The association method in [1] cannot satisfactorily handle various different predicates between each object pair while the siamese network in [16] only adds an appearance similarity score to the original greedy association method but suffers from extra complexity in the training process. In this paper, we differ from the framework of greedy association and propose a brand new effective association method which requires no training process.…”

Section: Related Work (mentioning)

confidence: 99%

“…Unlike visual relation detection in image (ImgVRD) that has been widely studied for years [5,13,15,33-35], its counterpart in video domain has just attracted researchers' attention [16,19,23]. Video visual relation detection (VidVRD) requires to track the objects and their pairwise relations in a video.…”

Section: Introduction (mentioning)

confidence: 99%

“…However, such methods unavoidably produces inaccurate prediction and missing detection because of their heavy reliance on the performance of the prediction models. Though these models can be improved over short video segments by considering spatio-temporal context [16,25], they may still suffer from the bias and noise in learning and modeling long-tail data distribution, which is quite common in visual relations [9,18]. Alternatively, we take a different perspective by studying a more robust inference algorithm through multiple hypotheses.…”

Section: Introduction (mentioning)

confidence: 99%

Video Relation Detection via Multiple Hypothesis Association

Shang, Chen, et al. 2020

Proceedings of the 28th ACM International Conference on Multimedia

Video visual relation detection (VidVRD) aims at obtaining not only the trajectories of objects but also the dynamic visual relations between them. It provides abundant information for video understanding and can serve as a bridge between vision and language. Compared with visual relation detection on image, VidVRD requires one more step at last called visual relation association which associates relation segments across time dimension into video relations. This step plays an important role in the task but is less studied. Nevertheless, visual relation association is a difficult task as the association process is easily affected by inaccurate tracklet detection and relation prediction in the former steps. In this paper, we propose a novel relation association method called Multiple Hypothesis Association (MHA). It maintains multiple possible relation hypothesis during the association process in order to tolerate and handle the inaccurate or missing problem in the former steps and generate more accurate video relations. Our experiments on the benchmark datasets (Imagenet-VidVRD and VidOR) show that our method outperforms the state-of-the-art methods.
