2022
DOI: 10.1007/978-3-031-19812-0_22
Meta Spatio-Temporal Debiasing for Video Scene Graph Generation

Cited by 13 publications
(6 citation statements)
References 43 publications
“…Open-vocabulary Visual Relationship Detection. The task of visual relationship detection in images (Lu et al. 2016) or videos (Shang et al. 2017), which involves classifying and localizing relationship triplets, has become a hot topic in computer vision (Tang et al. 2020; Li et al. 2022b; Cong, Yang, and Rosenhahn 2023; Zheng, Chen, and Jin 2022; Xu et al. 2022; Chen, Xiao, and Chen 2023). The field has also explored zero-shot detection (Shang et al. 2021), where all object and relationship categories are seen during training but certain triplet combinations remain unseen at test time.…”
Section: Related Work
confidence: 99%
“…Gao et al. [12] first proposed a compositional and motion-based relation prompt learning framework (RePro) in the open-vocabulary VidVRD setting. Despite these prior arts, only a few works have recognized the long-tailed predicate distribution as the bottleneck issue for the VidSGG task [20,45].…”
Section: Video-based Scene Graph Generation
confidence: 99%
“…Li et al. [20] proposed a causality-inspired interaction to weaken the false correlation between input data and predicate labels. Xu et al. [45] considered temporal, spatial, and object biases in a meta-learning paradigm. These implicit approaches mitigate the long-tail problem to some extent, but the performance on tail classes remains unsatisfactory.…”
Section: Introduction
confidence: 99%
“…MAML [9], a popular meta-learning method, was originally designed to learn a good weight initialization that can quickly adapt to new tasks at test time, and it has shown promise in few-shot learning. Subsequently, its extension [28], which requires no model updates on unseen testing scenarios, has been applied beyond few-shot learning to enhance model performance [13,2,16,46]. Differently, we propose a novel meta-learning framework to perform more reliable confidence estimation.…”
Section: Related Work
confidence: 99%
“…Meta-learning, also known as "learning to learn", allows us to train a model that generalizes well across different distributions. Specifically, in some meta-learning works [9,28,13,2,16,46], a virtual testing set is used to mimic the testing conditions during training, so that even though training is performed mainly on a virtual training set drawn from the training data, performance on the testing scenario improves. In our work, we construct the virtual testing sets so that they simulate distributions different from the virtual training set, pushing the model to learn distribution-generalizable knowledge that performs well across diverse distributions, rather than distribution-specific knowledge that performs well only on the training distribution.…”
Section: Introduction
confidence: 99%
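The virtual train/test scheme quoted above can be illustrated with a toy first-order MAML-style loop. This is a hypothetical sketch, not the cited papers' actual models: the task, the linear model, the distribution shift, and all learning rates are made-up assumptions chosen only to show the mechanism of adapting on a virtual training set and updating the initialization via the loss on a differently distributed virtual testing set.

```python
# Hypothetical illustration of meta-learning with virtual train/test splits.
# Each meta-iteration: (1) sample a virtual training set from the base
# distribution, (2) sample a virtual testing set from a shifted distribution
# to mimic unseen test conditions, (3) take an inner gradient step on the
# virtual training set, (4) update the initialization using the adapted
# weights' loss on the virtual testing set (first-order approximation).
import numpy as np

rng = np.random.default_rng(0)

def loss_grad(w, X, y):
    """Mean squared error of a linear model and its gradient w.r.t. w."""
    err = X @ w - y
    return float(np.mean(err ** 2)), 2 * X.T @ err / len(y)

def sample_task(shift):
    """Sample a toy regression task; `shift` moves the input distribution."""
    X = rng.normal(loc=shift, size=(32, 3))
    w_true = np.array([1.0, -2.0, 0.5])     # assumed ground-truth weights
    y = X @ w_true + 0.1 * rng.normal(size=32)
    return X, y

w = np.zeros(3)                 # meta-learned initialization
inner_lr, outer_lr = 0.05, 0.05

for step in range(200):
    Xtr, ytr = sample_task(shift=0.0)   # virtual training set
    Xte, yte = sample_task(shift=1.0)   # virtual testing set (shifted)

    # Inner step: adapt on the virtual training set.
    _, g_tr = loss_grad(w, Xtr, ytr)
    w_fast = w - inner_lr * g_tr

    # Outer step: evaluate the adapted weights on the virtual testing set
    # and update the initialization (first-order, no second derivatives).
    _, g_te = loss_grad(w_fast, Xte, yte)
    w = w - outer_lr * (g_tr + g_te)

final_loss, _ = loss_grad(w, *sample_task(shift=1.0))
print(final_loss)
```

Because the outer loss is computed on data whose distribution differs from the inner-loop data, the initialization is pushed toward weights that transfer across distributions rather than fitting only the training one, which is the intuition the quoted passage describes.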