Jingjing Jiang scite author profile

Encouraging progress has been made on Visual Question Answering (VQA) in recent years, but enabling VQA models to generalize adaptively to out-of-distribution (OOD) samples remains challenging. Intuitively, recombining existing visual concepts (i.e., attributes and objects) can generate compositions unseen in the training set, which helps VQA models generalize to OOD samples. In this paper, we formulate OOD generalization in VQA as a compositional generalization problem and propose a graph generative modeling-based training scheme (X-GGM) to handle the problem implicitly. X-GGM leverages graph generative modeling to iteratively generate a relation matrix and node representations for a predefined graph that uses attribute-object pairs as nodes. Furthermore, to alleviate unstable training in graph generative modeling, we propose a gradient distribution consistency loss to constrain the data distribution with adversarial perturbations and the generated distribution. The baseline VQA model (LXMERT) trained with the X-GGM scheme achieves state-of-the-art OOD performance on two standard VQA OOD benchmarks, i.e., VQA-CP v2 and GQA-OOD. Extensive ablation studies demonstrate the effectiveness of X-GGM components.

CCS Concepts: Computing methodologies → Computer vision tasks; Information systems → Question answering.

A joint object detection and semantic segmentation model with cross-attention and inner-attention mechanisms

Nan, Peng, Jiang, et al. 2021

Neurocomputing

Learning to Infer Unseen Attribute-Object Compositions

Chen, Nan, Jiang, et al. 2020

Preprint

Recognizing unseen attribute-object compositions is critical for machines to learn to decompose and compose complex concepts as people do. Most existing methods are limited to recognizing single-attribute-object compositions and can hardly distinguish compositions with similar appearances. In this paper, a graph-based model is proposed that can flexibly recognize both single- and multi-attribute-object compositions. The model maps the visual features of images and the attribute-object category labels, represented by word embedding vectors, into a latent space. Then, under the constraints of the attribute-object semantic association, distances are calculated between visual features and the corresponding label semantic features in the latent space. During inference, the composition closest to the given image feature among all compositions is used as the reasoning result. In addition, we build a large-scale Multi-Attribute Dataset (MAD) with 116,099 images and 8,030 composition categories. Experiments on MAD and two other single-attribute-object benchmark datasets demonstrate the effectiveness of our approach.
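The inference step described in this abstract (picking the composition closest to the image feature in the latent space) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `nearest_composition`, the toy two-dimensional latent vectors, and the use of Euclidean distance are all assumptions made for illustration, since the abstract does not specify the metric or the embedding dimensions.

```python
import math

def nearest_composition(image_feat, comp_feats):
    """Return the name of the composition whose latent vector is closest.

    image_feat: latent vector of the image (list of floats).
    comp_feats: dict mapping a composition name to its latent vector.
    Euclidean distance is an assumption; the abstract does not name a metric.
    """
    return min(comp_feats, key=lambda name: math.dist(image_feat, comp_feats[name]))

# Hypothetical latent embeddings of three attribute-object compositions.
comp_feats = {
    "red apple": [1.0, 0.0],
    "green apple": [0.0, 1.0],
    "red car": [1.0, 1.0],
}

# Hypothetical latent embedding of a query image.
image_feat = [0.9, 0.1]

prediction = nearest_composition(image_feat, comp_feats)
print(prediction)
```

In a real system, both sets of vectors would come from learned encoders rather than hand-written values; the sketch only shows the nearest-neighbor selection itself.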

Learning to Infer Unseen Single-/ Multi-Attribute-Object Compositions With Graph Networks

Chen, Jiang, Zheng, 2023

IEEE Trans. Pattern Anal. Mach. Intell.

LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering

Jiang, Liu, Zheng, 2023

IEEE Trans. Multimedia

X-GGM: Graph Generative Modeling for Out-of-Distribution Generalization in Visual Question Answering

Jiang, Liu, Liu, et al. 2021

Preprint

Line Feature Based Extrinsic Calibration of LiDAR and Camera

Predicting short-term next-active-object through visual attention and hand position
