Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding

Jiang, Haojun; Lin, Yuanze; Han, Dongchen; Song, Sejun; Huang, Gao

doi:10.1109/cvpr52688.2022.01507

Cited by 22 publications

(10 citation statements)

References 42 publications

“…Pseudo-query generation is critical in zero-shot localization methods, although limited work has been done in this direction. Nam et al (2021) introduce pseudo-query generation for video localization, and subsequently, Jiang et al (2022) for language grounding in images. Nam et al (2021) consider a pseudo-query to be an unordered list of nouns and verbs, obtained from an off-the-shelf object detector and a fine-tuned language model (LM) that predicts the most probable verbs conditioned on the nouns.…”

Section: Weakly Supervised and Zero-shot NLVL Methods (mentioning)

confidence: 99%

Commonsense for Zero-Shot Natural Language Video Localization

Holla, Lourentzou. 2024. AAAI

Zero-shot Natural Language-Video Localization (NLVL) methods have exhibited promising results in training NLVL models exclusively with raw video data by dynamically generating video segments and pseudo-query annotations. However, existing pseudo-queries often lack grounding in the source video, resulting in unstructured and disjointed content. In this paper, we investigate the effectiveness of commonsense reasoning in zero-shot NLVL. Specifically, we present CORONET, a zero-shot NLVL framework that leverages commonsense to bridge the gap between videos and generated pseudo-queries via a commonsense enhancement module. CORONET employs Graph Convolution Networks (GCN) to encode commonsense information extracted from a knowledge graph, conditioned on the video, and cross-attention mechanisms to enhance the encoded video and pseudo-query representations prior to localization. Through empirical evaluations on two benchmark datasets, we demonstrate that CORONET surpasses both zero-shot and weakly supervised baselines, achieving improvements up to 32.13% across various recall thresholds and up to 6.33% in mIoU. These results underscore the significance of leveraging commonsense reasoning for zero-shot NLVL.

“…These fully supervised REC, however, depends on large annotated datasets. Weakly supervised methods (Liu et al, 2019; Sun et al, 2021) don't require manually annotated bounding boxes and unsupervised methods (Jiang et al, 2022) that require neither manually annotated bounding boxes nor queries have also been studied. Pseudo-Q (Jiang et al, 2022) proposed a method for generating pseudo queries with objects, attributes, and spatial relationships as key components, outperforming the weakly supervised methods.…”

Section: Referring Expression Comprehension (mentioning)

confidence: 99%

ARKitSceneRefer: Text-based Localization of Small Objects in Diverse Real-World 3D Indoor Scenes

Kato, Kurita, Chu, et al. 2023. Findings of the Association for Computational Linguistics: EMNLP 2023

3D referring expression comprehension is a task to ground text representations onto objects in 3D scenes. It is a crucial task for indoor household robots or augmented reality devices to localize objects referred to in user instructions. However, existing indoor 3D referring expression comprehension datasets typically cover larger object classes that are easy to localize, such as chairs, tables, or doors, and often overlook small objects, such as cooking tools or office supplies. Based on the recently proposed diverse and high-resolution 3D scene dataset of ARKitScenes, we construct the ARKitSceneRefer dataset focusing on small daily-use objects that frequently appear in real-world indoor scenes. ARKitSceneRefer contains 15k objects of 1,605 indoor scenes, which are significantly larger than those of the existing 3D referring datasets, and covers diverse object classes of 583 from the LVIS dataset. In empirical experiments with both 2D and 3D state-of-the-art referring expression comprehension models, we observed the task difficulty of the localization in the diverse small object classes.

“…The task of Referring Expression Comprehension (ReC) plays a crucial role in applications such as robot navigation and visual question answering. ReC methods can be roughly classified into three types: fully supervised (Deng et al, 2021; …), weakly supervised (Gupta et al, 2020; Liu et al, 2019a; Sun et al, 2021), and unsupervised (Jiang et al, 2022; Subramanian et al, 2022; Wang and Specia, 2019; Yeh et al, 2018).…”

Section: Referring Expression Comprehension (mentioning)

confidence: 99%

“…To address the annotation challenges, several works (Yeh et al, 2018; Wang and Specia, 2019; Jiang et al, 2022) have explored unsupervised approaches that do not rely on paired annotations. Nonetheless, these approaches either employ statistical hypothesis testing, make simple assumptions, or only investigate the shallow relation between objects, leading to poor performance in complex scenes.…”

Section: Introduction (mentioning)

confidence: 99%

“…Nonetheless, these approaches either employ statistical hypothesis testing, make simple assumptions, or only investigate the shallow relation between objects, leading to poor performance in complex scenes. For instance, Pseudo-Q (Jiang et al, 2022) adopts a method of generating pseudo-labels to train a supervised model. It utilizes an offline object detector to extract salient objects from images and generates pseudo queries using predefined templates with object labels and attributes.…”

Section: Introduction (mentioning)

confidence: 99%
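
The template-based generation described in the statement above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the Pseudo-Q implementation: the `Detection` record, the three phrase templates, and the left/right rule based on box centers are hypothetical stand-ins for the detector outputs, attributes, and spatial relationships that the citing papers mention.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Detection:
    """Hypothetical output of an off-the-shelf object detector."""
    label: str
    box: Box
    attribute: Optional[str] = None  # e.g. a color predicted by an attribute classifier


def horizontal_position(det: Detection, same_class: List[Detection]) -> Optional[str]:
    """Return 'left' or 'right' if det is the extreme instance of its class, else None."""
    if len(same_class) < 2:
        return None
    centers = sorted((d.box[0] + d.box[2]) / 2 for d in same_class)
    center = (det.box[0] + det.box[2]) / 2
    if center == centers[0]:
        return "left"
    if center == centers[-1]:
        return "right"
    return None


def generate_pseudo_queries(detections: List[Detection]) -> List[Tuple[str, Box]]:
    """Fill simple templates to build (pseudo query, box) training pairs."""
    pairs = []
    for det in detections:
        same_class = [d for d in detections if d.label == det.label]
        phrases = []
        if len(same_class) == 1:
            # The bare label is unambiguous only when the class occurs once.
            phrases.append(det.label)
        if det.attribute:
            phrases.append(f"{det.attribute} {det.label}")
        position = horizontal_position(det, same_class)
        if position:
            phrases.append(f"{position} {det.label}")
            phrases.append(f"{det.label} on the {position}")
        pairs.extend((phrase, det.box) for phrase in phrases)
    return pairs


if __name__ == "__main__":
    example = [
        Detection("dog", (10, 20, 110, 220), "brown"),
        Detection("dog", (300, 30, 420, 210), "white"),
        Detection("frisbee", (200, 50, 240, 80)),
    ]
    for query, box in generate_pseudo_queries(example):
        print(query, box)
```

The resulting (query, box) pairs would then serve as pseudo-labels for training a supervised grounding model, which is the role the citing paper attributes to them.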

Scene Graph Enhanced Pseudo-Labeling for Referring Expression Comprehension

Wu, Cai, et al. 2023. Findings of the Association for Computational Linguistics: EMNLP 2023

Referring Expression Comprehension (ReC) is a task that involves localizing objects in images based on natural language expressions. Most ReC methods typically approach the task as a supervised learning problem. However, the need for costly annotations, such as clear image-text pairs or region-text pairs, hinders the scalability of existing approaches. In this work, we propose a novel scene graph-based framework that automatically generates high-quality pseudo region-query pairs. Our method harnesses scene graphs to capture the relationships between objects in images and generate expressions enriched with relation information. To ensure accurate mapping between visual regions and text, we introduce an external module that employs a calibration algorithm to filter out ambiguous queries. Additionally, we employ a rewriter module to enhance the diversity of our generated pseudo queries through rewriting. Extensive experiments demonstrate that our method outperforms previous pseudo-labeling methods by about 10%, 12%, and 11% on RefCOCO, RefCOCO+, and RefCOCOg, respectively. Furthermore, it surpasses the state-of-the-art unsupervised approach by more than 15% on the RefCOCO dataset.
