Generalized Decoding for Pixel, Image, and Language
2022 · Preprint
DOI: 10.48550/arxiv.2212.11270

Cited by 4 publications (15 citation statements: 0 supporting, 15 mentioning, 0 contrasting)
References 0 publications

“…For the visual backbone, we adopt pretrained Swin-T/L [34] by default. We also use Focal-T [48] in our ablation studies following [60]. For the language backbone, we adopt the pretrained base model in UniCL [49].…”
Section: Methods (mentioning)
Confidence: 99%
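
The backbone choices quoted above can be made concrete with a minimal sketch. This is not the X-Decoder authors' code: it assumes the timm library's pretrained Swin checkpoints, and it deliberately leaves Focal-T [48] and the UniCL [49] language model as placeholders, since those ship with their own repositories.

```python
# Minimal sketch, assuming timm for the Swin backbones; not the authors' code.
import timm

def build_visual_backbone(name: str = "swin_tiny"):
    """Instantiate the pretrained visual backbone named in the quote (Swin-T/L [34])."""
    timm_ids = {
        "swin_tiny": "swin_tiny_patch4_window7_224",     # Swin-T (default)
        "swin_large": "swin_large_patch4_window12_384",  # Swin-L
    }
    if name in timm_ids:
        # num_classes=0 drops the classifier head so the model yields features.
        return timm.create_model(timm_ids[name], pretrained=True, num_classes=0)
    if name == "focal_tiny":
        # Focal-T [48] is used in the cited ablations; loading its checkpoint is
        # repo-specific, so it is left unimplemented here rather than invented.
        raise NotImplementedError("Load Focal-T from its official repository.")
    raise ValueError(f"unknown backbone: {name!r}")
```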
“…For the language backbone, we adopt the pretrained base model in UniCL [49]. Particularly, our model only uses these pretrained backbones and does not use other image-text pairs or grounding data for pretraining [29,60]. During pretraining, we set a minibatch for segmentation to 32 and detection to 64, and the image resolution is 1024 × 1024 for both segmentation and detection.…”
Section: Methods (mentioning)
Confidence: 99%
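
The pretraining settings in this quote (segmentation minibatch 32, detection minibatch 64, 1024 × 1024 inputs for both tasks) fit naturally into a small config object. The sketch below uses invented field names, not identifiers from the X-Decoder codebase:

```python
# Hypothetical config mirroring the quoted settings; field names are invented.
from dataclasses import dataclass

@dataclass(frozen=True)
class PretrainConfig:
    seg_batch_size: int = 32                     # minibatch for segmentation
    det_batch_size: int = 64                     # minibatch for detection
    image_size: tuple[int, int] = (1024, 1024)   # resolution for both tasks

cfg = PretrainConfig()
assert cfg.det_batch_size == 2 * cfg.seg_batch_size  # detection batch is twice segmentation's
```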