Generalized Decoding for Pixel, Image, and Language
2022 · Preprint
DOI: 10.48550/arxiv.2212.11270

Cited by 4 publications (15 citation statements: 0 supporting, 15 mentioning, 0 contrasting)
References 0 publications

“…For the visual backbone, we adopt pretrained Swin-T/L [34] by default. We also use Focal-T [48] in our ablation studies following [60]. For the language backbone, we adopt the pretrained base model in UniCL [49].…”
Section: Methods (mentioning)
Confidence: 99%
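
The backbone choices quoted above can be made concrete with a minimal sketch. This is not the X-Decoder authors' code: it assumes the timm library's pretrained Swin checkpoints, and it deliberately leaves Focal-T [48] and the UniCL [49] language model as placeholders, since those ship with their own repositories.

```python
# Minimal sketch, assuming timm for the Swin backbones; not the authors' code.
import timm

def build_visual_backbone(name: str = "swin_tiny"):
    """Instantiate the pretrained visual backbone named in the quote (Swin-T/L [34])."""
    timm_ids = {
        "swin_tiny": "swin_tiny_patch4_window7_224",     # Swin-T (default)
        "swin_large": "swin_large_patch4_window12_384",  # Swin-L
    }
    if name in timm_ids:
        # num_classes=0 drops the classifier head so the model yields features.
        return timm.create_model(timm_ids[name], pretrained=True, num_classes=0)
    if name == "focal_tiny":
        # Focal-T [48] is used in the cited ablations; loading its checkpoint is
        # repo-specific, so it is left unimplemented here rather than invented.
        raise NotImplementedError("Load Focal-T from its official repository.")
    raise ValueError(f"unknown backbone: {name!r}")
```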
“…For the language backbone, we adopt the pretrained base model in UniCL [49]. Particularly, our model only uses these pretrained backbones and does not use other image-text pairs or grounding data for pretraining [29,60]. During pretraining, we set a minibatch for segmentation to 32 and detection to 64, and the image resolution is 1024 × 1024 for both segmentation and detection.…”
Section: Methods (mentioning)
Confidence: 99%
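
The pretraining settings in this quote (segmentation minibatch 32, detection minibatch 64, 1024 × 1024 inputs for both tasks) fit naturally into a small config object. The sketch below uses invented field names, not identifiers from the X-Decoder codebase:

```python
# Hypothetical config mirroring the quoted settings; field names are invented.
from dataclasses import dataclass

@dataclass(frozen=True)
class PretrainConfig:
    seg_batch_size: int = 32                     # minibatch for segmentation
    det_batch_size: int = 64                     # minibatch for detection
    image_size: tuple[int, int] = (1024, 1024)   # resolution for both tasks

cfg = PretrainConfig()
assert cfg.det_batch_size == 2 * cfg.seg_batch_size  # detection batch is twice segmentation's
```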