Proceedings of the 31st ACM International Conference on Multimedia 2023
DOI: 10.1145/3581783.3612389

Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts

Yunshi Lan,
Xiang Li,
Xin Liu
et al.
Cited by 4 publications (2 citation statements)
References 57 publications

“…Recent developments have witnessed significant progress in the alignment of images with accompanying text, such as Contrastive Language-Image Pretraining (CLIP) [6]. In addition to the multimodal uses of CLIP [7][8][9][10][11], the visual features provided by CLIP have showcased remarkable versatility in diverse applications, such as captioning [12][13][14][15], object detection [16], semantic image segmentation [17], cross-modal retrieval tasks [18][19][20], etc. This wide-ranging utilization underscores the broad applicability and robust performance of CLIP and its derivatives across a spectrum of interdisciplinary challenges.…”
Section: Introduction (mentioning)
Confidence: 99%
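
For readers who want a concrete sense of the image-text alignment that CLIP provides, the following is a minimal sketch of scoring an image against candidate captions with the Hugging Face transformers CLIP API. The checkpoint name, example image URL, and candidate captions are illustrative assumptions and are not drawn from the cited works.

import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; any public CLIP checkpoint would work here.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical example image and candidate captions.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = ["a photo of a cat", "a photo of a dog"]

# Encode both modalities and compute image-text similarity scores.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the similarity of the image to each caption;
# softmax turns the scores into a probability over the candidates.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
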
“…poor translation quality. DDPM resolves this issue by using large-scale pre-training on text-to-image data [2]-[4] and by integrating multimodal information from large-scale language models [13], [14].…”
Section: Introduction (mentioning)
Confidence: 99%