2021
DOI: 10.48550/arxiv.2111.14447
Preprint

ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic

Abstract: Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences. While such models can provide a powerful score for matching and subsequent zero-shot tasks, they are not capable of generating a caption given an image. In this work, we repurpose such models to generate a descriptive text given an image at inference time, without any further training or tuning step. This is done by combining the visual-semantic model with a large language model, benefitin…

Cited by 6 publications (15 citation statements)
References 45 publications

“…It should be emphasized that MAGIC Search allows us to directly plug visual controls into the decoding process of the language model, without the need of extra supervised training [14] or gradient update on additional features [14,78]. This property makes our method much more computationally efficient than previous approaches as demonstrated in our experiments (Section §4.1).…”
Section: MAGIC Search
confidence: 94%
“…It has shown impressive zero-shot capabilities on various vision-language tasks and can open new avenues for answering the former question. ZeroCap [78] is the most related to our work. It is built on a pre-trained CLIP model together with the GPT-2 language model [60].…”
Section: Image Captioning
confidence: 99%
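The citation statements above describe the common mechanism behind ZeroCap and MAGIC Search: a contrastive image-text scorer (CLIP) steering an autoregressive language model (GPT-2) at inference time, with no extra training. A minimal sketch of that general idea, with toy dictionaries standing in for real log-probabilities and CLIP similarities (the function name, weighting scheme, and all numbers are illustrative, not taken from either paper):

```python
# Hedged sketch: visually guided next-token selection. A language
# model's log-probabilities are combined with an image-text matching
# score for each candidate token, and the best combined candidate wins.
# All scores below are toy values, not real CLIP/GPT-2 outputs.

def guided_next_token(lm_logprobs, visual_scores, alpha=1.0):
    """Pick the next token by adding an image-text matching score
    (weighted by alpha) to the language model's log-probability."""
    combined = {
        tok: lm_logprobs[tok] + alpha * visual_scores.get(tok, 0.0)
        for tok in lm_logprobs
    }
    return max(combined, key=combined.get)

# Toy example: the LM alone slightly prefers "animal", but a visual
# score favoring "dog" (say, for a photo of a dog) flips the choice.
lm = {"animal": -1.0, "dog": -1.3, "car": -4.0}
vis = {"dog": 0.9, "animal": 0.2, "car": -0.5}
print(guided_next_token(lm, vis))  # -> dog  (-1.3 + 0.9 beats -1.0 + 0.2)
```

This is why the quoted statement calls the approach training-free: the visual signal enters only through the per-step score at decoding time, so neither model's weights are updated.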