2021
DOI: 10.48550/arxiv.2112.03162
Preprint

Embedding Arithmetic for Text-driven Image Transformation

Abstract: Latent text representations exhibit geometric regularities, such as the famous analogy: queen is to king what woman is to man. Such structured semantic relations were not demonstrated on image representations. Recent works aiming at bridging this semantic gap embed images and text into a multimodal space, enabling the transfer of text-defined transformations to the image modality. We introduce the SIMAT dataset to evaluate the task of text-driven image transformation. SIMAT contains 6k images and 18k "transform…
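The analogy-style arithmetic the abstract describes can be sketched with toy vectors. This is a minimal illustration, not the paper's implementation: the names, the gallery, and the `lam` scaling parameter are assumptions for the example, and a real pipeline would obtain the embeddings from a CLIP-style multimodal encoder rather than from one-hot attribute vectors.

```python
import numpy as np

# Toy stand-ins for multimodal embeddings; in the paper's setting these
# would come from CLIP-style image/text encoders. Orthogonal basis
# vectors keep the example deterministic.
dog, cat, grass, snow = np.eye(4)

def embed(*parts):
    """Sum attribute vectors and L2-normalize onto the unit sphere."""
    v = np.sum(parts, axis=0)
    return v / np.linalg.norm(v)

# A small "gallery" of image embeddings to retrieve from.
gallery = {
    "dog on grass": embed(dog, grass),
    "dog on snow":  embed(dog, snow),
    "cat on grass": embed(cat, grass),
    "cat on snow":  embed(cat, snow),
}

def transform(image_emb, src_text_emb, tgt_text_emb, lam=1.0):
    # Embedding arithmetic: move the image embedding along the
    # text-defined direction (target minus source), then renormalize.
    v = image_emb + lam * (tgt_text_emb - src_text_emb)
    return v / np.linalg.norm(v)

# "dog on grass" + ("snow" - "grass") should retrieve "dog on snow".
query = transform(gallery["dog on grass"], embed(grass), embed(snow))
best = max(gallery, key=lambda name: float(gallery[name] @ query))
print(best)  # -> dog on snow
```

Retrieval by cosine similarity over the normalized gallery (here just a dot product) is what lets a text-defined edit transfer to the image modality: the transformed embedding lands nearest the image that realizes the target caption.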

Cited by 1 publication (1 citation statement)
References 22 publications (36 reference statements)
“…By training with large-scale data, CLIP performs with high accuracy on tasks such as zero-shot text-to-image retrieval without additional training. The SIMAT Dataset [29] uses the latent space of CLIP to perform semantic operations on images and text.…”
Section: Multi-modal Representations By Contrastive Learning
confidence: 99%