Amit H. Bermano scite author profile

Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn our cat into a painting, or imagine a new product based on our favorite toy? Here we present a simple approach that allows such creative freedom. Using only 3-5 images of a user-provided concept, like an object or a style, we learn to represent it through new "words" in the embedding space of a frozen text-to-image model. These "words" can be composed into natural language sentences, guiding personalized creation in an intuitive way. Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts. We compare our approach to a wide range of baselines, and demonstrate that it can more faithfully portray the concepts across a range of applications and tasks. Our code, data and new words will be available at: https://textual-inversion. github.io → Input samples invert −−−−→ "S * " "An oil painting of S * " "App icon of S * " "Elmo sitting in the same pose as S * " "Crochet S * " → Input samples invert −−−−→ "S * " "Painting of two S * fishing on a boat" "A S * backpack" "Banksy art of S * " "A S * themed lunchbox"Preprint. Under review.

show abstract

ClipCap: CLIP Prefix for Image Captioning

Mokady¹,

Hertz²,

Bermano³

2021

Preprint

108

137

View full text Add to dashboard Cite

Image captioning is a fundamental task in visionlanguage understanding, where the model predicts a textual informative caption to a given input image. In this paper, we present a simple approach to address this task. We use CLIP encoding as a prefix to the caption, by employing a simple mapping network, and then fine-tunes a language model to generate the image captions. The recently proposed CLIP model contains rich semantic features which were trained with textual context, making it best for vision-language perception. Our key idea is that together with a pre-trained language model (GPT2), we obtain a wide understanding of both visual and textual data. Hence, our approach only requires rather quick training to produce a competent captioning model. Without additional annotations or pre-training, it efficiently generates meaningful captions for large-scale and diverse datasets. Surprisingly, our method works well even when only the mapping network is trained, while both CLIP and the language model remain frozen, allowing a lighter architecture with less trainable parameters. Through quantitative evaluation, we demonstrate our model achieves comparable results to state-of-the-art methods on the challenging Conceptual Captions and nocaps datasets, while it is simpler, faster, and lighter. Our code is available in https://github. com/rmokady/CLIP_prefix_caption.

show abstract

Pivotal Tuning for Latent-based Editing of Real Images

Roich¹,

Mokady²,

Bermano³

et al. 2021

Preprint

125

View full text Add to dashboard Cite

Recently, a surge of advanced facial editing techniques have been proposed that leverage the generative power of a pre-trained StyleGAN. To successfully edit an image this way, one must first project (or invert) the image into the pre-trained generator's domain. As it turns out, however, StyleGAN's latent space induces an inherent tradeoff between distortion and editability, i.e. between maintaining the original appearance and convincingly altering some of its attributes. Practically, this means it is still challenging to apply ID-preserving facial latent-space editing to faces which are out of the generator's domain. In this paper, we present an approach to bridge this gap. Our technique slightly alters the generator, so that an out-of-domain image is faithfully mapped into an in-domain latent code. The key idea is pivotal tuning -a brief training process that preserves the editing quality of an in-domain latent region, while changing its portrayed identity and appearance. In Pivotal Tuning Inversion (PTI), an initial inverted latent code serves as a pivot, around which the generator is finedtuned. At the same time, a regularization term keeps nearby identities intact, to locally contain the effect. This surgical training process ends up altering appearance features that represent mostly identity, without affecting editing capabilities. To supplement this, we further show that pivotal tuning can also adjust the generator to accommodate a multitude of faces, while introducing negligible distortion on the rest of the domain. We validate our technique through inversion and editing metrics, and show preferable scores to state-of-the-art methods. We further qualitatively demonstrate our technique by applying advanced edits (such as pose, age, or expression) to numerous images of well-known and recognizable identities. Finally, we demonstrate resilience to harder cases, including heavy make-up, elaborate hairstyles and/or headwear, which otherwise could not have been successfully inverted and edited by state-of-the-art methods.

show abstract

Pivotal Tuning for Latent-based Editing of Real Images

et al. 2022

View full text Add to dashboard Cite

Recently, numerous facial editing techniques have been proposed that leverage the generative power of a pretrained StyleGAN. To successfully edit an image this way, one must first project (or invert) the image into the pretrained generator’s domain. As it turns out, StyleGAN’s latent space induces an inherent tradeoff between distortion and editability, i.e., between maintaining the original appearance and convincingly altering its attributes. Hence, it remains challenging to apply ID-preserving edits to real facial images. In this paper, we present an approach to bridge this gap. The idea is pivotal tuning — a brief training process that preserves editing quality, while surgically changing the portrayed identity and appearance. In Pivotal Tuning Inversion (PTI), an initial inverted latent code serves as a pivot, around which the generator is fine-tuned. At the same time, a regularisation term keeps nearby identities intact, to locally contain the effect. We further show that pivotal tuning also applies to accommodating for a multitude of faces, while introducing negligible distortion on the rest of the domain. We validate our technique through inversion and editing metrics, and show preferable scores to state-of-the-art methods. Lastly, we present successful editing for harder cases, including elaborate make-up or headwear

show abstract

MotionCLIP: Exposing Human Motion Generation to CLIP Space

et al. 2022

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Amit H. Bermano

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

ClipCap: CLIP Prefix for Image Captioning

Pivotal Tuning for Latent-based Editing of Real Images

Pivotal Tuning for Latent-based Editing of Real Images

MotionCLIP: Exposing Human Motion Generation to CLIP Space

Contact Info

Product

Resources

About