Text2LIVE: Text-Driven Layered Image and Video Editing

Bar-Tal, Omer; Ofri-Amar, Dolev; Fridman, Rafail; Kasten, Yoni; Dekel, Tali

doi:10.1007/978-3-031-19784-0_41

Cited by 109 publications

(58 citation statements)

References 57 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Social interaction ever more takes place in mixed reality in the dawn of the metaverse (Zhang et al, 2023;Filipova, 2023). AI is transforming all scales of cognitive and social identities and interaction into life as we don't know it (Bar-Tal et al, 2022;Gilson et al, 2022;Chen, Hu, Saharia and Cohen, 2022;Cahan and Treutlein, 2023;King and chatGPT, 2023). This section argues that the ways in which one navigates AI-permated environments can be understood as a multiscale bio-cultural form of Augmented Cognition (AugCog).…”

Section: Preprint -Please Cite the Originalmentioning

confidence: 99%

Augmented Cognition: Life as we don't know it

Hipólito¹

2023

Preprint

View full text Add to dashboard Cite

This paper proposes a framework for comprehending the integration of Artificial Intelligence (AI) as Augmented Cognition (AugCog). AugCog is viewed as an emergent bio-cultural process, reflecting AI's design, implementation, and usage. Section 1 establishes smart societies as a complex system. Section 2 defends that the development of AI is analogous to the biological process of niche construction. Section 3 defines AI as a socioculturally embodied expansion by which AI is shaped by and shapes human experience, resulting in various forms of AugCog. Section 4 highlights AugCog in mixed realities such as social media, neurotechnology, and smart environments, illustrating its emergence from multiscale and interdependent sociocultural perspectives. AugCog's perspective situates the human species in the state space of evolution, signifying our existence as a species and as embodied individuals.

show abstract

Section: Preprint -Please Cite the Originalmentioning

confidence: 99%

Augmented Cognition: Life as we don't know it

Hipólito¹

2023

Preprint

View full text Add to dashboard Cite

show abstract

“…Text-conditioned generation The field of text-to-image generation has made significant progress in recent years, mainly using CLIP as a representation extractor. Many works use CLIP to optimize a latent vector in the representation space of a pretrained GAN [10,17,30,37], others utilize CLIP to provide classifier guidance for a pretrained diffusion model [3], and [5] employ CLIP to optimize a Deep Image Prior model [52] that correctly edits an image. Recently, the field has shifted from employing CLIP as a loss network for optimization, and into using it as a backbone in huge generative models [41,45], resulting in impressive photorealistic results.…”

Section: Related Workmentioning

confidence: 99%

“…Since the advent of CLIP [39], training large visionlanguage models (VLMs) has become a prominent paradigm for representation learning in computer vision. By observing huge corpora of paired images and captions crawled from the Web, these models learn a powerful and rich joint image-text embedding space, which have been employed in numerous visual tasks, including classification [60,61], segmentation [28,57], motion generation [49], image captioning [32,50], text-to-image generation [10,30,34,42,46] and image or video editing [3,5,7,17,24,37,54]. Recently, VLMs have also been a key component in text-toimage generative models [4,40,42,45], which rely on their textual representations to encapsulate the rich and semantic meaning of the input text prompt.…”

Section: Introductionmentioning

confidence: 99%

Teaching CLIP to Count to Ten

Paiss¹,

Ephrat²,

Tov³

et al. 2023

Preprint

View full text Add to dashboard Cite

object count. For example, an image depicting three dogs is paired with the caption "Six dogs playing in the yard". Our loss encourages discrimination between the correct caption and its counterfactual variant which serves as a hard negative example. To the best of our knowledge, this work is the first to extend CLIP's capabilities to object counting. Furthermore, we introduce "CountBench" -a new imagetext counting benchmark for evaluating a model's understanding of object counting. We demonstrate a significant improvement over state-of-the-art baseline models on this task. Finally, we leverage our count-aware CLIP model for image retrieval and text-conditioned image generation, demonstrating that our model can produce specific counts of objects more reliably than existing ones.

show abstract

“…Patashnik et al [44] adopt the CLIP model for semantic alignment between text and image, and propose mapping the text prompts to inputagnostic directions in StyleGAN's style space, achieving interactive text-driven image manipulation. Text2LIVE [45] introduces an edit layer to composite the generation results with the image to preserve the information. The edit layer is directly predicted by a U-Net model.…”

Section: Related Workmentioning

confidence: 99%

ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation

Wang

et al. 2022

2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

View full text Add to dashboard Cite

Existing text-guided image manipulation methods aim to modify the appearance of the image or to edit a few objects in a virtual or simple scenario, which is far from practical applications. In this work, we study a novel task on text-guided image manipulation on the entity level in the real world (eL-TGIM). The task imposes three basic requirements, (1) to edit the entity consistent with the text descriptions, (2) to preserve the entity-irrelevant regions, and (3) to merge the manipulated entity into the image naturally. To this end, we propose an elegant framework, dubbed as SeMani, forming the Semantic Manipulation of real-world images that can not only edit the appearance of entities but also generate new entities corresponding to the text guidance. To solve eL-TGIM, SeMani decomposes the task into two phases: the semantic alignment phase and the image manipulation phase. In the semantic alignment phase, SeMani incorporates a semantic alignment module to locate the entity-relevant region to be manipulated. In the image manipulation phase, SeMani adopts a generative model to synthesize new images conditioned on the entity-irrelevant regions and target text descriptions. We discuss and propose two popular generation processes that can be utilized in SeMani, the discrete auto-regressive generation with transformers and the continuous denoising generation with diffusion models, yielding SeMani-Trans and SeMani-Diff, respectively. We conduct extensive experiments on the real datasets CUB, Oxford, and COCO datasets to verify that SeMani can distinguish the entity-relevant and -irrelevant regions and achieve more precise and flexible manipulation in a zero-shot manner compared with baseline methods. Our codes and models will be released at https://github.com/Yikai-Wang/SeMani.

show abstract

Text2LIVE: Text-Driven Layered Image and Video Editing

Cited by 109 publications

References 57 publications

Augmented Cognition: Life as we don't know it

Augmented Cognition: Life as we don't know it

Teaching CLIP to Count to Ten

ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation

Contact Info

Product

Resources

About