Yoav Shalev scite author profile

Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences. While such models can provide a powerful score for matching and subsequent zero-shot tasks, they are not capable of generating a caption given an image. In this work, we repurpose such models to generate a descriptive text given an image at inference time, without any further training or tuning step. This is done by combining the visual-semantic model with a large language model, benefiting from the knowledge in both web-scale models. The resulting captions are much less restrictive than those obtained by supervised captioning methods. Moreover, as a zero-shot learning method, it is extremely flexible, and we demonstrate its ability to perform image arithmetic in which the inputs can be either images or text and the output is a sentence. This enables novel high-level vision capabilities such as comparing two images or solving visual analogy tests. Our code is available at: https://github.com/YoadTew/zero-shot-image-to-text.
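The image arithmetic mentioned in this abstract can be illustrated with a toy sketch. It assumes images and text share one embedding space, so that vectors can be added and subtracted. The three-dimensional embeddings, the names `EMB` and `CANDIDATES`, and the helper functions are all hypothetical. Ranking a fixed list of candidate sentences stands in for the language-model-guided generation the abstract describes, so this is not the authors' method.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def combine(base, minus, plus):
    """Embedding arithmetic: base - minus + plus, element by element."""
    return [b - m + p for b, m, p in zip(base, minus, plus)]

def best_caption(query, candidates):
    """Return the candidate sentence whose embedding is closest to the query."""
    return max(candidates, key=lambda text: cosine(query, candidates[text]))

# Hypothetical toy embeddings; the axes loosely mean (royalty, male, female).
EMB = {
    "image_of_king": [1.0, 1.0, 0.0],
    "text_man": [0.0, 1.0, 0.0],
    "text_woman": [0.0, 0.0, 1.0],
}
# Hypothetical candidate sentences with their embeddings.
CANDIDATES = {
    "a queen": [1.0, 0.0, 1.0],
    "a king": [1.0, 1.0, 0.0],
    "a woman": [0.0, 0.0, 1.0],
}

# Inputs mix an image and two text prompts; the output is a sentence.
query = combine(EMB["image_of_king"], EMB["text_man"], EMB["text_woman"])
answer = best_caption(query, CANDIDATES)
print(answer)
```

In a real system the embeddings would come from the visual-semantic model, and the sentence would be generated rather than selected from a list.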

End to End Lip Synchronization with a Temporal AutoEncoder

Shalev, Wolf (2020)

Zero-Shot Video Captioning with Evolving Pseudo-Tokens

Tewel, Shalev, Nadler et al. (2022), preprint

End to End Lip Synchronization with a Temporal AutoEncoder

Shalev, Lior (2022), preprint

We study the problem of syncing the lip movement in a video with the audio stream. Our solution finds an optimal alignment using a dual-domain recurrent neural network that is trained on synthetic data we generate by dropping and duplicating video frames. Once the alignment is found, we modify the video in order to sync the two sources. Our method is shown to greatly outperform existing methods from the literature on a variety of existing and new benchmarks. As an application, we demonstrate our ability to robustly align text-to-speech generated audio with an existing video stream. Our code and samples are available at https://github.com/itsyoavshalev/End-to-End-Lip-Synchronization-with-a-Temporal-AutoEncoder.
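The synthetic training data mentioned in this abstract, made by dropping and duplicating video frames, can be sketched roughly as follows. The function name `distort_frames`, the per-frame probabilities, and the seeding are assumptions made for illustration. The abstract does not describe the actual generation procedure, so this is only one plausible reading.

```python
import random

def distort_frames(num_frames, drop_prob, dup_prob, seed=0):
    """Simulate a misaligned video by dropping and duplicating frames.

    Returns a list in which entry i is the index of the original frame
    shown at position i of the distorted video. That list doubles as a
    ground-truth alignment target for training.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    distorted = []
    for idx in range(num_frames):
        r = rng.random()
        if r < drop_prob:
            continue  # frame is dropped
        distorted.append(idx)
        if r < drop_prob + dup_prob:
            distorted.append(idx)  # frame is shown twice
    return distorted

# Hypothetical settings: 10 frames, 20% dropped, 20% duplicated.
mapping = distort_frames(10, drop_prob=0.2, dup_prob=0.2, seed=42)
print(mapping)
```

Because the audio is left untouched, each distorted clip comes with a known correct alignment, which is what makes training on such data supervised.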

ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic
