Zero-Shot Video Captioning with Evolving Pseudo-Tokens
2022 | Preprint
DOI: 10.48550/arxiv.2207.11100

Cited by 3 publications (7 citation statements)
References 0 publications
“…We use a zero-shot captioning model that generates captions directly from the given video data. For the zero-shot captioning model, we utilized a model [28] that combines a vision transformer [29] and GPT-2 [30]. Furthermore, we employed a sentence transformer [31] to extract embedding vectors for each caption.…”
Section: Caption Generation (mentioning)
Confidence: 99%
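To make the caption-embedding step in the statement above concrete, here is a minimal Python sketch, assuming the sentence-transformers library is what [31] refers to; the checkpoint name and the caption strings are illustrative placeholders, not details given in the citing paper.

from sentence_transformers import SentenceTransformer

# Captions as they might come out of the upstream zero-shot (ViT + GPT-2) captioner;
# these strings are placeholders for illustration only.
captions = [
    "a person is playing a guitar on stage",
    "a dog runs across a grassy field",
]

# Checkpoint name is an assumption; the citing paper does not specify one.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(captions)  # array of shape (num_captions, embedding_dim)
print(embeddings.shape)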
“…A zero-shot captioning model [28] is used for generating captions, and [31] is employed as the model for extracting caption embeddings. Among these models, ResNet50 [37], which extracts features from the video, is the one that undergoes training.…”
Section: B. Experiments Setup, 1) Model Setting (mentioning)
Confidence: 99%
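As a rough illustration of this setup, the sketch below keeps a torchvision ResNet50 trainable as the per-frame feature extractor, matching the statement that it is the only component that undergoes training; the frame count, resolution, and temporal average pooling are assumptions made here for illustration.

import torch
from torchvision.models import resnet50, ResNet50_Weights

# ResNet50 is the only trainable component in the cited setup; the captioning and
# embedding models stay frozen elsewhere in the pipeline.
backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # expose the 2048-d pooled feature per frame
backbone.train()

frames = torch.randn(16, 3, 224, 224)   # 16 sampled frames (illustrative shape)
per_frame = backbone(frames)             # (16, 2048) per-frame features
video_feature = per_frame.mean(dim=0)    # simple temporal average pooling (assumption)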
“…We tried the newly available GPT-powered captioning models [25]. Although gaining the most attention recently, model hallucinations of the GPT-powered ones introduce factual elements to the content -after processing a video of a math class, the model placed too much emphasis on the teacher and generated discussion about his religion and race.…”
Section: Video Captioning (mentioning)
Confidence: 99%
“…Li et al (2023a) pretrain additional lightweight modules that bridge the frozen image encoder and LLMs to eliminate the modality gap between the two frozen pretrained models. Tewel et al (2022) connect the frozen image encoder with the frozen language decoder and evolve additional pseudo tokens during inference time to perform the video captioning task. Recently, there have been efforts to integrate these two different approaches.…”
Section: Related Work (mentioning)
Confidence: 99%
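To illustrate the mechanism this statement attributes to Tewel et al. (2022), here is a minimal PyTorch sketch of inference-time pseudo-token evolution: learnable embeddings are prepended to a frozen GPT-2's inputs and updated by gradient steps. The loss below is a placeholder; the actual method scores candidates against the video frames with a frozen CLIP encoder, which is not reproduced here, and the pseudo-token count, prompt, and learning rate are illustrative choices.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
for p in lm.parameters():
    p.requires_grad_(False)  # the language decoder stays frozen

num_pseudo = 5  # number of pseudo-tokens is an illustrative choice
pseudo = torch.randn(1, num_pseudo, lm.config.n_embd, requires_grad=True)
optimizer = torch.optim.AdamW([pseudo], lr=0.1)

# Embed a short textual prompt; only the pseudo-token embeddings receive gradients.
prompt_embeds = lm.transformer.wte(
    tokenizer("A video of", return_tensors="pt").input_ids
)

def placeholder_loss(logits):
    # Stand-in objective; the real method uses a CLIP-guided score over video frames.
    return -logits.log_softmax(dim=-1).max(dim=-1).values.mean()

for step in range(10):  # evolve the pseudo-tokens at inference time
    out = lm(inputs_embeds=torch.cat([pseudo, prompt_embeds], dim=1))
    loss = placeholder_loss(out.logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()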