Zero-Shot Video Captioning with Evolving Pseudo-Tokens
2022 | Preprint
DOI: 10.48550/arxiv.2207.11100

Cited by 3 publications (7 citation statements)
References 0 publications
“…We use a zero-shot captioning model that generates captions directly from the given video data. For the zero-shot captioning model, we utilized a model [28] that combines a vision transformer [29] and GPT-2 [30]. Furthermore, we employed a sentence transformer [31] to extract embedding vectors for each caption.…”
Section: Caption Generation (mentioning)
Confidence: 99%
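To make the caption-embedding step in the statement above concrete, here is a minimal Python sketch, assuming the sentence-transformers library is what [31] refers to; the checkpoint name and the caption strings are illustrative placeholders, not details given in the citing paper.

from sentence_transformers import SentenceTransformer

# Captions as they might come out of the upstream zero-shot (ViT + GPT-2) captioner;
# these strings are placeholders for illustration only.
captions = [
    "a person is playing a guitar on stage",
    "a dog runs across a grassy field",
]

# Checkpoint name is an assumption; the citing paper does not specify one.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(captions)  # array of shape (num_captions, embedding_dim)
print(embeddings.shape)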
“…A zero-shot captioning model [28] is used for generating captions, and [31] is employed as the model for extracting caption embeddings. Among these models, ResNet50 [37], which extracts features from the video, is the one that undergoes training.…”
Section: B. Experiments Setup, 1) Model Setting (mentioning)
Confidence: 99%
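As a rough illustration of this setup, the sketch below keeps a torchvision ResNet50 trainable as the per-frame feature extractor, matching the statement that it is the only component that undergoes training; the frame count, resolution, and temporal average pooling are assumptions made here for illustration.

import torch
from torchvision.models import resnet50, ResNet50_Weights

# ResNet50 is the only trainable component in the cited setup; the captioning and
# embedding models stay frozen elsewhere in the pipeline.
backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # expose the 2048-d pooled feature per frame
backbone.train()

frames = torch.randn(16, 3, 224, 224)   # 16 sampled frames (illustrative shape)
per_frame = backbone(frames)             # (16, 2048) per-frame features
video_feature = per_frame.mean(dim=0)    # simple temporal average pooling (assumption)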
“…We tried the newly available GPT-powered captioning models [25]. Although gaining the most attention recently, model hallucinations of the GPT-powered ones introduce factual elements to the content -after processing a video of a math class, the model placed too much emphasis on the teacher and generated discussion about his religion and race.…”
Section: Video Captioning (mentioning)
Confidence: 99%
“…Li et al (2023a) pretrain additional lightweight modules that bridge the frozen image encoder and LLMs to eliminate the modality gap between the two frozen pretrained models. Tewel et al (2022) connect the frozen image encoder with the frozen language decoder and evolve additional pseudo tokens during inference time to perform the video captioning task. Recently, there have been efforts to integrate these two different approaches.…”
Section: Related Work (mentioning)
Confidence: 99%
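To illustrate the mechanism this statement attributes to Tewel et al. (2022), here is a minimal PyTorch sketch of inference-time pseudo-token evolution: learnable embeddings are prepended to a frozen GPT-2's inputs and updated by gradient steps. The loss below is a placeholder; the actual method scores candidates against the video frames with a frozen CLIP encoder, which is not reproduced here, and the pseudo-token count, prompt, and learning rate are illustrative choices.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
for p in lm.parameters():
    p.requires_grad_(False)  # the language decoder stays frozen

num_pseudo = 5  # number of pseudo-tokens is an illustrative choice
pseudo = torch.randn(1, num_pseudo, lm.config.n_embd, requires_grad=True)
optimizer = torch.optim.AdamW([pseudo], lr=0.1)

# Embed a short textual prompt; only the pseudo-token embeddings receive gradients.
prompt_embeds = lm.transformer.wte(
    tokenizer("A video of", return_tensors="pt").input_ids
)

def placeholder_loss(logits):
    # Stand-in objective; the real method uses a CLIP-guided score over video frames.
    return -logits.log_softmax(dim=-1).max(dim=-1).values.mean()

for step in range(10):  # evolve the pseudo-tokens at inference time
    out = lm(inputs_embeds=torch.cat([pseudo, prompt_embeds], dim=1))
    loss = placeholder_loss(out.logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()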