2024
DOI: 10.1109/lsp.2023.3340103
GPTSee: Enhancing Moment Retrieval and Highlight Detection via Description-Based Similarity Features

Yunzhuo Sun, Yifang Xu, Zien Xie, et al.
Cited by 2 publications (2 citation statements)
References 16 publications
“…For fully-supervised VTG, prior works [1], [9]-[11], [13]-[16] typically employ encoders to extract visual and textual features, followed by designing a VTG model (e.g., transformer encoder-decoder) to interact and align two modalities, as depicted in Figure 1b. UniVTG [13] designs a multi-modal and multi-task learning pipeline, undergoing pretraining or fine-tuning on dozens of datasets.…”
Section: Video Temporal Grounding (mentioning)
confidence: 99%
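The fully-supervised pipeline summarized in this citation statement (pre-extracted visual and textual features fed to a transformer encoder-decoder that aligns the two modalities and predicts moments) can be sketched as below. All module names, dimensions, and the span-prediction head are illustrative assumptions, not the implementation of any cited work.

```python
# Minimal sketch of a fully-supervised VTG model: project pre-extracted
# visual/textual features into a shared space, fuse them with a transformer
# encoder, and decode learnable moment queries into (center, width) spans.
import torch
import torch.nn as nn

class SimpleVTG(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=512, d_model=256, num_queries=10):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        # Learnable moment queries, decoded into normalized (center, width) spans.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.span_head = nn.Linear(d_model, 2)

    def forward(self, vis_feats, txt_feats):
        # vis_feats: (B, T, vis_dim) clip features; txt_feats: (B, L, txt_dim) token features.
        src = torch.cat([self.vis_proj(vis_feats), self.txt_proj(txt_feats)], dim=1)
        tgt = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        hs = self.transformer(src, tgt)
        return self.span_head(hs).sigmoid()  # (B, num_queries, 2)

# Usage with dummy features:
model = SimpleVTG()
spans = model(torch.randn(2, 75, 512), torch.randn(2, 20, 512))
print(spans.shape)  # torch.Size([2, 10, 2])
```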
“…UniVTG [13] designs a multi-modal and multi-task learning pipeline, undergoing pretraining or fine-tuning on dozens of datasets. To accelerate the training convergence of VTG, GPTSee [14] introduces LLMs to generate prior positional information for the transformer decoder. However, these supervised approaches inevitably rely on extensive human-annotated data and training resources.…”
Section: Video Temporal Grounding (mentioning)
confidence: 99%
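The idea attributed to GPTSee in this statement, deriving prior positional information for the transformer decoder from description-based similarity, could look roughly like the following sketch: compare embeddings of LLM-generated per-frame descriptions against the query embedding and turn the high-similarity region into a coarse (center, width) prior. The embedding source, threshold, and prior format are assumptions for illustration only, not GPTSee's actual method.

```python
# Hedged sketch: convert query-vs-frame-description similarities into a coarse
# normalized (center, width) positional prior that could seed decoder queries.
import torch

def positional_prior(frame_desc_emb, query_emb, threshold=0.5):
    """frame_desc_emb: (T, D) embeddings of per-frame descriptions; query_emb: (D,)."""
    sims = torch.cosine_similarity(frame_desc_emb, query_emb.unsqueeze(0), dim=-1)  # (T,)
    hits = (sims >= threshold).nonzero().flatten()
    if hits.numel() == 0:
        return torch.tensor([0.5, 1.0])  # no confident frames: fall back to a whole-video prior
    start, end = hits.min().item(), hits.max().item()
    T = frame_desc_emb.size(0)
    center = (start + end + 1) / (2 * T)
    width = (end - start + 1) / T
    return torch.tensor([center, width])

# Example: 100 frames, 512-dim description/query embeddings.
prior = positional_prior(torch.randn(100, 512), torch.randn(512))
print(prior)
```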