2024
DOI: 10.1109/lsp.2023.3340103
GPTSee: Enhancing Moment Retrieval and Highlight Detection via Description-Based Similarity Features

Yunzhuo Sun, Yifang Xu, Zien Xie, et al.
Cited by 2 publications (2 citation statements)
References 16 publications
“…For fully-supervised VTG, prior works [1], [9]-[11], [13]-[16] typically employ encoders to extract visual and textual features, followed by designing a VTG model (e.g., transformer encoder-decoder) to interact and align two modalities, as depicted in Figure 1b. UniVTG [13] designs a multi-modal and multi-task learning pipeline, undergoing pretraining or fine-tuning on dozens of datasets.…”
Section: Video Temporal Grounding (mentioning)
confidence: 99%
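The fully-supervised pipeline summarized in this citation statement (pre-extracted visual and textual features fed to a transformer encoder-decoder that aligns the two modalities and predicts moments) can be sketched as below. All module names, dimensions, and the span-prediction head are illustrative assumptions, not the implementation of any cited work.

```python
# Minimal sketch of a fully-supervised VTG model: project pre-extracted
# visual/textual features into a shared space, fuse them with a transformer
# encoder, and decode learnable moment queries into (center, width) spans.
import torch
import torch.nn as nn

class SimpleVTG(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=512, d_model=256, num_queries=10):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.txt_proj = nn.Linear(txt_dim, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        # Learnable moment queries, decoded into normalized (center, width) spans.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.span_head = nn.Linear(d_model, 2)

    def forward(self, vis_feats, txt_feats):
        # vis_feats: (B, T, vis_dim) clip features; txt_feats: (B, L, txt_dim) token features.
        src = torch.cat([self.vis_proj(vis_feats), self.txt_proj(txt_feats)], dim=1)
        tgt = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        hs = self.transformer(src, tgt)
        return self.span_head(hs).sigmoid()  # (B, num_queries, 2)

# Usage with dummy features:
model = SimpleVTG()
spans = model(torch.randn(2, 75, 512), torch.randn(2, 20, 512))
print(spans.shape)  # torch.Size([2, 10, 2])
```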
“…UniVTG [13] designs a multi-modal and multi-task learning pipeline, undergoing pretraining or fine-tuning on dozens of datasets. To accelerate the training convergence of VTG, GPTSee [14] introduces LLMs to generate prior positional information for the transformer decoder. However, these supervised approaches inevitably rely on extensive human-annotated data and training resources.…”
Section: Video Temporal Grounding (mentioning)
confidence: 99%
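The idea attributed to GPTSee in this statement, deriving prior positional information for the transformer decoder from description-based similarity, could look roughly like the following sketch: compare embeddings of LLM-generated per-frame descriptions against the query embedding and turn the high-similarity region into a coarse (center, width) prior. The embedding source, threshold, and prior format are assumptions for illustration only, not GPTSee's actual method.

```python
# Hedged sketch: convert query-vs-frame-description similarities into a coarse
# normalized (center, width) positional prior that could seed decoder queries.
import torch

def positional_prior(frame_desc_emb, query_emb, threshold=0.5):
    """frame_desc_emb: (T, D) embeddings of per-frame descriptions; query_emb: (D,)."""
    sims = torch.cosine_similarity(frame_desc_emb, query_emb.unsqueeze(0), dim=-1)  # (T,)
    hits = (sims >= threshold).nonzero().flatten()
    if hits.numel() == 0:
        return torch.tensor([0.5, 1.0])  # no confident frames: fall back to a whole-video prior
    start, end = hits.min().item(), hits.max().item()
    T = frame_desc_emb.size(0)
    center = (start + end + 1) / (2 * T)
    width = (end - start + 1) / T
    return torch.tensor([center, width])

# Example: 100 frames, 512-dim description/query embeddings.
prior = positional_prior(torch.randn(100, 512), torch.randn(512))
print(prior)
```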