2023
DOI: 10.4018/ijswis.332768

Query-Guided Refinement and Dynamic Spans Network for Video Highlight Detection and Temporal Grounding in Online Information Systems

Yifang Xu, Yunzhuo Sun, Zien Xie, et al.

Abstract: With the surge in online video content, finding highlights and key video segments has garnered widespread attention. Given a textual query, video highlight detection (HD) and temporal grounding (TG) aim to predict frame-wise saliency scores from a video while concurrently locating all query-relevant spans. Despite recent progress in DETR-based works, these methods crudely fuse different inputs in the encoder, which limits effective cross-modal interaction. To address this challenge, the authors design QD-Net (query-guided refinement and dynamic spans network). …
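The task setup the abstract describes is easiest to picture as a model interface: a video and a textual query go in, and per-frame saliency scores plus candidate spans come out. Below is a minimal, hedged sketch of such a joint HD + TG interface in PyTorch; every module name, shape, and design choice here is an illustrative assumption, not the authors' QD-Net.

```python
# Illustrative sketch only: a toy joint HD + TG interface, NOT the
# authors' QD-Net. All names and shapes are assumptions.
import torch
import torch.nn as nn

class ToyHighlightGrounder(nn.Module):
    """Given per-frame video features and query-token features, predict
    (a) frame-wise saliency scores (highlight detection) and
    (b) candidate spans with confidences (temporal grounding)."""

    def __init__(self, dim: int = 256, num_spans: int = 10):
        super().__init__()
        # Cross-modal fusion: video frames attend to query tokens.
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.saliency_head = nn.Linear(dim, 1)            # one score per frame
        self.span_queries = nn.Parameter(torch.randn(num_spans, dim))
        self.span_head = nn.Linear(dim, 2)                # (center, width) in [0, 1]
        self.conf_head = nn.Linear(dim, 1)                # confidence per span

    def forward(self, video_feats: torch.Tensor, query_feats: torch.Tensor):
        # video_feats: (B, T, dim); query_feats: (B, L, dim)
        fused, _ = self.fuse(video_feats, query_feats, query_feats)
        saliency = self.saliency_head(fused).squeeze(-1)  # (B, T)
        # DETR-style learned span queries decode spans from the fused video.
        q = self.span_queries.unsqueeze(0).expand(video_feats.size(0), -1, -1)
        decoded, _ = self.fuse(q, fused, fused)           # reuse attention as a stand-in decoder
        spans = self.span_head(decoded).sigmoid()         # (B, num_spans, 2)
        conf = self.conf_head(decoded).squeeze(-1)        # (B, num_spans)
        return saliency, spans, conf

# Usage: batch of 2 videos with 75 frames and a 12-token query.
model = ToyHighlightGrounder()
saliency, spans, conf = model(torch.randn(2, 75, 256), torch.randn(2, 12, 256))
print(saliency.shape, spans.shape, conf.shape)
# torch.Size([2, 75]) torch.Size([2, 10, 2]) torch.Size([2, 10])
```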

Cited by 2 publications (2 citation statements)
References 40 publications
“…For fully-supervised VTG, prior works [1, 9–11, 13–16] typically employ encoders to extract visual and textual features, followed by designing a VTG model (e.g., transformer encoder-decoder) to interact and align two modalities, as depicted in Figure 1b. UniVTG [13] designs a multi-modal and multi-task learning pipeline, undergoing pretraining or fine-tuning on dozens of datasets.…”
Section: Video Temporal Grounding
confidence: 99%
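For context, the generic recipe this quote describes (pre-extracted visual and textual features fed to a transformer encoder-decoder that interacts and aligns the two modalities) can be sketched as below. Notably, the concatenate-then-encode fusion shown here is the coarse cross-modal interaction the abstract says QD-Net aims to improve. This is a hedged illustration of the common pipeline under assumed names and shapes, not UniVTG's or QD-Net's actual code.

```python
# Hedged sketch of the generic fully-supervised VTG recipe quoted above:
# pre-extracted features -> transformer encoder (interaction/alignment)
# -> transformer decoder (span prediction). Not UniVTG's implementation.
import torch
import torch.nn as nn

class GenericVTGModel(nn.Module):
    def __init__(self, dim: int = 256, num_spans: int = 10):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.span_queries = nn.Parameter(torch.randn(num_spans, dim))
        self.span_head = nn.Linear(dim, 2)  # normalized (start, end) per span

    def forward(self, video_feats: torch.Tensor, text_feats: torch.Tensor):
        # Interaction/alignment: jointly encode the concatenated modalities.
        memory = self.encoder(torch.cat([video_feats, text_feats], dim=1))
        q = self.span_queries.unsqueeze(0).expand(video_feats.size(0), -1, -1)
        return self.span_head(self.decoder(q, memory)).sigmoid()  # (B, num_spans, 2)

# Usage with dummy pre-extracted features (e.g., from frozen encoders):
spans = GenericVTGModel()(torch.randn(2, 75, 256), torch.randn(2, 12, 256))
print(spans.shape)  # torch.Size([2, 10, 2])
```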
“…Existing VTG methods [1, 9–12] primarily adopt supervised learning, which demands massive training resources and numerous annotated video-query pairs, as illustrated in Figure 1b. However, developing datasets for VTG is time-consuming and expensive; for instance, Moment-DETR [1] spent 1455 person-hours and USD 16,600 to create the QVHighlights dataset.…”
Section: Introduction
confidence: 99%