2023 · DOI: 10.1145/3543857

Progressive Localization Networks for Language-Based Moment Localization

Abstract: This paper targets the task of language-based video moment localization. The language-based setting of this task allows for an open set of target activities, resulting in a large variation of the temporal lengths of video moments. Most existing methods prefer to first sample sufficient candidate moments with various temporal lengths, and then match them with the given query to determine the target moment. However, candidate moments generated with a fixed temporal granularity may be suboptimal to handle the lar…
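The candidate-then-match pipeline the abstract critiques can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the paper's actual model: the function names (generate_candidates, score_candidates), the window sizes, the stride, and the dummy features are all hypothetical, and real systems use learned cross-modal matching rather than raw cosine similarity.

import torch
import torch.nn.functional as F

def generate_candidates(num_clips, window_sizes=(4, 8, 16), stride=2):
    # Enumerate (start, end) clip indices via multi-scale sliding windows;
    # the fixed set of window sizes is the "fixed temporal granularity"
    # the abstract refers to.
    candidates = []
    for w in window_sizes:
        for s in range(0, num_clips - w + 1, stride):
            candidates.append((s, s + w))
    return candidates

def score_candidates(clip_feats, query_feat, candidates):
    # Score each candidate by cosine similarity between its mean-pooled
    # clip features and the sentence embedding (a stand-in for a learned
    # matching module).
    scores = [F.cosine_similarity(clip_feats[s:e].mean(dim=0), query_feat, dim=0)
              for s, e in candidates]
    return torch.stack(scores)

clip_feats = torch.randn(64, 512)   # 64 video clips, 512-d features (dummy data)
query_feat = torch.randn(512)       # sentence embedding (dummy data)
cands = generate_candidates(num_clips=64)
best_start, best_end = cands[int(score_candidates(clip_feats, query_feat, cands).argmax())]
print(best_start, best_end)

Because every candidate shares the same granularity grid, moments much shorter or longer than the chosen windows are matched poorly, which motivates the progressive localization approach of the paper.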

Cited by 15 publications (5 citation statements) · References 60 publications
“…Cross-lingual cross-modal retrieval has been garnering increased attention among researchers, as it enables the retrieval of images or videos with a non-English query, without relying on human-labeled vision-target-language data (Li et al. 2023; Wu et al. 2023). This mitigates the constraints of conventional cross-modal retrieval tasks (Dong et al. 2022a; Zheng et al. 2023) centered on English, and offers an efficient and cost-effective solution for target-language retrieval, greatly reducing the need for human-labeled data. In terms of model architecture, there are two main directions for conducting CCR.…”
Section: Cross-lingual Cross-modal Retrieval
confidence: 99%
“…Therefore, the main challenge in such a setting is how to align multi-modal features well enough to predict precise boundaries. Some works (Gao et al. 2017; Zhang et al. 2019; Yuan et al. 2019; Zhang et al. 2020b; Chen et al. 2018; Qu et al. 2020; Fang et al. 2020, 2021a,b, 2022, 2023b,c; Fang and Hu 2020; Liu et al. 2022d, 2023b,c,d; Zheng et al. 2023; Zhu et al. 2023) integrate sentence information with each fine-grained video clip unit, and predict the scores of candidate segments by gradually merging the fused feature sequence over time. Without using proposals, some recent methods (Nan et al. 2021; Zhang et al. 2020a; Chen et al. 2020a) leverage the interaction between video and sentence to directly predict the starting and ending frames.…”
Section: Related Work
confidence: 99%
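The proposal-free direction named in the statement above can be sketched as follows: fuse the query with every clip feature and predict start/end distributions over time, with no candidate enumeration. The architecture here (a single fusion layer and a two-logit head, with assumed class name BoundaryPredictor) is an illustrative assumption, not the exact model of any cited paper.

import torch
import torch.nn as nn

class BoundaryPredictor(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)  # clip feature + query -> fused feature
        self.head = nn.Linear(dim, 2)        # per-clip start/end logits

    def forward(self, clip_feats, query_feat):
        # clip_feats: (T, d); query_feat: (d,). Broadcast the query over time,
        # fuse it with every clip, and predict start/end distributions over clips.
        q = query_feat.unsqueeze(0).expand_as(clip_feats)
        fused = torch.relu(self.fuse(torch.cat([clip_feats, q], dim=-1)))
        logits = self.head(fused)             # (T, 2)
        return logits[:, 0].softmax(dim=0), logits[:, 1].softmax(dim=0)

model = BoundaryPredictor()
start_prob, end_prob = model(torch.randn(64, 512), torch.randn(512))
print(int(start_prob.argmax()), int(end_prob.argmax()))

The predicted moment is simply the (argmax start, argmax end) pair, which is why boundary precision hinges entirely on how well the multi-modal features are aligned.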
“…In practice, VMR is an extremely challenging task because the desired model should (i) cover various moment lengths in multiple scenarios; (ii) bridge the semantic gap between the different modalities (video and query); and (iii) understand the semantic details of each modality to extract modality-invariant features for optimal retrieval. Most previous VMR works (Zheng et al. 2023; Shen et al. 2023; Yang et al. 2022; Dong et al. 2022a,b,c, 2023b; Sun et al. 2023; Ma et al. 2020; Liu et al. 2018; Ge et al. 2019; Zhang et al. 2019a; Qu et al. 2021, 2023; Wen et al. 2021, 2023a,b) are under the fully-supervised setting, where each frame is manually labeled as query-relevant or not. Therefore, the main challenge in such a setting is how to align multi-modal features well enough to predict precise moment boundaries.…”
Section: Introduction
confidence: 99%