Abstract: This paper targets the task of language-based video moment localization. The language-based setting of this task allows for an open set of target activities, resulting in large variation in the temporal lengths of video moments. Most existing methods first sample sufficient candidate moments with various temporal lengths and then match them with the given query to determine the target moment. However, candidate moments generated with a fixed temporal granularity may be suboptimal to handle the lar…
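The visible portion of this abstract describes the standard proposal-based pipeline: sample candidate moments at several temporal lengths, then match each against the query. Below is a minimal sketch of such multi-scale candidate generation; the window lengths and stride ratio are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of multi-scale sliding-window candidate generation
# for proposal-based moment localization. Window lengths and stride ratio
# are illustrative assumptions.
from typing import List, Sequence, Tuple

def generate_candidates(num_clips: int,
                        window_lengths: Sequence[int] = (4, 8, 16, 32),
                        stride_ratio: float = 0.5) -> List[Tuple[int, int]]:
    """Return (start, end) clip indices of sliding-window proposals."""
    candidates = []
    for length in window_lengths:
        if length > num_clips:
            continue  # skip scales longer than the video itself
        stride = max(1, int(length * stride_ratio))
        for start in range(0, num_clips - length + 1, stride):
            candidates.append((start, start + length))
    return candidates

# Example: a 64-clip video yields proposals at four temporal granularities.
print(len(generate_candidates(64)))
```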
“…Cross-lingual cross-modal retrieval (CCR) has been garnering increased attention among researchers, as it enables the retrieval of images or videos with a non-English query without relying on human-labeled vision-target-language data (Li et al. 2023; Wu et al. 2023). This mitigates the constraints of conventional English-centric cross-modal retrieval tasks (Dong et al. 2022a; Zheng et al. 2023) and offers an efficient and cost-effective solution for target-language retrieval, greatly reducing the need for human-labeled data. In terms of model architecture, there are two broad directions for conducting CCR.…”
Cross-lingual cross-modal retrieval has garnered increasing attention recently; it aims to align vision and a target language (V-T) without using any annotated V-T data pairs. Current methods employ machine translation (MT) to construct pseudo-parallel data pairs, which are then used to learn a multi-lingual, multi-modal embedding space that aligns visual and target-language representations. However, the large heterogeneity gap between vision and text, along with the noise present in target-language translations, makes it challenging to align their representations effectively. To address these challenges, we propose a general framework, Cross-Lingual to Cross-Modal (CL2CM), which improves the alignment between vision and the target language using cross-lingual transfer. This approach allows us to fully leverage the merits of multi-lingual pre-trained models (e.g., mBERT) and the benefit of operating within a single modality, i.e., a smaller gap, to provide reliable and comprehensive semantic correspondence (knowledge) for the cross-modal network. We evaluate our proposed approach on two multilingual image-text datasets, Multi30K and MSCOCO, and one video-text dataset, VATEX. The results clearly demonstrate the effectiveness of our proposed method and its high potential for large-scale retrieval.
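The abstract does not spell out CL2CM's training objective, but learning a joint embedding space from pseudo-parallel vision/target-language pairs is typically driven by a symmetric contrastive loss. The sketch below shows that generic formulation in PyTorch; it is an assumed instantiation, not CL2CM's exact loss.

```python
# Illustrative symmetric InfoNCE loss over a batch of matched
# vision/target-language embedding pairs. Generic formulation only;
# the paper's actual objective may differ.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(vis: torch.Tensor,
                               txt: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """vis, txt: (batch, dim) embeddings of matched pairs."""
    vis = F.normalize(vis, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = vis @ txt.t() / temperature          # pairwise similarities
    targets = torch.arange(vis.size(0), device=vis.device)
    # Matched pairs sit on the diagonal; contrast in both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```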
“…Therefore, the main challenge in such a setting is how to align multi-modal features well enough to predict precise boundaries. Some works (Gao et al. 2017; Zhang et al. 2019; Yuan et al. 2019; Zhang et al. 2020b; Chen et al. 2018; Qu et al. 2020; Fang et al. 2020, 2021a,b, 2022, 2023b,c; Fang and Hu 2020; Liu et al. 2022d, 2023b,c,d; Zheng et al. 2023; Zhu et al. 2023) integrate sentence information with each fine-grained video clip unit and predict the scores of candidate segments by gradually merging the fused feature sequence over time. Without using proposals, some recent methods (Nan et al. 2021; Zhang et al. 2020a; Chen et al. 2020a) leverage the interaction between video and sentence to directly predict the starting and ending frames.…”
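The second direction mentioned in this snippet, proposal-free localization, directly predicts start and end positions from the fused video-query features. A minimal sketch of such a boundary head follows; the shapes and module layout are assumptions for illustration, not the cited methods' architectures.

```python
# Minimal sketch of proposal-free boundary prediction: from a fused
# video-query feature sequence, predict per-clip start/end logits and
# pick the highest-scoring valid span. Shapes are assumptions.
import torch
import torch.nn as nn

class BoundaryHead(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.start_head = nn.Linear(dim, 1)
        self.end_head = nn.Linear(dim, 1)

    def forward(self, fused: torch.Tensor):
        # fused: (batch, num_clips, dim) video features conditioned on query
        start_logits = self.start_head(fused).squeeze(-1)  # (batch, T)
        end_logits = self.end_head(fused).squeeze(-1)      # (batch, T)
        return start_logits, end_logits

head = BoundaryHead()
s, e = head(torch.randn(2, 64, 256))
# Joint span score: outer sum of start/end logits, masked to start <= end.
span = s.unsqueeze(2) + e.unsqueeze(1)                     # (batch, T, T)
span = span.masked_fill(torch.ones(64, 64).triu() == 0, float('-inf'))
best = span.flatten(1).argmax(-1)
start_idx, end_idx = best // 64, best % 64
```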
Temporal sentence localization (TSL) aims to localize a target segment in a video according to a given sentence query. Although prior works have made decent progress on this task, they rely heavily on abundant yet expensive manual annotations for training. Moreover, these data-dependent models usually cannot generalize well to unseen scenarios because of the inherent domain shift. To alleviate this issue, in this paper we target a more practical but challenging setting: unsupervised domain-adaptive temporal sentence localization (UDA-TSL), which explores whether localization knowledge can be transferred from a fully annotated data domain (source domain) to a new unannotated data domain (target domain). In particular, we propose an effective and novel baseline for UDA-TSL to bridge the multi-modal gap across domains and learn the potential correspondence between video-query pairs in the target domain. We first develop separate modality-specific domain adaptation modules to smoothly balance the minimization of the domain shifts in the cross-dataset video and query domains. Then, to fully exploit the semantic correspondence of both modalities in the target domain for unsupervised localization, we devise a mutual information learning module that adaptively aligns the video-query pairs most likely to be relevant in the target domain, yielding more truly aligned target pairs and ensuring the discriminability of target features. In this way, our model learns domain-invariant and semantically aligned cross-modal representations. Three sets of transfer experiments show that our model achieves competitive performance compared to existing methods.
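The abstract does not detail how the modality-specific domain adaptation modules work. One standard way to realize such a module is a DANN-style gradient reversal layer feeding a domain classifier, sketched below; this is an assumed instantiation, and the paper's exact mechanism may differ.

```python
# DANN-style domain adaptation module: a gradient reversal layer makes the
# feature extractor adversarial to a domain classifier. Assumed
# instantiation, not necessarily the paper's exact design.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip (and scale) the gradient on the way back.
        return -ctx.lambd * grad_output, None

class DomainAdapter(nn.Module):
    """Predicts source vs. target domain from (gradient-reversed) features."""
    def __init__(self, dim: int = 256, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        reversed_feats = GradReverse.apply(feats, self.lambd)
        return self.classifier(reversed_feats)  # domain logits

# Separate adapters per modality, as the abstract's "modality-specific
# domain adaptation modules" suggests.
video_adapter, query_adapter = DomainAdapter(), DomainAdapter()
```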
“…In practice, VMR is an extremely challenging task because the desired model should (i) cover various moment lengths in multiple scenarios; (ii) bridge the semantic gap between different modalities (video and query); and (iii) understand the semantic details of each modality to extract modality-invariant features for optimal retrieval. Most previous VMR works (Zheng et al. 2023; Shen et al. 2023; Yang et al. 2022; Dong et al. 2022a,b,c, 2023b; Sun et al. 2023; Ma et al. 2020; Liu et al. 2018; Ge et al. 2019; Zhang et al. 2019a; Qu et al. 2021, 2023; Wen et al. 2021, 2023a,b) adopt a fully supervised setting, where each frame is manually labeled as query-relevant or not. Therefore, the main challenge in such a setting is how to align multi-modal features well enough to predict precise moment boundaries.…”
Given an untrimmed video and a sentence query, video moment retrieval using language (VMR) aims to locate the target query-relevant moment. Since untrimmed videos are overlong, almost all existing VMR methods first sparsely down-sample each untrimmed video into multiple fixed-length clips and then conduct multi-modal interactions between the query feature and expensive clip features for reasoning, which is infeasible for long real-world videos that span hours. Because the video is down-sampled into fixed-length clips, some query-related frames may be filtered out; this blurs the specific boundary of the target moment and turns adjacent irrelevant frames into new boundaries, easily leading to cross-modal misalignment and introducing both boundary bias and reasoning bias. To this end, in this paper we propose an efficient approach, SpotVMR, that trims the query-relevant clip. Moreover, SpotVMR can serve as a plug-and-play module, bringing efficiency to state-of-the-art VMR methods while maintaining good retrieval performance. Specifically, we first design a novel clip search model that learns to identify promising video regions to search, conditioned on the language query. Then, we introduce a set of low-cost semantic indexing features that capture the context of objects and interactions suggesting where to search for the query-relevant moment. We also employ a distillation loss to address the optimization issues arising from end-to-end joint training of the clip selector and the VMR model.
Extensive experiments on three challenging datasets demonstrate its effectiveness.
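The abstract names a distillation loss for joint training of the clip selector and the VMR model but does not specify its form. A common choice is temperature-scaled KL divergence between a teacher's and a student's moment distributions, sketched below; the temperature and the teacher/student pairing are assumptions, not values from the paper.

```python
# Sketch of a standard knowledge-distillation loss: the student (run on
# selected clips) matches a teacher's softened moment distribution.
# Temperature and pairing are assumptions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between softened teacher and student predictions."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean") * (t * t)

loss = distillation_loss(torch.randn(4, 100), torch.randn(4, 100))
```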