STRONG: Spatio-Temporal Reinforcement Learning for Cross-Modal Video Moment Localization

Cao, Da; Zeng, Yawen; Li, Meng; He, Xiangnan; Wang, Meng; Qin, Zheng

doi:10.1145/3394171.3413840

Cited by 39 publications

(27 citation statements)

References 36 publications


“…The action space for each step is a set of handcraft-designed temporal transformations (e.g., shifting, scaling). The typical methods include R-W-M [22], SM-RL [62], TripNet [21], STRONG [2], TSP-PRL [65] and AVMR [3].…”

Section: Reinforcement Learning-based Methods

mentioning

confidence: 99%
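The action space mentioned in the statement above (hand-designed temporal transformations such as shifting and scaling) can be illustrated with a minimal sketch. This is not the implementation of STRONG or of any cited method: the action names, the `step` and `factor` values, and the use of temporal IoU as a quality signal are all assumptions chosen for illustration.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (start, end) in seconds

# Hypothetical action names; real methods define their own sets.
ACTIONS = ("shift_left", "shift_right", "expand", "shrink", "stop")


def clip(seg: Segment, duration: float) -> Segment:
    """Keep the segment inside [0, duration] with start <= end."""
    start = min(max(seg[0], 0.0), duration)
    end = min(max(seg[1], start), duration)
    return (start, end)


def apply_action(seg: Segment, action: str, duration: float,
                 step: float = 0.05, factor: float = 1.2) -> Segment:
    """Apply one temporal transformation to the current boundaries.

    `step` (fraction of the video length used for shifting) and
    `factor` (scaling ratio) are assumed values, not taken from any paper.
    """
    start, end = seg
    delta = step * duration
    center, half = (start + end) / 2, (end - start) / 2
    if action == "shift_left":
        new = (start - delta, end - delta)
    elif action == "shift_right":
        new = (start + delta, end + delta)
    elif action == "expand":
        new = (center - half * factor, center + half * factor)
    elif action == "shrink":
        new = (center - half / factor, center + half / factor)
    elif action == "stop":
        new = seg
    else:
        raise ValueError(f"unknown action: {action}")
    return clip(new, duration)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0
```

An agent in this setting would pick one action per step and could be rewarded when `temporal_iou` against the annotated segment increases; that reward design is likewise an assumption here, not something stated in the citation statement.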

“…Wang et al [62] propose an RNN-based RL model which sequentially observes a selective set of video frames and finally obtains the temporal boundaries given the query. Cao et al [2] firstly leverage the spatial scene tracking task, which utilizes a spatial-level RL for filtering out the information that is not relevant to the text query. The spatial-level RL can enhance the temporal-level RL for adjusting the temporal boundaries of the video.…”

mentioning

confidence: 99%


A Survey on Temporal Sentence Grounding in Videos

Lan, Yuan, Wang et al. 2021

Preprint


Temporal sentence grounding in videos (TSGV), which aims to localize one target segment from an untrimmed video with respect to a given sentence query, has drawn increasing attentions in the research community over the past few years. Different from the task of temporal action localization, TSGV is more flexible since it can locate complicated activities via natural languages, without restrictions from predefined action categories. Meanwhile, TSGV is more challenging since it requires both textual and visual understanding for semantic alignment between two modalities (i.e., text and video). In this survey, we give a comprehensive overview for TSGV, which i) summarizes the taxonomy of existing methods, ii) provides a detailed description of the evaluation protocols (i.e., datasets and metrics) to be used in TSGV, and iii) in-depth discusses potential problems of current benchmarking designs and research directions for further investigations. To the best of our knowledge, this is the first systematic survey on temporal sentence grounding. More specifically, we first discuss existing TSGV approaches by grouping them into four categories, i.e., two-stage methods, end-to-end methods, reinforcement learning-based methods, and weakly supervised methods. Then we present the benchmark datasets and evaluation metrics to assess current research progress. Finally, we discuss some limitations in TSGV through pointing out potential problems improperly resolved in the current evaluation protocols, which may push forwards more cutting edge research in TSGV. Besides, we also share our insights on several promising directions, including three typical tasks with new and practical settings based on TSGV.


“…The task localizes a video segment by a distinct and describable sentence from a video. One kind of dichotomy is late-fusion [1] and early-fusion [6,12,13,23,24,28,34,43,45]. Late-fusion approach computes offline query-agnostic video feature while early-fusion approach computes query-aware video features.…”

Section: Related Work

mentioning

confidence: 99%
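The late-fusion versus early-fusion distinction drawn in the statement above can be made concrete with a toy sketch. Everything here is hypothetical and is not the CONQUER model or any cited method: `encode_clip` stands in for a real video encoder, and the sigmoid gate is just one simple way to make video features depend on the query.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def encode_clip(raw: Vector) -> Vector:
    """Hypothetical video encoder: plain L2 normalisation."""
    norm = math.sqrt(dot(raw, raw)) or 1.0
    return [x / norm for x in raw]


def late_fusion_scores(cached: List[Vector], query: Vector) -> List[float]:
    """Late fusion: clip features were encoded offline, without the query.

    The query only enters at scoring time, so `cached` can be reused
    for every query.
    """
    return [dot(clip, query) for clip in cached]


def early_fusion_scores(raw_clips: List[Vector], query: Vector) -> List[float]:
    """Early fusion: the query modulates the clip before encoding.

    Features are query-aware, so they must be recomputed per query.
    """
    gate = [1.0 / (1.0 + math.exp(-q)) for q in query]  # assumed sigmoid gate
    scores = []
    for raw in raw_clips:
        fused = encode_clip([g * x for g, x in zip(gate, raw)])
        scores.append(dot(fused, query))
    return scores


raw_clips = [[1.0, 0.0], [0.0, 1.0]]
cached = [encode_clip(c) for c in raw_clips]  # computed once, offline
query = [1.0, 0.0]
late = late_fusion_scores(cached, query)
early = early_fusion_scores(raw_clips, query)
```

The trade-off the sketch is meant to show is cost versus expressiveness: the late-fusion path reuses `cached` across queries, while the early-fusion path re-encodes every clip for each query.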

CONQUER: Contextual Query-aware Ranking for Video Corpus Moment Retrieval

Hou, Ngo, Chan 2021

Proceedings of the 29th ACM International Conference on Multimedia


This paper tackles a recently proposed Video Corpus Moment Retrieval task. This task is essential because advanced video retrieval applications should enable users to retrieve a precise moment from a large video corpus. We propose a novel CONtextual QUery-awarE Ranking (CONQUER) model for effective moment localization and ranking. CONQUER explores query context for multi-modal fusion and representation learning in two different steps. The first step derives fusion weights for the adaptive combination of multi-modal video content. The second step performs bi-directional attention to tightly couple video and query as a single joint representation for moment localization. As query context is fully engaged in video representation learning, from feature fusion to transformation, the resulting feature is user-centered and has a larger capacity in capturing multi-modal signals specific to query. We conduct studies on two datasets, TVR for closed-world TV episodes and DiDeMo for open-world user-generated videos, to investigate the potential advantages of fusing video and query online as a joint representation for moment retrieval.

CCS Concepts: Information systems → Multimedia and multimodal retrieval; Video search.


“…Deep learning has achieved great success in the field of multimedia [5,9,17,27] in recent years, due to the advanced learning ability of models, the growing computing capability of machines, and the availability of big data. A learning model fed with sufficient, high-quality data is likely to yield more accurate results.…”

Section: Introduction

mentioning

confidence: 99%

Semantic-Guided Relation Propagation Network for Few-shot Action Recognition

Wang et al. 2021

Proceedings of the 29th ACM International Conference on Multimedia


Few-shot action recognition has drawn growing attention as it can recognize novel action classes by using only a few labeled samples. In this paper, we propose a novel semantic-guided relation propagation network (SRPN), which leverages semantic information together with visual information for few-shot action recognition. Different from most previous works that neglect semantic information in the labeled data, our SRPN directly utilizes the semantic label as an additional supervisory signal to improve the generalization ability of the network. Besides, we treat the relation of each visual-semantic pair as a relational node, and we use a graph convolutional network to model and propagate such sample relations across visual-semantic pairs, including both intra-class commonality and inter-class uniqueness, to guide the relation propagation in the graph. However, since videos contain crucial sequences and ordering information, we propose a novel spatial-temporal difference module, which can facilitate the network to enhance the visual feature learning ability at both feature level and granular level for videos. Extensive experiments conducted on several challenging benchmarks demonstrate that our SRPN outperforms several state-of-the-art methods with a significant margin.
