Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval 2019
DOI: 10.1145/3331184.3331235
Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos

Abstract: Query-based moment retrieval aims to localize the most relevant moment in an untrimmed video according to a given natural language query. Existing works often focus on only one aspect of this emerging task, such as query representation learning, video context modeling, or multi-modal fusion, and thus fail to develop a comprehensive system for further performance improvement. In this paper, we introduce a novel Cross-Modal Interaction Network (CMIN) to consider multiple crucial factors for this challenging task…

Cited by 198 publications (153 citation statements). References 38 publications (77 reference statements).
“…Activity-Caption [18] was built on the ActivityNet v1.3 dataset [14] with diverse context. Following [48,50], we use val_1 as the validation set and val_2 as the testing set. We have 37,417, 17,505, and 17,031 moment-sentence pairs for training, validation, and testing, respectively.…”
Section: Datasets (mentioning)
confidence: 99%
“…It is the percentage of queries for which at least one of the candidate moments with top-n scores has an Intersection over Union (IoU) larger than m. We report results for n ∈ {1, 5} with m ∈ {0.1, 0.3, 0.5} for TACoS, n ∈ {1, 5} with m ∈ {0.5, 0.7} for Charades-STA, and n ∈ {1, 5} with m ∈ {0.3, 0.5, 0.7} for Activity-Caption, respectively. We evaluate our proposed DPIN approach on three datasets and compare our model with the state-of-the-art methods, including candidate-based (top-down) approaches: CTRL [9], MCF [39], ACRN [24], SAP [7], CMIN [50], ACL [10], SCDM [43], ROLE [25], SLTA [16], MAN [47], Xu et al [41], 2D-TAN [48]; and frame-based (bottom-up) approaches: ABLR [44], GDP [6], TGN [4], CBP [36], ExCL [11], DEBUG [27].…”
Section: Performance Comparison (mentioning)
confidence: 99%
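The R@n, IoU=m metric quoted above can be computed directly from predicted and ground-truth temporal segments. Below is a minimal Python sketch, assuming the common convention that a prediction counts as a hit when its IoU with the ground truth is at least m; the function names and data layout are illustrative, not taken from the cited papers.

def temporal_iou(pred, gt):
    """IoU between two temporal segments, each given as (start, end) in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return intersection / union if union > 0 else 0.0

def recall_at_n(predictions, ground_truths, n=1, m=0.5):
    """R@n, IoU=m: fraction of queries for which at least one of the
    top-n predicted moments overlaps the ground truth with IoU >= m.

    predictions: per-query lists of (start, end) moments sorted by score.
    ground_truths: one (start, end) ground-truth moment per query.
    """
    hits = sum(
        any(temporal_iou(p, gt) >= m for p in preds[:n])
        for preds, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)

# Example: R@1, IoU=0.5 over two queries (first is a hit, second is a miss).
preds = [[(5.0, 12.0), (0.0, 4.0)], [(20.0, 30.0)]]
gts = [(6.0, 13.0), (0.0, 10.0)]
print(recall_at_n(preds, gts, n=1, m=0.5))  # 0.5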
“…[Shou et al., 2017] employ temporal upsampling and spatial downsampling operations simultaneously. Furthermore, [Zhao et al., 2017] model the temporal structure of each action instance via a temporal pyramid. Other works skip proposal generation and directly detect action instances based on temporal convolutional layers.…”
Section: Related Work (mentioning)
confidence: 99%
“…As shown in Figure 1, the sentence describes multiple complicated events and corresponds to a temporal moment with complex object interactions. Recently, a large number of methods [4,12,15,33,40] have been proposed for this challenging task and have achieved satisfactory performance. However, most existing approaches are trained in the fully-supervised setting with temporal alignment annotations for each sentence.…”
Section: Introduction (mentioning)
confidence: 99%
“…As for the two-branch proposal module, the two branches have an identical structure and share all parameters. We first develop a conventional cross-modal interaction [4,40] between language and frame sequences. Next, we apply a 2D moment map [39] to capture relationships between adjacent moments.…”
Section: Introduction (mentioning)
confidence: 99%
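For context on the 2D moment map referenced in [39] above, the following is a minimal sketch of one way such a map can be built from clip-level features, where entry (i, j) holds the pooled feature of the candidate moment spanning clips i through j. It is an illustrative PyTorch sketch under assumed choices (average pooling, invented function and variable names), not the implementation from the cited papers.

import torch

def build_2d_moment_map(clip_feats: torch.Tensor) -> torch.Tensor:
    """Build a 2D map of candidate-moment features.

    clip_feats: (T, D) features for T consecutive video clips.
    Returns a (T, T, D) tensor where entry (i, j) with i <= j holds the
    average-pooled feature of the moment spanning clips i..j; entries
    with i > j are left as zeros (invalid moments).
    """
    T, D = clip_feats.shape
    moment_map = torch.zeros(T, T, D)
    # Prefix sums make each (start, end) average-pooling an O(1) lookup.
    prefix = torch.cat([torch.zeros(1, D), clip_feats.cumsum(dim=0)], dim=0)
    for start in range(T):
        for end in range(start, T):
            length = end - start + 1
            moment_map[start, end] = (prefix[end + 1] - prefix[start]) / length
    return moment_map

# Example: 16 clips with 512-d features -> a (16, 16, 512) moment map.
feats = torch.randn(16, 512)
print(build_2d_moment_map(feats).shape)

A scoring head applied over such a map would then yield one confidence per valid (start, end) candidate, which is one way relationships between adjacent moments can be modeled jointly.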