Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3413975

Dual Path Interaction Network for Video Moment Localization

Abstract: Video moment localization aims to localize a specific moment in a video by a natural language query. Previous works either use alignment information to find the best-matching candidate (i.e., top-down approach) or use discrimination information to predict the temporal boundaries of the match (i.e., bottom-up approach). Little research has taken both the candidate-level alignment information and frame-level boundary information together and considered the complementarity between them. In this paper, we propos…

Cited by 52 publications (23 citation statements)
References 46 publications
“…However, it is also a highly computationally intensive scheme, so they use some tricks such as sampling to reduce the amount of operations. Wang et al [53] argued that utilizing both the frame-level and the candidate-level features will create complementary advantages. Therefore, they proposed Dual Path Interaction Network (DPIN), which utilizes Semantically Conditioned Interaction to accomplish the information transformation between the two levels.…”
Section: A Supervised Methods (mentioning)
confidence: 99%
“…While proposal-based approaches show reliable results, they are sensitive to proposal quality and suffer from the prohibitive cost of creating proposals, as well as the computationally inefficient comparison of all proposal-target pairings. Another line of works are the proposal-free approaches [6,7,8,15,29,31,36,42,45,52,53,55], which tries to regress the timespans directly. They are more flexible than proposal-based approaches in terms of granularity.…”
Section: Related Work (mentioning)
confidence: 99%
“…The majority of existing methods for video grounding can be categorized into two families: 1) proposal-based methods [2,5,13,14,17,25,26,33,43,45,48,50,54,56,57,58], which generate a bunch of proposals in advance and select the best match with target spans, and 2) proposal-free methods [6,7,8,15,29,31,36,42,45,52,53,55], which estimate start and end timestamps aligned to the given description directly. The proposal-based approaches generally show strong performance at the expense of prohibitive cost of proposal generation.…”
Section: Introduction (mentioning)
confidence: 99%
“…The latest methods either follow SAP to predict the probabilities of boundary across frames [3,32,6,4,22] or in the same spirit as MCN to select from a set of pre-defined proposals constructed by explicit sliding windows [9,19] or implicit multi-granularity anchors [33,29,31]. Recently, DPIN [27] proposed to combine the two localisation strategies by a dual path interaction network so to take the advantage of both. Regardless of their remarkable success, fully-supervised methods rely heavily on the fine-grained temporal annotation, which is not only expensive but also prone to subjective bias [1].…”
Section: Strong Supervision (mentioning)
confidence: 99%