Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.435
Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA

Abstract: Videos convey rich information. Dynamic spatio-temporal relationships between people/objects, and diverse multimodal events are present in a video clip. Hence, it is important to develop automated models that can accurately extract such information from videos. Answering questions on videos is one of the tasks which can evaluate such AI abilities. In this paper, we propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to an…

Cited by 28 publications (25 citation statements) | References 39 publications

Citation statements, ordered by relevance:
“…Inspired by recent work (Kim and Bansal, 2019; Kim et al., 2020) that uses dense captions (Johnson et al., 2016; Yang et al., 2017) to improve image and video QA models, we propose to add dense captions as an auxiliary text input that provides aligned visual cues to ease the difficulties of learning a video-text matching objective from often temporally and semantically misaligned ASR captions. In addition, we also propose a constrained attention loss, which applies an entropy minimization-based regularization (Tanaka et al., 2018; Yi and Wu, 2019) to the model to encourage higher attention scores from the video to the correct matched caption among a pool of ASR caption candidates.…”
Section: Related Work
confidence: 99%
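A rough sketch of the constrained attention loss described in the statement above is given below, assuming a PyTorch setting in which `attn_logits` are video-to-caption attention scores over a pool of ASR caption candidates and `pos_mask` marks the correct matched caption(s); the function name, tensor shapes, and the 0.1 entropy weight are illustrative assumptions, not details from the cited work.

```python
# Minimal sketch (hypothetical names/shapes): a constrained attention loss that
# (a) pushes attention mass toward the known matched caption and
# (b) minimizes the entropy of the attention distribution over candidates.
import torch
import torch.nn.functional as F

def constrained_attention_loss(attn_logits: torch.Tensor,
                               pos_mask: torch.Tensor,
                               entropy_weight: float = 0.1) -> torch.Tensor:
    """attn_logits: (batch, num_captions) video-to-caption scores.
    pos_mask:       (batch, num_captions) 1.0 for the correct matched caption(s)."""
    attn = F.softmax(attn_logits, dim=-1)                      # (B, C)
    # Matching term: negative log of the attention assigned to correct captions.
    pos_prob = (attn * pos_mask).sum(dim=-1).clamp_min(1e-8)   # (B,)
    match_loss = -torch.log(pos_prob).mean()
    # Entropy-minimization regularizer: encourage peaked (confident) attention.
    entropy = -(attn * torch.log(attn.clamp_min(1e-8))).sum(dim=-1).mean()
    return match_loss + entropy_weight * entropy
```

The entropy term discourages the model from spreading attention uniformly over the (often misaligned) ASR caption candidates, while the matching term ties the peak to the supervised correct caption.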
“…Each of these auxiliary tasks generates uni-modal outputs, dealing with either image or text. In a bid to combine the benefits of learning signals from both the visuo-spatial and language modalities, we propose the use of dense captioning as the dual task, which has proven useful as a source of complementary information for many vision-language tasks (Wu et al., 2019; Kim et al., 2020; Li et al., 2019b). Dense captioning models provide regional bounding boxes for objects in the input image and also describe each region.…”
Section: VLC
confidence: 99%
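As a concrete illustration of how dense-captioning output (bounding boxes plus region descriptions) might be folded into a text-based QA model, the sketch below defines a hypothetical `DenseCaption` record and flattens the top regions into an auxiliary text string; the names, the `max_regions` cutoff, and the joining scheme are assumptions for illustration, not taken from the cited papers.

```python
# Minimal sketch (hypothetical structure): dense-captioning output for one frame,
# i.e. regional bounding boxes plus a short description per region, flattened
# into auxiliary text that can be appended to the model's textual input.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DenseCaption:
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    text: str                               # region description

def captions_to_aux_text(captions: List[DenseCaption], max_regions: int = 10) -> str:
    # Keep the first max_regions regions (assumed pre-sorted by detector score)
    # and join their descriptions into one auxiliary text string for the QA model.
    return " . ".join(c.text for c in captions[:max_regions])

aux = captions_to_aux_text([
    DenseCaption((10, 20, 120, 200), "a man holding a guitar"),
    DenseCaption((130, 40, 300, 220), "a woman sitting on a couch"),
])
```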
“…To train the ranker, we used a binary cross entropy loss, where paragraphs containing gold SFs (henceforth, supporting paragraphs) are used as positive instances and the other distractor paragraphs as negative instances. Following Kim et al. (2020), we also randomly sampled one supporting paragraph from another question for each question and used it as an additional negative instance.…”
Section: Relevant Paragraph Prediction
confidence: 99%
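The negative-sampling recipe in this statement can be sketched as follows, assuming a simple per-question data structure; the function names, the dictionary keys, and the use of PyTorch's `binary_cross_entropy_with_logits` are illustrative assumptions rather than the authors' actual implementation.

```python
# Minimal sketch (hypothetical names): training a paragraph ranker with binary
# cross entropy, where supporting paragraphs are positives, in-question
# distractors are negatives, and one supporting paragraph randomly sampled from
# a *different* question is added as an extra negative per question.
import random
import torch
import torch.nn.functional as F

def ranker_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # scores/labels: (num_paragraphs,) relevance logits and 0/1 gold labels.
    return F.binary_cross_entropy_with_logits(scores, labels.float())

def build_paragraphs_and_labels(questions, q_idx):
    """questions[i] = {'paragraphs': [...], 'supporting_idx': set(...)}"""
    q = questions[q_idx]
    paras = list(q['paragraphs'])
    labels = [1.0 if i in q['supporting_idx'] else 0.0 for i in range(len(paras))]
    # Cross-question negative: a supporting paragraph from another question.
    other = random.choice([j for j in range(len(questions)) if j != q_idx])
    neg_idx = random.choice(sorted(questions[other]['supporting_idx']))
    paras.append(questions[other]['paragraphs'][neg_idx])
    labels.append(0.0)
    return paras, torch.tensor(labels)
```

In this sketch, the ranker would score each returned paragraph (e.g. with a BERT-style encoder) and the scores would be passed to `ranker_loss` together with the constructed labels.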