Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.435
Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA

Abstract: Videos convey rich information. Dynamic spatio-temporal relationships between people/objects, and diverse multimodal events are present in a video clip. Hence, it is important to develop automated models that can accurately extract such information from videos. Answering questions on videos is one of the tasks which can evaluate such AI abilities. In this paper, we propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to an…

Cited by 28 publications (25 citation statements) | References 39 publications

Citation statements, ordered by relevance:
“…Inspired by recent work (Kim and Bansal, 2019; Kim et al., 2020) that uses dense captions (Johnson et al., 2016; Yang et al., 2017) to improve image and video QA models, we propose to add dense captions as an auxiliary text input that provides aligned visual cues to ease the difficulties of learning a video-text matching objective from often temporally and semantically misaligned ASR captions. In addition, we also propose a constrained attention loss, which applies an entropy minimization-based regularization (Tanaka et al., 2018; Yi and Wu, 2019) to the model to encourage higher attention scores from the video to the correct matched caption among a pool of ASR caption candidates.…”
Section: Related Work
confidence: 99%
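A rough sketch of the constrained attention loss described in the statement above is given below, assuming a PyTorch setting in which `attn_logits` are video-to-caption attention scores over a pool of ASR caption candidates and `pos_mask` marks the correct matched caption(s); the function name, tensor shapes, and the 0.1 entropy weight are illustrative assumptions, not details from the cited work.

```python
# Minimal sketch (hypothetical names/shapes): a constrained attention loss that
# (a) pushes attention mass toward the known matched caption and
# (b) minimizes the entropy of the attention distribution over candidates.
import torch
import torch.nn.functional as F

def constrained_attention_loss(attn_logits: torch.Tensor,
                               pos_mask: torch.Tensor,
                               entropy_weight: float = 0.1) -> torch.Tensor:
    """attn_logits: (batch, num_captions) video-to-caption scores.
    pos_mask:       (batch, num_captions) 1.0 for the correct matched caption(s)."""
    attn = F.softmax(attn_logits, dim=-1)                      # (B, C)
    # Matching term: negative log of the attention assigned to correct captions.
    pos_prob = (attn * pos_mask).sum(dim=-1).clamp_min(1e-8)   # (B,)
    match_loss = -torch.log(pos_prob).mean()
    # Entropy-minimization regularizer: encourage peaked (confident) attention.
    entropy = -(attn * torch.log(attn.clamp_min(1e-8))).sum(dim=-1).mean()
    return match_loss + entropy_weight * entropy
```

The entropy term discourages the model from spreading attention uniformly over the (often misaligned) ASR caption candidates, while the matching term ties the peak to the supervised correct caption.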
“…Each of these auxiliary tasks generates uni-modal outputs, dealing with either image or text. In a bid to combine the benefits of learning signals from both the visuo-spatial and language modalities, we propose the use of dense captioning as the dual task, which has proven useful as a source of complementary information for many vision-language tasks (Wu et al., 2019; Kim et al., 2020; Li et al., 2019b). Dense captioning models provide regional bounding boxes for objects in the input image and also describe each region.…”
Section: VLC
confidence: 99%
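As a concrete illustration of how dense-captioning output (bounding boxes plus region descriptions) might be folded into a text-based QA model, the sketch below defines a hypothetical `DenseCaption` record and flattens the top regions into an auxiliary text string; the names, the `max_regions` cutoff, and the joining scheme are assumptions for illustration, not taken from the cited papers.

```python
# Minimal sketch (hypothetical structure): dense-captioning output for one frame,
# i.e. regional bounding boxes plus a short description per region, flattened
# into auxiliary text that can be appended to the model's textual input.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DenseCaption:
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    text: str                               # region description

def captions_to_aux_text(captions: List[DenseCaption], max_regions: int = 10) -> str:
    # Keep the first max_regions regions (assumed pre-sorted by detector score)
    # and join their descriptions into one auxiliary text string for the QA model.
    return " . ".join(c.text for c in captions[:max_regions])

aux = captions_to_aux_text([
    DenseCaption((10, 20, 120, 200), "a man holding a guitar"),
    DenseCaption((130, 40, 300, 220), "a woman sitting on a couch"),
])
```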
“…To train the ranker, we used a binary cross entropy loss, where paragraphs containing gold SFs (henceforth, supporting paragraphs) are used as positive instances and the other distractor paragraphs as negative instances. Following Kim et al. (2020), we also randomly sampled one supporting paragraph from another question for each question and used it as an additional negative instance.…”
Section: Relevant Paragraph Prediction
confidence: 99%
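The negative-sampling recipe in this statement can be sketched as follows, assuming a simple per-question data structure; the function names, the dictionary keys, and the use of PyTorch's `binary_cross_entropy_with_logits` are illustrative assumptions rather than the authors' actual implementation.

```python
# Minimal sketch (hypothetical names): training a paragraph ranker with binary
# cross entropy, where supporting paragraphs are positives, in-question
# distractors are negatives, and one supporting paragraph randomly sampled from
# a *different* question is added as an extra negative per question.
import random
import torch
import torch.nn.functional as F

def ranker_loss(scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # scores/labels: (num_paragraphs,) relevance logits and 0/1 gold labels.
    return F.binary_cross_entropy_with_logits(scores, labels.float())

def build_paragraphs_and_labels(questions, q_idx):
    """questions[i] = {'paragraphs': [...], 'supporting_idx': set(...)}"""
    q = questions[q_idx]
    paras = list(q['paragraphs'])
    labels = [1.0 if i in q['supporting_idx'] else 0.0 for i in range(len(paras))]
    # Cross-question negative: a supporting paragraph from another question.
    other = random.choice([j for j in range(len(questions)) if j != q_idx])
    neg_idx = random.choice(sorted(questions[other]['supporting_idx']))
    paras.append(questions[other]['paragraphs'][neg_idx])
    labels.append(0.0)
    return paras, torch.tensor(labels)
```

In this sketch, the ranker would score each returned paragraph (e.g. with a BERT-style encoder) and the scores would be passed to `ranker_loss` together with the constructed labels.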