Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1659

Multimodal Abstractive Summarization for How2 Videos

Abstract: In this paper, we study abstractive summarization for open-domain videos. Unlike traditional text news summarization, the goal is less to "compress" text information than to provide a fluent textual summary of information that has been collected and fused from different source modalities, in our case video and audio transcripts (or text). We show how a multi-source sequence-to-sequence model with hierarchical attention can integrate information from different modalities into a coherent output, compar…
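The abstract's "multi-source sequence-to-sequence model with hierarchical attention" combines the modalities in two steps: a separate attention over each encoder's states, followed by a second attention over the resulting per-modality context vectors. The PyTorch sketch below is a minimal illustration of that two-level scheme; the class names, dimensions, and the assumption of a shared context dimension are ours, not the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityAttention(nn.Module):
    # Standard additive attention over one encoder's state sequence.
    def __init__(self, dec_dim, enc_dim, att_dim):
        super().__init__()
        self.proj_dec = nn.Linear(dec_dim, att_dim)
        self.proj_enc = nn.Linear(enc_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, dec_state, enc_states):
        # dec_state: (batch, dec_dim); enc_states: (batch, src_len, enc_dim)
        energy = self.score(torch.tanh(
            self.proj_dec(dec_state).unsqueeze(1) + self.proj_enc(enc_states)))
        weights = F.softmax(energy, dim=1)            # (batch, src_len, 1)
        return (weights * enc_states).sum(dim=1)      # per-modality context vector

class HierarchicalFusion(nn.Module):
    # Second-level attention that weighs the per-modality context vectors
    # (here: transcript text and video features, projected to a common size).
    def __init__(self, dec_dim, ctx_dim, att_dim):
        super().__init__()
        self.text_att = ModalityAttention(dec_dim, ctx_dim, att_dim)
        self.video_att = ModalityAttention(dec_dim, ctx_dim, att_dim)
        self.proj_dec = nn.Linear(dec_dim, att_dim)
        self.proj_ctx = nn.Linear(ctx_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, dec_state, text_states, video_states):
        contexts = torch.stack([
            self.text_att(dec_state, text_states),
            self.video_att(dec_state, video_states),
        ], dim=1)                                     # (batch, 2, ctx_dim)
        energy = self.score(torch.tanh(
            self.proj_dec(dec_state).unsqueeze(1) + self.proj_ctx(contexts)))
        beta = F.softmax(energy, dim=1)               # modality weights
        return (beta * contexts).sum(dim=1)           # fused context for the decoder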

Cited by 62 publications (45 citation statements)
References 43 publications (39 reference statements)
“…Multimodal Summarization. A series of works (Palaskar et al., 2019; Chan et al., 2019; Gao et al., 2020a) focused on generating better textual summaries with the help of multimodal input. Multimodal summarization with multimodal output is relatively less explored.…”
Section: Related Work (mentioning)
confidence: 99%
“…How2: a model proposed to generate a textual summary with video information (Palaskar et al., 2019). Synergistic: an image-question-answer synergistic network to value the role of the answer for precise visual dialog (Guo et al., 2019).…”
Section: Multimodal Baselines (mentioning)
confidence: 99%
“…VideoRNN (Palaskar et al., 2019): a video-only baseline model implemented on the How2 dataset.…”
Section: Baseline Models (mentioning)
confidence: 99%
“…Existing approaches have obtained promising results. For example, Libovický et al. (2018) and Palaskar et al. (2019) utilize multiple encoders to encode videos and audio transcripts and a joint decoder to decode the multi-source encodings, which achieves better performance than single-modality structures. Despite the effectiveness of these approaches, they only perform multimodal fusion during the decoding stage to generate a target sequence, lacking fine-grained interactions between multi-source inputs to complete the missing information of each modality.…”
Section: Introduction (mentioning)
confidence: 99%
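The quote above describes the "multiple encoders, joint decoder" setup. As a follow-up to the sketch given after the abstract, the toy snippet below shows that setup in use: transcript and video encoders each produce their own state sequence, and the fused context from the hierarchical attention is what the joint decoder consumes at each step. The shapes and variable names are illustrative assumptions only.

import torch

# Reuses the HierarchicalFusion sketch defined after the abstract above.
batch, txt_len, vid_len, dim = 4, 50, 20, 256
text_states  = torch.randn(batch, txt_len, dim)   # transcript encoder outputs
video_states = torch.randn(batch, vid_len, dim)   # video-feature encoder outputs
dec_state    = torch.randn(batch, dim)            # current decoder hidden state

fusion = HierarchicalFusion(dec_dim=dim, ctx_dim=dim, att_dim=128)
fused_context = fusion(dec_state, text_states, video_states)
print(fused_context.shape)  # torch.Size([4, 256]); fed to the decoder's output projection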
“…In particular, Ramanishka et al. (2016) … However, we suspect that the instructional video domain is significantly different from MSR-VTT (where the audio information does not necessarily correspond to human speech), as we find that ASR-only models significantly surpass the state-of-the-art video model in our case. Palaskar et al. (2019) and Shi et al. (2019), contemporaneous with the submission of the present work, also examine ASR as a source of signal for generating how-to video captions.…”
Section: Related Work (mentioning)
confidence: 99%