Multimodal Abstractive Summarization for How2 Videos
2019 · Preprint · DOI: 10.48550/arxiv.1906.07901

Cited by 12 publications (20 citation statements: 0 supporting, 20 mentioning, 0 contrasting) · References 0 publications
“…The multimodal video captioning task is to generate captions given an input video together with its ASR transcript. Different from existing works (Sun et al., 2019a,b; Krishna et al., 2017; Zhou et al., 2018a,b) which use only the video signal, recent works (Shi et al., 2019; Palaskar et al., 2019; Hessel et al., 2019) study multimodal captioning by taking both video and transcript as input, and show that incorporating the transcript can largely improve performance. Our model achieves state-of-the-art results in both tasks.…”
Section: Related Work (mentioning)
Confidence: 92%
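To make the task description above concrete, here is a minimal PyTorch sketch of a captioner that encodes projected per-frame video features together with embedded ASR transcript tokens and decodes a caption. It is illustrative only, not the architecture of any work cited above; the class name, dimensions, and fusion-by-concatenation choice are all assumptions.

```python
import torch
import torch.nn as nn

class MultimodalCaptioner(nn.Module):
    """Toy video+transcript captioner: encode both modalities, decode text."""

    def __init__(self, vocab_size, d_model=512, video_feat_dim=2048):
        super().__init__()
        self.video_proj = nn.Linear(video_feat_dim, d_model)  # per-frame features -> d_model
        self.token_emb = nn.Embedding(vocab_size, d_model)    # transcript/caption tokens
        self.encoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.decoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, transcript_ids, caption_ids):
        # video_feats: (B, T_v, video_feat_dim); transcript_ids: (B, T_t); caption_ids: (B, T_c)
        fused = torch.cat([self.video_proj(video_feats),
                           self.token_emb(transcript_ids)], dim=1)  # concat along time
        _, state = self.encoder(fused)                              # joint encoding of both modalities
        dec_out, _ = self.decoder(self.token_emb(caption_ids), state)  # teacher forcing
        return self.out(dec_out)                                    # (B, T_c, vocab_size) logits
```

Concatenating the two modalities along the time axis is the simplest possible fusion; the works quoted above explore more sophisticated schemes.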
“…A more radical step towards building systems with better "real-world understanding" could arise from multimodal learners designed to aggregate audio, video, and text modalities, from movies for instance. Promising results have already been obtained along this path [15].…”
Section: What's Next? (mentioning)
Confidence: 86%
“…Fu et al. [29] and Li et al. [61] in their respective works use pre-trained CNNs to encode individual frames, which are then fed as input to randomly initialized bi-directional RNNs to capture the temporal dependencies across these frames. Libovický et al. [62] and Palaskar et al. [83] use a ResNeXt-101 3D convolutional neural network [33], trained to recognize 400 diverse human actions on the Kinetics dataset [48], to tackle the problem of generating text summaries for tutorial videos from the How2 dataset [100].…”
Section: Neural Models (mentioning)
Confidence: 99%
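A rough sketch of the frame-encoding recipe in this excerpt (pre-trained CNN per frame, randomly initialized bi-directional RNN over the frame features) follows. For brevity it uses a 2D ResNet-50 from torchvision in place of the 3D ResNeXt-101 the excerpt mentions; all names and sizes are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class FrameSequenceEncoder(nn.Module):
    """Pre-trained CNN per frame, then a randomly initialized bi-directional LSTM."""

    def __init__(self, hidden=512):
        super().__init__()
        cnn = resnet50(weights=ResNet50_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])  # keep everything up to pooling
        self.rnn = nn.LSTM(2048, hidden, batch_first=True, bidirectional=True)

    def forward(self, frames):
        # frames: (B, T, 3, 224, 224) -- a batch of T-frame clips
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)  # (B*T, 2048) frame embeddings
        out, _ = self.rnn(feats.view(b, t, -1))            # temporal dependencies across frames
        return out                                         # (B, T, 2*hidden)
```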
“…Decoder: Depending on the encoding strategy used, the textual decoders also vary, from a plain unidirectional RNN [133] generating one word at a time to hierarchical RNN decoders [12] performing this step at multiple levels of granularity. Although the vast majority of neural models focus only on generating a textual summary from multimodal input [13,56,57,62,83], some works also output images as a supplement to the generated summary [12,29,61,133,134], reinforcing the textual information and improving the user experience. These works either use a post-processing strategy to select the image(s) that become part of the final multimodal summary [12,133], or they incorporate this functionality into their proposed model [29,61,134].…”
Section: Neural Models (mentioning)
Confidence: 99%
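The simplest decoder variant surveyed above, a plain unidirectional RNN generating one word at a time, can be sketched as follows. This is a generic greedy decoder, not the method of any cited work; the class and method names are invented.

```python
import torch
import torch.nn as nn

class GreedyRNNDecoder(nn.Module):
    """Unidirectional LSTM decoder emitting one word per step (greedy)."""

    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.cell = nn.LSTMCell(d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    @torch.no_grad()
    def generate(self, h, c, bos_id, eos_id, max_len=50):
        # h, c: (B, d_model) initial states, e.g. taken from a multimodal encoder
        tok = torch.full((h.size(0),), bos_id, dtype=torch.long, device=h.device)
        steps = []
        for _ in range(max_len):
            h, c = self.cell(self.emb(tok), (h, c))
            tok = self.out(h).argmax(dim=-1)  # most likely next word
            steps.append(tok)
            if (tok == eos_id).all():         # stop once every sequence emitted EOS
                break
        return torch.stack(steps, dim=1)      # (B, <=max_len) generated word ids
```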