Multimodal Abstractive Summarization for How2 Videos
2019 · Preprint · DOI: 10.48550/arxiv.1906.07901

Cited by 12 publications (20 citation statements: 0 supporting, 20 mentioning, 0 contrasting) · References 0 publications
“…The multimodal video captioning task is to generate captions given an input video together with its ASR transcript. Different from existing works (Sun et al., 2019a,b; Krishna et al., 2017; Zhou et al., 2018a,b) which use only the video signal, recent works (Shi et al., 2019; Palaskar et al., 2019; Hessel et al., 2019) study multimodal captioning by taking both video and transcript as input, and show that incorporating the transcript can largely improve performance. Our model achieves state-of-the-art results in both tasks.…”
Section: Related Work (mentioning)
Confidence: 92%
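To make the task description above concrete, here is a minimal PyTorch sketch of a captioner that encodes projected per-frame video features together with embedded ASR transcript tokens and decodes a caption. It is illustrative only, not the architecture of any work cited above; the class name, dimensions, and fusion-by-concatenation choice are all assumptions.

```python
import torch
import torch.nn as nn

class MultimodalCaptioner(nn.Module):
    """Toy video+transcript captioner: encode both modalities, decode text."""

    def __init__(self, vocab_size, d_model=512, video_feat_dim=2048):
        super().__init__()
        self.video_proj = nn.Linear(video_feat_dim, d_model)  # per-frame features -> d_model
        self.token_emb = nn.Embedding(vocab_size, d_model)    # transcript/caption tokens
        self.encoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.decoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, transcript_ids, caption_ids):
        # video_feats: (B, T_v, video_feat_dim); transcript_ids: (B, T_t); caption_ids: (B, T_c)
        fused = torch.cat([self.video_proj(video_feats),
                           self.token_emb(transcript_ids)], dim=1)  # concat along time
        _, state = self.encoder(fused)                              # joint encoding of both modalities
        dec_out, _ = self.decoder(self.token_emb(caption_ids), state)  # teacher forcing
        return self.out(dec_out)                                    # (B, T_c, vocab_size) logits
```

Concatenating the two modalities along the time axis is the simplest possible fusion; the works quoted above explore more sophisticated schemes.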
“…A more radical step towards building systems with better "real-world understanding" could arise from multimodal learners designed to aggregate audio, video, and text modalities, from movies for instance. Promising results have already been obtained along this path [15].…”
Section: What's Next? (mentioning)
Confidence: 86%
“…Fu et al. [29] and Li et al. [61] in their respective works use pre-trained CNNs to encode individual frames, which are then fed as input to randomly initialized bi-directional RNNs to capture the temporal dependencies across these frames. Libovický et al. [62] and Palaskar et al. [83] use a ResNeXt-101 3D convolutional neural network [33], trained to recognize 400 diverse human actions on the Kinetics dataset [48], to tackle the problem of generating text summaries for tutorial videos from the How2 dataset [100].…”
Section: Neural Models (mentioning)
Confidence: 99%
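A rough sketch of the frame-encoding recipe in this excerpt (pre-trained CNN per frame, randomly initialized bi-directional RNN over the frame features) follows. For brevity it uses a 2D ResNet-50 from torchvision in place of the 3D ResNeXt-101 the excerpt mentions; all names and sizes are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class FrameSequenceEncoder(nn.Module):
    """Pre-trained CNN per frame, then a randomly initialized bi-directional LSTM."""

    def __init__(self, hidden=512):
        super().__init__()
        cnn = resnet50(weights=ResNet50_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])  # keep everything up to pooling
        self.rnn = nn.LSTM(2048, hidden, batch_first=True, bidirectional=True)

    def forward(self, frames):
        # frames: (B, T, 3, 224, 224) -- a batch of T-frame clips
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).flatten(1)  # (B*T, 2048) frame embeddings
        out, _ = self.rnn(feats.view(b, t, -1))            # temporal dependencies across frames
        return out                                         # (B, T, 2*hidden)
```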
“…Decoder: Depending on the encoding strategy used, the textual decoders also vary, from a plain unidirectional RNN [133] generating one word at a time to hierarchical RNN decoders [12] performing this step at multiple levels of granularity. Although the vast majority of neural models focus only on generating a textual summary from multimodal input [13,56,57,62,83], some works also output images as a supplement to the generated summary [12,29,61,133,134], reinforcing the textual information and improving the user experience. These works either use a post-processing strategy to select the image(s) that become part of the final multimodal summary [12,133], or they incorporate this functionality into their proposed model [29,61,134].…”
Section: Neural Models (mentioning)
Confidence: 99%
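The simplest decoder variant surveyed above, a plain unidirectional RNN generating one word at a time, can be sketched as follows. This is a generic greedy decoder, not the method of any cited work; the class and method names are invented.

```python
import torch
import torch.nn as nn

class GreedyRNNDecoder(nn.Module):
    """Unidirectional LSTM decoder emitting one word per step (greedy)."""

    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.cell = nn.LSTMCell(d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    @torch.no_grad()
    def generate(self, h, c, bos_id, eos_id, max_len=50):
        # h, c: (B, d_model) initial states, e.g. taken from a multimodal encoder
        tok = torch.full((h.size(0),), bos_id, dtype=torch.long, device=h.device)
        steps = []
        for _ in range(max_len):
            h, c = self.cell(self.emb(tok), (h, c))
            tok = self.out(h).argmax(dim=-1)  # most likely next word
            steps.append(tok)
            if (tok == eos_id).all():         # stop once every sequence emitted EOS
                break
        return torch.stack(steps, dim=1)      # (B, <=max_len) generated word ids
```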