Proceedings of the First International Workshop on Natural Language Processing Beyond Text 2020
DOI: 10.18653/v1/2020.nlpbt-1.7

MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention

Abstract: This paper presents MAST, a new model for Multimodal Abstractive Text Summarization that utilizes information from all three modalities (text, audio, and video) in a multimodal video. Prior work on multimodal abstractive text summarization utilized information from only the text and video modalities. We examine the usefulness and challenges of deriving information from the audio modality and present a sequence-to-sequence trimodal hierarchical attention-based model that overcomes these challenges by letting the…
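The trimodal hierarchical attention described in the abstract can be sketched roughly as follows. This is a minimal illustration, assuming additive (Bahdanau-style) attention and a shared encoder dimension across the three modalities; the class and parameter names are hypothetical and this is not the authors' released implementation.

```python
# Minimal sketch of two-level (hierarchical) attention over three modalities.
# Assumptions (not from the paper): additive attention, shared enc_dim.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Scores each encoder state against the current decoder state."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.w_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_states, dec_state):
        # enc_states: (batch, seq_len, enc_dim); dec_state: (batch, dec_dim)
        scores = self.v(torch.tanh(
            self.w_enc(enc_states) + self.w_dec(dec_state).unsqueeze(1)
        )).squeeze(-1)                              # (batch, seq_len)
        weights = F.softmax(scores, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)
        return context                              # (batch, enc_dim)

class TrimodalHierarchicalAttention(nn.Module):
    """Level 1: attend within each modality's encoder states.
    Level 2: attend over the three per-modality context vectors."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.modality_attn = nn.ModuleList(
            AdditiveAttention(enc_dim, dec_dim, attn_dim) for _ in range(3)
        )
        self.fusion_attn = AdditiveAttention(enc_dim, dec_dim, attn_dim)

    def forward(self, text_enc, audio_enc, video_enc, dec_state):
        contexts = torch.stack([
            attn(enc, dec_state)
            for attn, enc in zip(self.modality_attn,
                                 (text_enc, audio_enc, video_enc))
        ], dim=1)                                   # (batch, 3, enc_dim)
        return self.fusion_attn(contexts, dec_state)  # fused context vector
```

The second-level attention is what lets the decoder reweight whole modalities at each step, rather than committing to a fixed fusion of the three streams.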

Cited by 37 publications (4 citation statements)
References 27 publications
“…For each video clip, following previous works (Sanabria et al., 2018; Palaskar et al., 2019; Khullar and Arora, 2020), a 2048-dimensional feature representation is extracted for every 16 non-overlapping frames using a 3D ResNeXt-101 model (Hara et al., 2018), which is pre-trained on the Kinetics dataset (Kay et al., 2017). Therefore, each data sample will have a sequence of 2048-𝑑 vision feature vectors of length 𝑀.…”
Section: Video Feature Extraction (mentioning)
confidence: 99%
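As a concrete illustration of the extraction step this statement describes (splitting the video into non-overlapping 16-frame clips and encoding each with a Kinetics-pretrained 3D CNN), here is a minimal sketch. The citing papers use the 3D ResNeXt-101 of Hara et al. (2018), which yields 2048-dimensional features; torchvision does not ship that model, so this sketch substitutes torchvision's r3d_18 (Kinetics-400 pretrained, 512-dimensional features) purely as a stand-in, and the function name is hypothetical.

```python
# Sketch: encode non-overlapping 16-frame clips with a Kinetics-pretrained
# 3D CNN. r3d_18 stands in for the 3D ResNeXt-101 used in the papers.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18, R3D_18_Weights

def extract_clip_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (num_frames, 3, 112, 112), already normalized per the
    model's preprocessing. Returns one feature vector per 16-frame clip."""
    model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)
    model.fc = nn.Identity()   # drop the classifier; keep pooled features
    model.eval()

    clip_len = 16
    num_clips = frames.shape[0] // clip_len   # trailing frames are dropped
    # (num_clips, clip_len, 3, H, W) -> (num_clips, 3, clip_len, H, W)
    clips = (
        frames[: num_clips * clip_len]
        .reshape(num_clips, clip_len, 3, 112, 112)
        .permute(0, 2, 1, 3, 4)
    )
    with torch.no_grad():
        feats = model(clips)   # (num_clips, 512) with r3d_18;
                               # 2048-d with the papers' 3D ResNeXt-101
    return feats               # the length-M sequence of vision features
```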
“…[1] first proposed to utilise the visual and textual information of video clips into the summary generation process. Khullar and Arora [26] incorporated audio to generate a summary of video content with visual and textual modalities. Liu et al.…”
Section: Related Work (mentioning)
confidence: 99%