2020
DOI: 10.48550/arxiv.2010.08021
Preprint

MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention

Abstract: This paper presents MAST, a new model for Multimodal Abstractive Text Summarization that utilizes information from all three modalities (text, audio and video) in a multimodal video. Prior work on multimodal abstractive text summarization utilized information only from the text and video modalities. We examine the usefulness and challenges of deriving information from the audio modality and present a sequence-to-sequence trimodal hierarchical attention-based model that overcomes these challenges by letting the…
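The "trimodal hierarchical attention" named in the abstract can be read as two levels of attention: one pass within each modality's encoder states, then a second pass across the three modality-level contexts. The following is a minimal, hypothetical NumPy sketch of that idea, not the authors' implementation; all function names, dimensions, and the plain dot-product scoring are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys):
    """Dot-product attention: weight `keys` (seq_len, d) by similarity
    to `query` (d,) and return the weighted context vector (d,)."""
    scores = keys @ query            # (seq_len,)
    weights = softmax(scores)        # attention distribution
    return weights @ keys            # context vector (d,)

def trimodal_hierarchical_attention(query, text, audio, video):
    """Level 1: attend within each modality's encoder states.
    Level 2: attend over the three modality-level contexts.
    A hypothetical simplification of MAST's trimodal attention."""
    contexts = np.stack([attend(query, m) for m in (text, audio, video)])  # (3, d)
    return attend(query, contexts)   # fused context (d,)

rng = np.random.default_rng(0)
d = 8
q = rng.standard_normal(d)           # e.g. a decoder hidden state
ctx = trimodal_hierarchical_attention(
    q,
    rng.standard_normal((5, d)),     # text encoder states
    rng.standard_normal((7, d)),     # audio encoder states
    rng.standard_normal((6, d)),     # video encoder states
)
print(ctx.shape)
```

The hierarchy lets the model down-weight an entire modality (e.g. noisy audio) at the second level instead of competing token-by-token in one flat attention.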

Cited by 4 publications (3 citation statements)
References 12 publications
“…Multimodal summarization focuses on distilling the most significant aspects of input data that span multiple modalities. Recent years have seen extensive exploration into generating text summaries from multimodal data, which may include text, visual, and auditory information [19][20][21]. A significant body of work has concentrated on incorporating visual information to enhance the quality of text summarization [3][4][5][22][23].…”
Section: Related Work
confidence: 99%
“…Based on the How2, Palaskar et al [1] first proposed to utilise the visual and textual information of video clips in the summary generation process. Khullar and Arora [26] incorporated audio to generate a summary of video content with visual and textual modalities. Liu et al [3] conducted multistage fusion to let multi-source modalities interact and applied the forget gate module to resist the noise flowing from multimodal semantics.…”
Section: Multimodal Abstractive Summarisation
confidence: 99%
“…dle the text input. Multimodal summarization aims to condense information from multimodal inputs, such as text, vision, and audio [12]. Recently, MS has been extensively studied [10, 2, 8, 31]. A large number of works focus on fusing visual information to improve the quality of text summaries [11].…”
confidence: 99%