Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.144

Multistage Fusion with Forget Gate for Multimodal Summarization in Open-Domain Videos

Abstract: Multimodal summarization for open-domain videos is an emerging task, aiming to generate a summary from multisource information (video, audio, transcript). Despite the success of recent multiencoder-decoder frameworks on this task, existing methods lack fine-grained multimodality interactions of multisource inputs. Besides, unlike other multimodal tasks, this task has longer multimodal sequences with more redundancy and noise. To address these two issues, we propose a multistage fusion network with the fusion fo…
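The abstract only sketches the idea of a forget gate for filtering noisy multimodal input. As a rough illustration (not the authors' exact MFFG formulation; the module and tensor names below are hypothetical), a forget-gate fusion step can be read as a learned sigmoid gate that down-weights noisy positions of a cross-modal context before it is merged with the target modality:

```python
import torch
import torch.nn as nn

class ForgetGateFusion(nn.Module):
    """Illustrative forget-gate fusion: a sigmoid gate filters a cross-modal
    context before it is merged with the target-modality hidden states.
    (Hypothetical sketch, not the paper's exact equations.)"""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)   # produces the forget gate
        self.proj = nn.Linear(2 * dim, dim)   # merges gated context and target

    def forward(self, target: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # target:  (batch, seq_len, dim) hidden states of one modality (e.g. transcript)
        # context: (batch, seq_len, dim) features attended from another modality (e.g. video),
        #          already aligned to the target positions
        f = torch.sigmoid(self.gate(torch.cat([target, context], dim=-1)))
        filtered = f * context                 # suppress redundant / noisy positions
        return torch.tanh(self.proj(torch.cat([target, filtered], dim=-1)))

# minimal usage
fusion = ForgetGateFusion(dim=256)
text_h = torch.randn(2, 40, 256)
video_h = torch.randn(2, 40, 256)
fused = fusion(text_h, video_h)               # (2, 40, 256)
```

In a multistage setup, a step like this would be applied repeatedly as modalities are fused, so that noise is filtered at each stage rather than only once before decoding.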

Cited by 26 publications (16 citation statements). References 27 publications.
“…Khullar and Arora [26] incorporated audio to generate a summary of video content with visual and textual modalities. Liu et al [3] conducted multistage fusion to interact multi-source modalities together and applied the forget gate module to resist the noise flows from multimodal semantics. Shang et al [27] introduced a novel short-term order-sensitive attention mechanism to leverage the time clue inside video frames.…”
Section: Multimodal Abstractive Summarisation
confidence: 99%
“…To solve this issue, Liu et al. [3] proposed a single layer co-attention among multi-encoders to extract the multimodal semantics before the decoder, as shown in Figure 1b. These approaches only adopt a shallow fusion approach to model the semantics for multimodal fusion representation.…”
Section: Introduction
confidence: 99%
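For reference, a single-layer co-attention between two encoders, as described in the statement above, typically lets each modality attend over the other and returns contextualized representations for both. The sketch below is an illustrative approximation built from standard multi-head attention, not the cited model's exact layer; all names are hypothetical:

```python
import torch
import torch.nn as nn

class SingleLayerCoAttention(nn.Module):
    """Illustrative single-layer co-attention between two encoder outputs.
    Each modality queries the other; sequence lengths may differ.
    (Hypothetical sketch of the idea, not the cited model's exact layer.)"""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.text_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_h: torch.Tensor, video_h: torch.Tensor):
        # text_h:  (batch, text_len, dim); video_h: (batch, video_len, dim)
        text_ctx, _ = self.text_to_video(text_h, video_h, video_h)   # text attends to video
        video_ctx, _ = self.video_to_text(video_h, text_h, text_h)   # video attends to text
        return text_ctx, video_ctx

# minimal usage
coattn = SingleLayerCoAttention(dim=256)
t_ctx, v_ctx = coattn(torch.randn(2, 40, 256), torch.randn(2, 120, 256))
```

Applying such a layer once before the decoder is what the statement calls a "shallow" fusion: the modalities interact in a single pass, in contrast to the multistage interaction of the cited work.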
“…For text-based baselines, we employ Transformer (Vaswani et al 2017) and Pointer Generator Network (See, Liu, and Manning 2017) for generating explanations. In the multimodal setup, we adopt MFFG, the video summarization system proposed by Liu et al (2020). The MFFG architecture is a multi-stage fusion mechanism with a forget fusion gate acting as a multimodal noise filter.…”
Section: Comparative Systems
confidence: 99%
“…Recently, research into multimodal abstractive summarization (MAS) has provided approaches for integrating image and text modalities into a short, concise and readable textual summary [2, 3]. With the rapid development of deep learning technologies, more and more researchers have explored various methods for solving this task in unsupervised [4, 5] or supervised [3, 6, 7] approaches. In general, the current deep-learning-based schemes are inseparable from the extracting feature then downstream processing [8] paradigm.…”
Section: Introduction
confidence: 99%
“…Current research focuses more on processes of the multimodal fusion and textual generation steps instead of feature extraction, as the feature extractors have already been widely used in the fields of natural language processing (NLP) and computer vision (CV) and obtain good performance. In approaches of multimodal fusion, multiple inputs are fused by attention-based [9] or gate-based [3] mechanisms in order to learn a representation that is suitable for summary generation. Such solutions concentrate on aggregating features from several modalities.…”
Section: Introduction
confidence: 99%