Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.144

Multistage Fusion with Forget Gate for Multimodal Summarization in Open-Domain Videos

Abstract: Multimodal summarization for open-domain videos is an emerging task, aiming to generate a summary from multisource information (video, audio, transcript). Despite the success of recent multiencoder-decoder frameworks on this task, existing methods lack fine-grained multimodality interactions of multisource inputs. Besides, unlike other multimodal tasks, this task has longer multimodal sequences with more redundancy and noise. To address these two issues, we propose a multistage fusion network with the fusion fo…
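The abstract only sketches the idea of a forget gate for filtering noisy multimodal input. As a rough illustration (not the authors' exact MFFG formulation; the module and tensor names below are hypothetical), a forget-gate fusion step can be read as a learned sigmoid gate that down-weights noisy positions of a cross-modal context before it is merged with the target modality:

```python
import torch
import torch.nn as nn

class ForgetGateFusion(nn.Module):
    """Illustrative forget-gate fusion: a sigmoid gate filters a cross-modal
    context before it is merged with the target-modality hidden states.
    (Hypothetical sketch, not the paper's exact equations.)"""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)   # produces the forget gate
        self.proj = nn.Linear(2 * dim, dim)   # merges gated context and target

    def forward(self, target: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # target:  (batch, seq_len, dim) hidden states of one modality (e.g. transcript)
        # context: (batch, seq_len, dim) features attended from another modality (e.g. video),
        #          already aligned to the target positions
        f = torch.sigmoid(self.gate(torch.cat([target, context], dim=-1)))
        filtered = f * context                 # suppress redundant / noisy positions
        return torch.tanh(self.proj(torch.cat([target, filtered], dim=-1)))

# minimal usage
fusion = ForgetGateFusion(dim=256)
text_h = torch.randn(2, 40, 256)
video_h = torch.randn(2, 40, 256)
fused = fusion(text_h, video_h)               # (2, 40, 256)
```

In a multistage setup, a step like this would be applied repeatedly as modalities are fused, so that noise is filtered at each stage rather than only once before decoding.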

Cited by 26 publications (16 citation statements). References 27 publications.
“…Khullar and Arora [26] incorporated audio to generate a summary of video content with visual and textual modalities. Liu et al [3] conducted multistage fusion to interact multi-source modalities together and applied the forget gate module to resist the noise flows from multimodal semantics. Shang et al [27] introduced a novel short-term order-sensitive attention mechanism to leverage the time clue inside video frames.…”
Section: Multimodal Abstractive Summarisation
confidence: 99%
“…To solve this issue, Liu et al. [3] proposed a single layer co-attention among multi-encoders to extract the multimodal semantics before the decoder, as shown in Figure 1b. These approaches only adopt a shallow fusion approach to model the semantics for multimodal fusion representation.…”
Section: Introduction
confidence: 99%
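For reference, a single-layer co-attention between two encoders, as described in the statement above, typically lets each modality attend over the other and returns contextualized representations for both. The sketch below is an illustrative approximation built from standard multi-head attention, not the cited model's exact layer; all names are hypothetical:

```python
import torch
import torch.nn as nn

class SingleLayerCoAttention(nn.Module):
    """Illustrative single-layer co-attention between two encoder outputs.
    Each modality queries the other; sequence lengths may differ.
    (Hypothetical sketch of the idea, not the cited model's exact layer.)"""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.text_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_h: torch.Tensor, video_h: torch.Tensor):
        # text_h:  (batch, text_len, dim); video_h: (batch, video_len, dim)
        text_ctx, _ = self.text_to_video(text_h, video_h, video_h)   # text attends to video
        video_ctx, _ = self.video_to_text(video_h, text_h, text_h)   # video attends to text
        return text_ctx, video_ctx

# minimal usage
coattn = SingleLayerCoAttention(dim=256)
t_ctx, v_ctx = coattn(torch.randn(2, 40, 256), torch.randn(2, 120, 256))
```

Applying such a layer once before the decoder is what the statement calls a "shallow" fusion: the modalities interact in a single pass, in contrast to the multistage interaction of the cited work.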
“…For text-based baselines, we employ Transformer (Vaswani et al 2017) and Pointer Generator Network (See, Liu, and Manning 2017) for generating explanations. In the multimodal setup, we adopt MFFG, the video summarization system proposed by Liu et al (2020). The MFFG architecture is a multi-stage fusion mechanism with a forget fusion gate acting as a multimodal noise filter.…”
Section: Comparative Systems
confidence: 99%
“…Recently, research into multimodal abstractive summarization (MAS) has provided approaches for integrating image and text modalities into a short, concise and readable textual summary [2, 3]. With the rapid development of deep learning technologies, more and more researchers have explored various methods for solving this task in unsupervised [4, 5] or supervised [3, 6, 7] approaches. In general, the current deep-learning-based schemes are inseparable from the extracting feature then downstream processing [8] paradigm.…”
Section: Introduction
confidence: 99%
“…Current research focuses more on processes of the multimodal fusion and textual generation steps instead of feature extraction, as the feature extractors have already been widely used in the fields of natural language processing (NLP) and computer vision (CV) and obtain good performance. In approaches of multimodal fusion, multiple inputs are fused by attention-based [9] or gate-based [3] mechanisms in order to learn a representation that is suitable for summary generation. Such solutions concentrate on aggregating features from several modalities.…”
Section: Introduction
confidence: 99%