2022 | DOI: 10.3390/e24060764

Inter- and Intra-Modal Contrastive Hybrid Learning Framework for Multimodal Abstractive Summarization

Abstract: Abstractive summarization technologies let internet users grasp an article by reading only its summary instead of the entire text. However, techniques that analyze articles containing both text and images are hampered by the semantic gap between vision and language: they concentrate on aggregating features while neglecting the heterogeneity of each modality. At the same time, the lack of consideration of intrinsic data properties withi…

Cited by 6 publications (5 citation statements) | References: 55 publications

Citation statements:
“…Li et al. [21] introduced an Inter- and Intra-modal Contrastive Hybrid (ITCH) framework that automatically aligns multimodal information and summarizes it accordingly. ITCH takes bi-modal input, text and image, and passes it through a patch-oriented encoder and a textual encoder to extract features.…”
Section: Related Work (mentioning)
confidence: 99%
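The statement above sketches ITCH's bi-modal front end: a patch-oriented encoder for the image and a textual encoder for the article. Below is a minimal PyTorch sketch of such a two-encoder setup; the module names, dimensions, and layer counts are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Patch-oriented image encoder: embeds non-overlapping patches
    (ViT-style) and contextualizes them. Dimensions are illustrative."""
    def __init__(self, patch=16, dim=512, depth=4):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, images):                 # (B, 3, H, W)
        x = self.proj(images)                  # (B, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)       # (B, num_patches, dim)
        return self.encoder(x)                 # patch-level visual features

class TextEncoder(nn.Module):
    """Textual encoder over token ids; a stand-in for whatever
    pretrained encoder ITCH actually uses."""
    def __init__(self, vocab=30522, dim=512, depth=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, tokens):                 # (B, seq_len)
        return self.encoder(self.embed(tokens))  # token-level text features

# Bi-modal input -> two feature streams, later aligned and fused for summarization.
img_feats = PatchEncoder()(torch.randn(2, 3, 224, 224))
txt_feats = TextEncoder()(torch.randint(0, 30522, (2, 64)))
```

Embedding patches with a strided convolution keeps the visual features token-like, which makes them directly comparable with the text stream during alignment.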
“…The samples in R⁰ are regarded as negative examples. As such, we follow (Lin et al. 2022) to define the pairwise objective function L₁(x_a, x̄_a), a ∈ {t, v}, between an anchor sample and its positive or negative samples. The final fully-supervised intra-modal contrastive loss is as follows:…”
Section: Rumor Detection with Contrastive Learning (mentioning)
confidence: 99%
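This excerpt references a pairwise objective L₁ between an anchor and its positive or negative samples within a single modality a ∈ {t, v}. Below is a minimal sketch of one such pairwise intra-modal contrastive term, assuming an InfoNCE-style form with cosine similarity and a temperature τ; the exact loss in the cited work may differ.

```python
import torch
import torch.nn.functional as F

def pairwise_contrastive(anchor, positives, negatives, tau=0.07):
    """Pairwise objective L1(x_a, .) for one modality a in {t, v}:
    pull the anchor toward positives, push it from negatives.
    InfoNCE-style form; an assumption, not the cited paper's exact loss."""
    a = F.normalize(anchor, dim=-1)            # (D,)
    pos = F.normalize(positives, dim=-1)       # (P, D)
    neg = F.normalize(negatives, dim=-1)       # (N, D)
    pos_sim = pos @ a / tau                    # (P,) anchor-positive scores
    neg_sim = neg @ a / tau                    # (N,) anchor-negative scores
    # negative log of the probability mass assigned to the positive pairs
    all_sim = torch.cat([pos_sim, neg_sim])
    return -(torch.logsumexp(pos_sim, 0) - torch.logsumexp(all_sim, 0))

loss = pairwise_contrastive(torch.randn(512),
                            torch.randn(4, 512),   # positives
                            torch.randn(16, 512))  # negatives drawn from R⁰
```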
“…Therefore, cross-modal CL is applied to MSA so that paired image-text data are pulled as close together as possible in the feature space, while unpaired image-text data are pushed as far apart as possible. Realizing semantic interaction and association between images and texts at different levels is one of the future development directions of MSA [174].…”
Section: B. Future Trends (mentioning)
confidence: 99%
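The trend described here, pulling paired image-text embeddings together and pushing unpaired ones apart, is commonly realized as a symmetric cross-modal InfoNCE loss (the objective popularized by CLIP). A minimal sketch under that assumption, not drawn from the cited survey:

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive(img_emb, txt_emb, tau=0.07):
    """Symmetric cross-modal contrastive loss over a batch: matched
    (i, i) image-text pairs are pulled together, all mismatched
    (i, j) pairs pushed apart. CLIP-style form, assumed here."""
    img = F.normalize(img_emb, dim=-1)          # (B, D)
    txt = F.normalize(txt_emb, dim=-1)          # (B, D)
    logits = img @ txt.t() / tau                # (B, B) pairwise similarities
    targets = torch.arange(img.size(0))         # diagonal holds the true pairs
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = cross_modal_contrastive(torch.randn(8, 512), torch.randn(8, 512))
```

Averaging the image-to-text and text-to-image directions keeps the objective symmetric, so neither modality dominates the alignment.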