2020
DOI: 10.1109/access.2020.2996407

Adversarial Attentive Multi-Modal Embedding Learning for Image-Text Matching

Abstract: Matching images and text with deep models has been extensively studied in recent years. Mining the correlation between image and text to learn effective multi-modal features is crucial for image-text matching. However, most existing approaches model the different types of correlation independently. In this work, we propose a novel model named Adversarial Attentive Multi-modal Embedding Learning (AAMEL) for image-text matching. It combines adversarial networks and an attention mechanism to learn effective and r…
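The abstract above is truncated, but it indicates that AAMEL couples an attention mechanism with adversarial training to learn a shared image-text embedding space. As a rough illustration only, and not the authors' implementation, the sketch below shows one common way such a combination is wired in PyTorch: attention-pooled features from each modality are projected into a joint space, while a modality discriminator trained through a gradient-reversal layer pushes the two embedding distributions to become indistinguishable. All module names, dimensions, and the gradient-reversal trick are assumptions made for this sketch.

# Hypothetical sketch of attention-pooled embeddings with an adversarial
# modality discriminator; NOT the AAMEL code, just an illustration.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass, reversed (scaled) gradient on backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class AttentivePool(nn.Module):
    """Soft attention over a set of region/word features -> single vector."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):                 # feats: (batch, n, dim)
        weights = torch.softmax(self.score(feats), dim=1)
        return (weights * feats).sum(dim=1)   # (batch, dim)


class AdversarialEmbedder(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=300, emb_dim=512):
        super().__init__()
        self.img_pool = AttentivePool(img_dim)
        self.txt_pool = AttentivePool(txt_dim)
        self.img_proj = nn.Linear(img_dim, emb_dim)
        self.txt_proj = nn.Linear(txt_dim, emb_dim)
        # Discriminator tries to tell image embeddings from text embeddings.
        self.disc = nn.Sequential(nn.Linear(emb_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, img_regions, txt_words, lambd=0.1):
        v = nn.functional.normalize(self.img_proj(self.img_pool(img_regions)), dim=-1)
        t = nn.functional.normalize(self.txt_proj(self.txt_pool(txt_words)), dim=-1)
        # Adversarial branch: gradient reversal makes the encoders fool the discriminator.
        logits_v = self.disc(GradReverse.apply(v, lambd))
        logits_t = self.disc(GradReverse.apply(t, lambd))
        return v, t, logits_v, logits_t


if __name__ == "__main__":
    model = AdversarialEmbedder()
    v, t, lv, lt = model(torch.randn(4, 36, 2048), torch.randn(4, 20, 300))
    # A matching loss (e.g., triplet ranking on v/t similarities) would be combined
    # with a binary cross-entropy loss on lv/lt for the modality discriminator.
    print(v.shape, t.shape, lv.shape, lt.shape)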

Cited by 12 publications (5 citation statements)
References 51 publications (67 reference statements)
“…We compare our TERAN method against the following baselines: JGCAR [55], SAN [21], VSE++ [9], SMAN [22], M3A-Net [20], AAMEL [57], MRNN [24], SCAN [28], SAEM [59], CASC [60], MMCA [58], VSRN [30], PFAN [56], Full-IMRAM [4], and CAMERA [46]. We clustered these methods based on the visual feature extractor they use: VGG, ResNet, or Region CNN (e.g., Faster-RCNN).…”
Section: Results
confidence: 99%
“…Recent works [4,13,15,28,36,37,56] exploit the availability of pre-computed region-level features extracted from the Faster-RCNN [47] object detector. An alternative consists in using the feature maps output by ResNets, without aggregating them, to compute fine-grained attentions over the sentences [11,18,22,55,57,60].…”
Section: Image-Text Processing for Cross-Modal Retrieval
confidence: 99%
“…We choose the latest works from the past two years as baseline methods for comparison with our Global Relation-aware Attention Network (GRAN), including SCAN [19], ACMNet [5], CASC [6], DP-RNN [9], MMCA [39], CAAN [44], IMRAM [7], AAMEL [38], SMAN [16], and M3A-Net [15], which use cross-related methods; SGM [35] and Guo et al. [12], which use GCN [18]; Polynomial Loss [37], AMF [26], and Chen et al. [8], which introduce new losses; and TERAN [24], which uses a transformer. R@K (K = 1, 5, 10) is adopted to evaluate the cross-modal retrieval performance of all methods.…”
Section: Baseline Methods and Evaluation Metrics
confidence: 99%
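The R@K metric quoted above (Recall at K, with K = 1, 5, 10) is the standard measure in this retrieval literature: the fraction of queries whose ground-truth match appears among the top K retrieved items. A minimal sketch of how it can be computed from a query-by-gallery similarity matrix follows; the function name and the assumption of exactly one ground-truth item per query (placed on the diagonal) are illustrative and not taken from any of the cited papers.

# Hypothetical Recall@K helper over a query-by-gallery similarity matrix,
# assuming the ground-truth match for query i sits at gallery index i.
import numpy as np

def recall_at_k(similarity: np.ndarray, ks=(1, 5, 10)) -> dict:
    n_queries = similarity.shape[0]
    # Rank gallery items for each query from most to least similar.
    ranking = np.argsort(-similarity, axis=1)
    # Position of the ground-truth item (index i) in each query's ranking.
    gt_rank = np.array([np.where(ranking[i] == i)[0][0] for i in range(n_queries)])
    return {f"R@{k}": float(np.mean(gt_rank < k)) for k in ks}

# Example: a random 100x100 image-to-text similarity matrix.
sims = np.random.randn(100, 100)
print(recall_at_k(sims))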
“…These methods can be roughly categorized into three groups: methods that use network fusion (concatenation) [1,2,3], methods that use gating [4], and methods that use cross-modal training [5]. For time series, it is especially common to combine text with images in document recognition [3,4], natural scene image recognition [1], and cross-modal retrieval [6,7]. Combining audio with video is another common use for multi-modal networks [2,8,9].…”
Section: Related Work
confidence: 99%