2019
DOI: 10.1609/aaai.v33i01.33018489

Few-Shot Image and Sentence Matching via Gated Visual-Semantic Embedding

Abstract: Although image and sentence matching has been widely studied, its intrinsic few-shot problem is commonly ignored, which has become a bottleneck for further performance improvement. In this work, we focus on this challenging problem of few-shot image and sentence matching, and propose a Gated Visual-Semantic Embedding (GVSE) model to deal with it. The model consists of three cooperative modules: uncommon VSE, common VSE, and gated metric fusion. The uncommon VSE exploits external auxiliary resources …
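The abstract describes two embedding branches whose per-pair similarity scores are combined by a gate. As a rough illustration only, the following is a minimal PyTorch sketch of such a gated metric fusion; all module names, dimensions, and the gating formulation are assumptions based on the truncated abstract, not the authors' implementation.

```python
# Minimal sketch (PyTorch) of gated metric fusion over two VSE branches.
# All names, dimensions, and the gating formulation are assumptions drawn
# from the truncated abstract, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMetricFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # The gate decides, per pair, how much to trust each embedding space.
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, img_common, txt_common, img_uncommon, txt_uncommon):
        # Similarity in the common embedding space.
        s_common = F.cosine_similarity(img_common, txt_common, dim=-1)
        # Similarity in the uncommon (auxiliary-resource) space.
        s_uncommon = F.cosine_similarity(img_uncommon, txt_uncommon, dim=-1)
        # Gate conditioned on both image representations (an assumption).
        g = torch.sigmoid(
            self.gate(torch.cat([img_common, img_uncommon], dim=-1))
        ).squeeze(-1)
        # Convex combination of the two metrics.
        return g * s_common + (1.0 - g) * s_uncommon

# Usage: scores for a batch of 8 image-sentence pairs with 512-d embeddings.
fusion = GatedMetricFusion(dim=512)
img_c, txt_c, img_u, txt_u = (torch.randn(8, 512) for _ in range(4))
scores = fusion(img_c, txt_c, img_u, txt_u)  # shape: (8,)
```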

Cited by 23 publications (12 citation statements); references 16 publications.

Citation statements, ordered by relevance:
“…Recent studies on image-caption retrieval have employed cross-modal attentions to pay attention to concepts shared by a query and a target [11], [12], [16]-[19]. Cross-modal attentions are performed at an early phase over multiple cropped regions of an image and words of a caption.…”
Section: Contribution of Proposed Methods
Citation type: mentioning; confidence: 99%
“…Cutting-edge methods for image-caption retrieval sometimes employ an object detector and a cross-modal attention [11], [12], [16]-[19]. A pretrained object detector crops multiple subregions in an image, and then a cross-modal attention is performed over the cropped regions and words in a caption.…”
Section: Cross-modal Attention
Citation type: mentioning; confidence: 99%
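The statement above outlines the common pipeline: detector-cropped region features, attention from caption words to regions, then a pooled similarity. Here is a minimal PyTorch sketch of that idea; the shapes, scoring function, and pooling step are illustrative assumptions, not the exact formulation of any of the cited papers.

```python
# Minimal sketch (PyTorch) of cross-modal attention: each caption word
# attends over region features from a pretrained object detector.
# Shapes, scoring, and pooling are illustrative assumptions only.
import torch
import torch.nn.functional as F

def cross_modal_attention(regions, words):
    # regions: (n_regions, d) detector features for cropped subregions
    # words:   (n_words, d)   word embeddings of the caption
    attn = F.softmax(words @ regions.t(), dim=-1)  # (n_words, n_regions)
    attended = attn @ regions                      # region summary per word
    # Word-level similarities, averaged into one image-caption score.
    sims = F.cosine_similarity(words, attended, dim=-1)
    return sims.mean()

regions = torch.randn(36, 256)  # e.g., 36 detected regions
words = torch.randn(12, 256)    # e.g., a 12-word caption
score = cross_modal_attention(regions, words)
```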
“…Existing work has focused on the matching between text words and image entities, and retrieves them by capturing the co-occurrence relationship between certain words in a text and certain entities in an image [7]. However, when the text is long with multiple sentences or even paragraphs (e.g.…”
Section: Introduction
Citation type: mentioning; confidence: 99%
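The co-occurrence matching the statement refers to can be made concrete with a toy example: score a caption against detected entity labels by lexical overlap. This is purely illustrative and not the method of the cited work.

```python
# Toy sketch of co-occurrence-based word-entity matching: score a caption
# against detected entity labels by lexical overlap. Purely illustrative;
# not the method of any cited paper.
def cooccurrence_score(caption: str, entity_labels: list[str]) -> float:
    words = set(caption.lower().split())
    entities = set(label.lower() for label in entity_labels)
    if not entities:
        return 0.0
    return len(words & entities) / len(entities)

# Usage: detector labels vs. a candidate caption.
score = cooccurrence_score("a dog chases a ball in the park",
                           ["dog", "ball", "tree"])
print(score)  # 0.666... (two of three detected entities are mentioned)
```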