“…As a fundamental task in vision-language understanding (Xu et al., 2021b; Park et al., 2022a; Miyawaki et al., 2022; Fang et al., 2023a,b; Kim et al., 2023; Jian and Wang, 2023), video-text retrieval (VTR) (Luo et al., 2022; Gao et al., 2021b; Ma et al., 2022a; Liu et al., 2022a; Zhao et al., 2022; Gorti et al., 2022; Fang et al., 2022) has attracted interest from both academia and industry. Although recent years have witnessed the rapid development of VTR, supported by powerful pretraining models (Luo et al., 2022; Gao et al., 2021b; Ma et al., 2022a; Liu et al., 2022a), improved retrieval methods (Bertasius et al., 2021; Dong et al., 2019), and the construction of video-language datasets (Xu et al., 2016), precisely matching video and language remains challenging because the raw data lie in heterogeneous spaces with significant differences.…”