Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop 2022
DOI: 10.18653/v1/2022.acl-srw.34
Scene-Text Aware Image and Text Retrieval with Dual-Encoder

Cited by 9 publications (4 citation statements). References 0 publications.
“…Additionally, Yin et al.'s [27] Convolutional Auto-Encoder (CAE) model establishes meaningful correlations among high-level semantic relationships to improve the accuracy of image-text retrieval in multimodal settings. Meanwhile, Miyawaki et al.'s [28] dual-encoder model integrates image visual and text semantics into a shared semantic space for efficient offline inference.…”
Section: Auto-encoder
Confidence: 99%
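The dual-encoder design cited above can be sketched briefly: two independent encoders project images and text into a shared space, so the image gallery can be embedded once offline and a text query reduces retrieval to a single matrix product. The sketch below is illustrative only, with random placeholder weights and hypothetical feature dimensions, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_TXT, D_SHARED = 512, 300, 128  # hypothetical feature sizes

# Placeholder projection weights for the two independent encoders.
W_img = rng.standard_normal((D_IMG, D_SHARED)) / np.sqrt(D_IMG)
W_txt = rng.standard_normal((D_TXT, D_SHARED)) / np.sqrt(D_TXT)

def encode(features, W):
    """Project features into the shared space and L2-normalize."""
    z = features @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Offline: embed the whole image gallery once.
gallery = rng.standard_normal((1000, D_IMG))   # toy image features
gallery_emb = encode(gallery, W_img)           # shape (1000, 128)

# Online: embed the text query and rank gallery images by cosine similarity.
query = rng.standard_normal((1, D_TXT))        # toy text features
query_emb = encode(query, W_txt)               # shape (1, 128)
scores = query_emb @ gallery_emb.T             # shape (1, 1000)
top5 = np.argsort(-scores[0])[:5]              # indices of the 5 best matches
```

Because both embeddings are L2-normalized, the dot product equals cosine similarity, which is why the offline-precomputed gallery embeddings make query-time retrieval cheap.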
“…This allows readers to efficiently compare research methodologies and dataset types within the same category, as presented in Table 2 and Table 3.…”

Dual-Encoder [28]: images and text are encoded independently, ensuring separate representations (TextCaps)
Two-stage learning [25]: preserves semantic features and information through a two-stage process (WIKI, MIRFLICKR, NUS-WIDE)
CNN end-to-end DCCA [31]: end-to-end network framework (Flickr8K, Flickr30K, IAPR TC-12)
Identity-aware two-stage [37]: attention mechanism for identity perception (CUHK-PEDES, CUB, Flower)
M-CNN [29]: end-to-end network framework (Flickr8K, Flickr30K)

Section: Datasets
Confidence: 99%
“…As a fundamental task in visual-language understanding (Xu et al., 2021; Park et al., 2022; Miyawaki et al., 2022), video-text retrieval (VTR) (Luo et al., 2022; Gao et al., 2021; Liu et al., 2022a; Zhao et al., 2022; Gorti et al., 2022) has attracted interest from academia and industry. Although recent years have witnessed the rapid development of VTR, supported by powerful pretraining models (Luo et al., 2022; Gao et al., 2021; Liu et al., 2022a), improved retrieval methods (Bertasius et al., 2021; Dong et al., 2019), and video-language dataset construction (Xu et al., 2016), it remains challenging to precisely match video and language because the raw data lie in heterogeneous spaces with significant differences.…”
Section: Introduction
Confidence: 99%
“…As a fundamental task in visual-language understanding (Xu et al., 2021b; Park et al., 2022a; Miyawaki et al., 2022; Fang et al., 2023a,b; Kim et al., 2023; Jian and Wang, 2023), video-text retrieval (VTR) (Luo et al., 2022; Gao et al., 2021b; Ma et al., 2022a; Liu et al., 2022a; Zhao et al., 2022; Gorti et al., 2022; Fang et al., 2022) has attracted interest from academia and industry. Although recent years have witnessed the rapid development of VTR, supported by powerful pretraining models (Luo et al., 2022; Gao et al., 2021b; Ma et al., 2022a; Liu et al., 2022a), improved retrieval methods (Bertasius et al., 2021; Dong et al., 2019), and video-language dataset construction (Xu et al., 2016), it remains challenging to precisely match video and language because the raw data lie in heterogeneous spaces with significant differences.…”
Section: Introduction
Confidence: 99%