“…For the MS-COCO dataset, some of the textual descriptions are too ambiguous, which may make the model insensitive to the corresponding features and lead to poorer results. Even so, the results are comparable to those of other models such as CPGN [15], CAMP [10], and PVSE [12].…”
“…Although this approach can extract high-level semantic information, it places no emphasis on the information-mixing process and does not work well for local matching. Subsequently, [5]–[23] focus on extracting local features from images and text and combine them with attention mechanisms to achieve local alignment.…”
Section: Introduction (mentioning, confidence: 99%)
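The attention-based local alignment mentioned in the snippet above typically matches image regions against caption words. Below is a minimal sketch of one common variant (text-to-image cross attention in the style of SCAN-like methods); the tensor shapes, the temperature value, and the function name are illustrative assumptions, not the cited papers' exact formulation:

```python
import torch
import torch.nn.functional as F

def local_alignment_similarity(regions, words, temperature=9.0):
    """Cross-attention similarity between image regions and caption words.

    regions: (n_regions, d) L2-normalized region features
    words:   (n_words, d)   L2-normalized word features
    Returns a scalar image-text similarity (illustrative sketch).
    """
    # Word-region affinity matrix: (n_words, n_regions).
    attn = torch.matmul(words, regions.t())
    # Each word attends to the regions it matches best.
    attn = F.softmax(temperature * attn, dim=1)
    # Attended visual context per word: (n_words, d).
    context = torch.matmul(attn, regions)
    # Cosine similarity between each word and its visual context,
    # averaged over words to give one image-text score.
    word_scores = F.cosine_similarity(words, context, dim=1)
    return word_scores.mean()
```

The image-to-text direction works symmetrically, with regions attending over words instead.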
“…However, region features extracted from the image are relatively independent, with no contextual semantic information to anchor them, so their semantics inevitably drift during the subsequent extraction of region features. Therefore, [11,15] and other approaches propose fusing the local and global information of the data. [19] uses global features in another form and divides the similarity calculation into three levels (local, global, and relationship) to measure data similarity at multiple levels.…”
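The three-level scheme attributed to [19] can be pictured as computing one score per level and fusing them. The sketch below is a guess at how such a fusion might look; the weighting scheme, feature layout, and per-level scoring are hypothetical, not taken from [19]:

```python
import torch
import torch.nn.functional as F

def multi_level_similarity(img, txt, weights=(0.4, 0.4, 0.2)):
    """Fuse local-, global-, and relationship-level similarity scores.

    img / txt are dicts of precomputed features (shapes are assumptions):
      'local':  (n, d) region / word features
      'global': (d,)   pooled whole-image / whole-sentence feature
      'rel':    (m, d) relation features, assumed aligned one-to-one
    """
    # Local level: each word's best-matching region, averaged over words.
    local = torch.matmul(
        F.normalize(txt['local'], dim=-1),
        F.normalize(img['local'], dim=-1).t(),
    ).max(dim=1).values.mean()

    # Global level: cosine similarity of the pooled representations.
    glob = F.cosine_similarity(img['global'], txt['global'], dim=0)

    # Relationship level: mean similarity over aligned relation pairs.
    rel = F.cosine_similarity(img['rel'], txt['rel'], dim=-1).mean()

    w_l, w_g, w_r = weights
    return w_l * local + w_g * glob + w_r * rel
```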
Image-text retrieval has made great progress, but it remains challenging because of the heterogeneity between images and text. Enhancing the interaction between the two modalities by exploring their relationship can mitigate this problem to some extent, so how to explore and use that relationship is a critical question. In this paper, we design an asymmetric-structure network (RGN) to represent images and text. First, we mine the relationship between image and text and extract the relevant textual information. Then we exploit this relationship to guide the generation of text embeddings, yielding rich and representative embeddings. Results on two datasets, Flickr30K and MS-COCO, show that our model achieves competitive results.
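The abstract does not spell out RGN's architecture, but one plausible reading of "exploit this relationship to guide the generation of text embeddings" is an image-conditioned gating over word features before pooling. The sketch below is our own guess at such a mechanism; the class, its dimensions, and the gating form are all hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationGuidedTextEncoder(nn.Module):
    """Hypothetical sketch: use image-word affinities to re-weight
    word features before pooling them into a sentence embedding."""

    def __init__(self, dim=1024):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, words, image_global):
        # words: (n_words, dim), image_global: (dim,)
        # Relation mining: affinity of each word to the image.
        affinity = torch.mv(self.proj(words), image_global)  # (n_words,)
        gate = torch.sigmoid(affinity).unsqueeze(1)          # (n_words, 1)
        # Relation-guided embedding: image-relevant words dominate.
        sentence = (gate * words).sum(dim=0) / gate.sum()
        return F.normalize(sentence, dim=0)
```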
“…The core idea of deep-semantics-based cross-media search methods is to learn complex high-level features with deep learning in order to improve feature learning. Zhang et al. [22] embed images and text into a latent common space using a Residual Network (ResNet) to learn global features for sentence-generation learning. Peng and Qi [23] used bidirectional translation training to directly convert bidirectional pairs between visual and textual descriptions, capturing cross-media correlations.…”
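The common-space approach mentioned for Zhang et al. [22] amounts to projecting both modalities into one embedding space where similarity can be compared directly. A minimal, generic sketch follows; the feature dimensions and module names are assumptions, not [22]'s actual design:

```python
import torch.nn as nn
import torch.nn.functional as F

class CommonSpaceProjector(nn.Module):
    """Map image features (e.g., from a ResNet) and text features
    into one shared latent space."""

    def __init__(self, img_dim=2048, txt_dim=768, latent_dim=512):
        super().__init__()
        self.img_fc = nn.Linear(img_dim, latent_dim)
        self.txt_fc = nn.Linear(txt_dim, latent_dim)

    def forward(self, img_feat, txt_feat):
        # L2-normalize so cosine similarity reduces to a dot product.
        z_img = F.normalize(self.img_fc(img_feat), dim=-1)
        z_txt = F.normalize(self.txt_fc(txt_feat), dim=-1)
        return z_img, z_txt
```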
The rapid development of social networks has brought great convenience to people's lives, and a large amount of cross-media big data, such as text, images, and video, has accumulated. Cross-media search enables quick querying of this information so that users can obtain helpful content from social networks. However, cross-media data in social networks suffer from semantic gaps and sparsity, which makes cross-media search challenging. To alleviate the semantic gaps and sparsity, we propose a cross-media search method based on complementary attention and generative adversarial networks (CAGS). To obtain high-quality feature representations, we build a complementary attention mechanism containing the focused and unfocused features of images to realize the consistent association of cross-media data in social networks. By designing the cross-media…
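The "focused and unfocused" image features in the CAGS abstract suggest a pair of attention maps over regions, one emphasizing what attention selects and one its complement. The sketch below is one way such a mechanism could be realized; it is our assumption, not the paper's published design:

```python
import torch.nn as nn
import torch.nn.functional as F

class ComplementaryAttention(nn.Module):
    """Hypothetical sketch: summarize region features into a 'focused'
    vector and an 'unfocused' complement of what attention ignored."""

    def __init__(self, dim=2048):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, regions):
        # regions: (n_regions, dim)
        s = self.score(regions)                  # (n_regions, 1)
        alpha = F.softmax(s, dim=0)              # attended regions
        focused = (alpha * regions).sum(dim=0)   # (dim,)
        # Complementary weights emphasize low-scoring regions.
        beta = F.softmax(-s, dim=0)
        unfocused = (beta * regions).sum(dim=0)
        return focused, unfocused
```

Combining both summaries lets downstream matching see image content that a single attention pass would suppress.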