2020 IEEE Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/wacv45572.2020.9093614
Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval

Cited by 187 publications (112 citation statements)
References 27 publications
“…Our similarity function is adopted from [32], in which we match the object and relation matrices of a text to those of an image, respectively. Regarding the object matrices, we take the score of the most relevant object in M_I^O with each of the objects in M_Q^O.…”
Section: Similarity Score (mentioning)
confidence: 99%
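The object-matching step described in the statement above — scoring each query object against its most relevant image object — can be sketched as a max-over-objects cosine similarity. This is a minimal illustration, not the cited paper's implementation; the matrix names `m_i_o` and `m_q_o` mirror the M_I^O and M_Q^O notation, and the final averaging step is an assumption.

```python
import numpy as np

def object_matching_score(m_i_o: np.ndarray, m_q_o: np.ndarray) -> float:
    """Hypothetical sketch of the object-level similarity described above.

    m_i_o: image object matrix, shape (n_image_objects, d)
    m_q_o: query/text object matrix, shape (n_query_objects, d)

    For each query object, take the score of its most relevant image
    object (highest cosine similarity), then pool over query objects
    (averaging here, as an assumption).
    """
    # L2-normalize rows so plain dot products become cosine similarities.
    img = m_i_o / np.linalg.norm(m_i_o, axis=1, keepdims=True)
    qry = m_q_o / np.linalg.norm(m_q_o, axis=1, keepdims=True)
    sims = qry @ img.T                  # (n_query, n_image) cosine scores
    best_per_query = sims.max(axis=1)   # most relevant image object per query object
    return float(best_per_query.mean())
```

With identical object sets the score is 1.0; a query object lying between two image objects scores the cosine to the nearer one, so the measure rewards each text object finding at least one well-aligned image region.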
“…Context information plays a pivotal role in understanding a sentence for many natural language processing tasks, such as neural machine translation [4,32], text summarization [29], and question answering [27]. Analogously, visual contextual relationships can contribute to obtaining fine-grained image region representations, which would benefit various tasks including image captioning [41], VQA [7], and image-text matching [17,31,35,40]. To exploit visual and textual context and capture implicit relations among intra-modal fragments, researchers have presented some structured models for different multi-modal tasks.…”
Section: Intra-modal Context Modeling (mentioning)
confidence: 99%
“…In the field of image-text matching, Li et al. [17] performed local-global semantic reasoning using a Graph Convolutional Network (GCN) and a Gated Recurrent Unit. To learn comprehensive representations, Wang et al. [35] and Shi et al. [31] refined visual relationships by leveraging external scene graphs [13]. Wu et al. [40] considered fragment relations in images and texts to obtain self-attention embeddings, achieving promising intra-modal context modeling.…”
Section: Intra-modal Context Modeling (mentioning)
confidence: 99%
“…Given a set of candidate images or videos and a natural-language query, the goal of image or video retrieval is to select the image or video that best matches the query. For image retrieval, Wang et al. [37] proposed using graph structures to model the relationships among objects in an image as well as the text, focusing on mining the alignment between image and text. Lee et al. [38] applied attention mechanisms to image retrieval: they first extract image and sentence features, then apply attention to each region-word pair before computing similarity, using attention to achieve more accurate alignment.…”
Section: Query-text-based Image or Video Retrieval (unclassified)