2020 IEEE Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/wacv45572.2020.9093614
Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval

Cited by 187 publications (112 citation statements)
References 27 publications
“…Our similarity function is adopted from [32], in which we match the object and relation matrices of a text to those of an image, respectively. Regarding the object matrices, we take the score of the most relevant object in M_I^O with each of the objects in M_Q^O.…”
Section: Similarity Score (mentioning)
confidence: 99%
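The object-matching step described in the statement above — scoring each query object against its most relevant image object — can be sketched as a max-over-objects cosine similarity. This is a minimal illustration, not the cited paper's implementation; the matrix names `m_i_o` and `m_q_o` mirror the M_I^O and M_Q^O notation, and the final averaging step is an assumption.

```python
import numpy as np

def object_matching_score(m_i_o: np.ndarray, m_q_o: np.ndarray) -> float:
    """Hypothetical sketch of the object-level similarity described above.

    m_i_o: image object matrix, shape (n_image_objects, d)
    m_q_o: query/text object matrix, shape (n_query_objects, d)

    For each query object, take the score of its most relevant image
    object (highest cosine similarity), then pool over query objects
    (averaging here, as an assumption).
    """
    # L2-normalize rows so plain dot products become cosine similarities.
    img = m_i_o / np.linalg.norm(m_i_o, axis=1, keepdims=True)
    qry = m_q_o / np.linalg.norm(m_q_o, axis=1, keepdims=True)
    sims = qry @ img.T                  # (n_query, n_image) cosine scores
    best_per_query = sims.max(axis=1)   # most relevant image object per query object
    return float(best_per_query.mean())
```

With identical object sets the score is 1.0; a query object lying between two image objects scores the cosine to the nearer one, so the measure rewards each text object finding at least one well-aligned image region.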
“…Context information plays a pivotal role in understanding a sentence for many natural language processing tasks, such as neural machine translation [4,32], text summarization [29], and question answering [27]. Analogously, visual contextual relationships can contribute to obtaining fine-grained image region representations, which would benefit various tasks including image captioning [41], VQA [7], and image-text matching [17,31,35,40]. To exploit visual and textual context and capture implicit relations among intra-modal fragments, researchers have presented some structured models for different multi-modal tasks.…”
Section: Intra-modal Context Modeling (mentioning)
confidence: 99%
“…In the field of image-text matching, Li et al. [17] performed local-global semantic reasoning using a Graph Convolutional Network (GCN) and a Gated Recurrent Unit. To learn comprehensive representations, Wang et al. [35] and Shi et al. [31] refined visual relationships by leveraging external scene graphs [13]. Wu et al. [40] considered fragment relations in images and texts to obtain self-attention embeddings, achieving promising intra-modal context modeling.…”
Section: Intra-modal Context Modeling (mentioning)
confidence: 99%
“…Given a set of candidate images or videos and a natural-language query, the goal of image or video retrieval is to select the image or video that best matches the query. For image retrieval, Wang et al. [37] proposed using graph structures to model the relationships among objects in an image as well as the text, focusing on mining the alignment between image and text. Lee et al. [38] applied attention mechanisms to image retrieval: they first extract image and sentence features, then apply attention to each region-word pair before computing similarity, using attention to achieve more accurate alignment.…”
Section: Query-text-based Image or Video Retrieval (unclassified)