2023
DOI: 10.48550/arxiv.2302.04607
Preprint

Deep Intra-Image Contrastive Learning for Weakly Supervised One-Step Person Search

Abstract: Weakly supervised person search aims to perform joint pedestrian detection and re-identification (re-id) with only person bounding-box annotations. Recently, the idea of contrastive learning has been applied to weakly supervised person search, where two common contrast strategies are memory-based contrast and intra-image contrast. We argue that current intra-image contrast is shallow and suffers from spatial-level and occlusion-level variance. In this paper, we present a novel deep intra-image contrastive…
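The truncated abstract contrasts memory-based and intra-image contrast strategies. As a rough illustration of the general intra-image contrast idea only (the abstract does not specify the paper's actual objective), the following is a minimal InfoNCE-style sketch over person-box features from one image; the function name, feature shapes, and temperature value are all assumed for illustration.

```python
import torch
import torch.nn.functional as F

def infonce_intra_image(view_a, view_b, temperature=0.1):
    """Generic InfoNCE loss between two augmented views of the same
    person-box features from one image, each of shape (N, D).
    Positive pairs are (a_i, b_i); the other boxes act as negatives.
    A minimal sketch of the intra-image contrast idea, not the
    paper's actual loss.
    """
    a = F.normalize(view_a, dim=1)                    # (N, D), unit-norm
    b = F.normalize(view_b, dim=1)
    logits = a @ b.t() / temperature                  # (N, N) similarity matrix
    targets = torch.arange(len(a), device=a.device)   # positives on the diagonal
    return F.cross_entropy(logits, targets)

# Usage: features of N detected persons under two augmentations
# (hypothetical shapes; a real pipeline would take them from the detector's RoI head).
feats_a = torch.randn(5, 256)
feats_b = torch.randn(5, 256)
loss = infonce_intra_image(feats_a, feats_b)
```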

Cited by 1 publication (1 citation statement)
References 47 publications (82 reference statements)
“…To begin with, for feature representation of both 2D images and 3D models, a better backbone is always encouraged, which draws our attention to the recently popular vision transformer (ViT). It has proved to be a success in many related computer vision and natural language processing (NLP) tasks such as video event detection [16], pedestrian detection [17], person search [18,19], and text classification [20]. ViT takes image patches or word embeddings as a sequence of tokens and applies the self-attention mechanism to capture their internal relationships, thus obtaining strong feature representations for downstream tasks.…”
Section: Introduction
confidence: 99%
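The citation statement above summarizes the ViT recipe: split the image into patches, embed each patch as a token, and let self-attention model relations between tokens. A minimal PyTorch sketch of that pipeline (patch embedding followed by one self-attention layer) is shown below; the class name and all sizes are arbitrary assumptions, not any cited model.

```python
import torch
import torch.nn as nn

class TinyViTBlock(nn.Module):
    """Minimal illustration of the ViT idea described above:
    image patches become a sequence of tokens, and multi-head
    self-attention captures the relationships between them.
    Sizes are arbitrary; this is a sketch, not a cited architecture.
    """
    def __init__(self, patch=16, dim=192, heads=3):
        super().__init__()
        # Non-overlapping patch embedding: one token per 16x16 patch.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, images):                        # (B, 3, H, W)
        tokens = self.patch_embed(images)             # (B, dim, H/16, W/16)
        tokens = tokens.flatten(2).transpose(1, 2)    # (B, num_patches, dim)
        attended, _ = self.attn(tokens, tokens, tokens)
        return self.norm(tokens + attended)           # residual + layer norm

# Usage with a dummy batch: 224x224 input yields 14*14 = 196 tokens.
out = TinyViTBlock()(torch.randn(2, 3, 224, 224))     # shape (2, 196, 192)
```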