2022
DOI: 10.1007/978-3-031-20497-5_28
Illumination-Guided Transformer-Based Network for Multispectral Pedestrian Detection

Cited by 8 publications
(2 citation statements)
References 24 publications
“…We assessed our innovative approach on three publicly available datasets, and the results show that our method surpassed state-of-the-art methods by a wide margin in terms of mean Average Precision. Owing to its application portability, in future work we will look at applying our model to related tasks such as image segmentation [4], event detection [49], object detection [50], [51], pedestrian detection [52], [53], pedestrian attribute recognition [54], person search [45], [55], 3D model retrieval [56], [57], zero-shot learning [58], and magnetic resonance imaging [59], though the DCMSTRD and MSLD modules will need some adjustment to the requirements of each specific task.…”
Section: Discussion
confidence: 99%
“…To begin with, for feature representation of both 2D images and 3D models, a better backbone is always encouraged, which draws our attention to the recently popular vision transformer (ViT). It has proved a success in many related computer vision and natural language processing (NLP) tasks such as video event detection [16], pedestrian detection [17], person search [18,19], and text classification [20]. ViT takes image patches or word embeddings as a sequence of tokens and applies the self-attention mechanism to capture their internal relationships, thus obtaining strong feature representations for downstream tasks.…”
Section: Introduction
confidence: 99%
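The quoted passage describes the ViT idea of treating image patches as a token sequence processed by self-attention. A minimal NumPy sketch of that mechanism (hypothetical shapes, a single attention head, no positional embeddings or learned projections beyond random matrices):

```python
import numpy as np

def image_to_patch_tokens(image, patch_size):
    # Split an H x W x C image into non-overlapping patches and
    # flatten each patch into one token vector (ViT-style tokenization).
    h, w, c = image.shape
    tokens = []
    for i in range(0, h, patch_size):
        for j in range(0, w, patch_size):
            tokens.append(image[i:i + patch_size, j:j + patch_size, :].reshape(-1))
    return np.stack(tokens)  # shape: (num_patches, patch_size * patch_size * c)

def self_attention(tokens, wq, wk, wv):
    # Single-head scaled dot-product self-attention over the token sequence:
    # every token attends to every other token, capturing pairwise relations.
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v

rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8, 3))
tokens = image_to_patch_tokens(image, patch_size=4)  # 4 patches, 48-dim each
d_model = tokens.shape[-1]
wq, wk, wv = (rng.standard_normal((d_model, 16)) for _ in range(3))
out = self_attention(tokens, wq, wk, wv)
print(tokens.shape, out.shape)  # (4, 48) (4, 16)
```

This is only an illustration of the tokenize-then-attend pattern the citing authors refer to, not the architecture of the surveyed paper, which fuses multispectral (RGB and thermal) features under illumination guidance.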