Improving Deep Visual Representation for Person Re-identification by Global and Local Image-language Association

Chen, Dapeng; Li, Hongsheng; Liu, Xihui; Shen, Yantao; Shao, Jing; Yuan, Zejian; Wang, Xiaogang

doi:10.1007/978-3-030-01270-0_4

Cited by 129 publications

(77 citation statements)

References 64 publications

“…And they only pay attention to one single direction when using the fine-grained matching or attention scheme for representation enhancement, i.e., only using text for weighting different visual components. Chen et al [5] improve visual representations by global and local cross-modal associations. The global image-language association is established according to the identity labels, and the local association focuses on improving the visual representations by phrase reconstruction.…”

Section: Description-based Person Re-identification (mentioning)

confidence: 99%

Improving Description-Based Person Re-Identification by Multi-Granularity Image-Text Alignments

Niu, Huang, Ouyang, et al. 2020

IEEE Trans. on Image Process.

129

Description-based person re-identification (Re-id) is an important task in video surveillance that requires discriminative cross-modal representations to distinguish different people. It is difficult to directly measure the similarity between images and descriptions due to the modality heterogeneity (the cross-modal problem). Moreover, all samples belong to a single category (the fine-grained problem), which makes this task even harder than the conventional image-description matching task. In this paper, we propose a Multi-granularity Image-text Alignments (MIA) model to alleviate the cross-modal fine-grained problem for better similarity evaluation in description-based person Re-id. Specifically, three different granularities, i.e., global-global, global-local and local-local alignments, are carried out hierarchically. Firstly, the global-global alignment in the Global Contrast (GC) module matches the global contexts of images and descriptions. Secondly, the global-local alignment in the Relation-guided Global-local Alignment (RGA) module employs the potential relations between local components and global contexts to highlight the distinguishable components while adaptively eliminating the uninvolved ones. Thirdly, for the local-local alignment, we match visual human parts with noun phrases in the Bi-directional Fine-grained Matching (BFM) module. The whole network combining multiple granularities can be trained end-to-end without complex preprocessing. To address the difficulties in training the combination of multiple granularities, an effective step training strategy is proposed to train these granularities step-by-step. Extensive experiments and analysis have shown that our method obtains state-of-the-art performance on the CUHK-PEDES dataset and outperforms the previous methods by a significant margin.

“…(3) Sentence-aware context object erasing, where we erase a dominant context region, based on the sentence-aware object-level attention weights over context objects. Note that (2) and (3) are two complementary approaches for sentence-aware visual erasing. With training samples generated online by the erasing operation, the model cannot access the most dominant information, and is forced to further discover complementary textual-visual correspondences previously ignored.…”

Section: Introduction (mentioning)

confidence: 99%

Improving Referring Expression Grounding With Cross-Modal Attention-Guided Erasing

Liu, Wang, Shao, et al. 2019

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Self Cite

157

Referring expression grounding aims at locating certain objects or persons in an image with a referring expression, where the key challenge is to comprehend and align various types of information from visual and textual domain, such as visual attributes, location and interactions with surrounding regions. Although the attention mechanism has been successfully applied for cross-modal alignments, previous attention models focus on only the most dominant features of both modalities, and neglect the fact that there could be multiple comprehensive textual-visual correspondences between images and referring expressions. To tackle this issue, we design a novel cross-modal attention-guided erasing approach, where we discard the most dominant information from either textual or visual domains to generate difficult training samples online, and to drive the model to discover complementary textual-visual correspondences. Extensive experiments demonstrate the effectiveness of our proposed method, which achieves state-of-the-art performance on three referring expression grounding datasets.

“…We compared our approach with following nine state-of-the-art (SOTA) approaches: CNN-RNN [44], NeuralTalk [45], GNA-RNN [6], Latent Co-attention [31], PWM + ATH [32], GLA [46], Dual Path [17], CMPM + CMPC [7], and PMA [8].…”

Section: Experiments and Discussion (mentioning)

confidence: 99%

Hybrid Attention Network for Language-Based Person Search

Xiao 2020

Sensors

Language-based person search retrieves images of a target person using natural language description and is a challenging fine-grained cross-modal retrieval task. A novel hybrid attention network is proposed for the task. The network includes the following three aspects: First, a cubic attention mechanism for person image, which combines cross-layer spatial attention and channel attention. It can fully excavate both important midlevel details and key high-level semantics to obtain better discriminative fine-grained feature representation of a person image. Second, a text attention network for language description, which is based on bidirectional LSTM (BiLSTM) and self-attention mechanism. It can better learn the bidirectional semantic dependency and capture the key words of sentences, so as to extract the context information and key semantic features of the language description more effectively and accurately. Third, a cross-modal attention mechanism and a joint loss function for cross-modal learning, which can pay more attention to the relevant parts between text and image features. It can better exploit both the cross-modal and intra-modal correlation and can better solve the problem of cross-modal heterogeneity. Extensive experiments have been conducted on the CUHK-PEDES dataset. Our approach obtains higher performance than state-of-the-art approaches, demonstrating the advantage of the approach we propose.
