The task of image multi-label classification is to accurately recognize multiple objects in an input image. Most of the recent works need to leverage the label co-occurrence matrix counted from training data to construct the graph structure, which are inflexible and may degrade model generalizability. In addition, these methods fail to capture the semantic correlation between the channel feature maps to further improve the model performance. To address these issues, we propose a D ouble A ttention framework based on G raph A ttention ne T work (DA-GAT) to effectively learn the correlation between labels from training data. Firstly, we devise a new channel attention mechanism to enhance the semantic correlation between channel feature maps, so as to implicitly capture the correlation between labels. Secondly, we propose a new label attention mechanism to avoid the adverse impact of manually constructed label co-occurrence matrix. It only needs to leverage the label embedding as the input of network, and then automatically constructs the label relation matrix to explicitly establish the correlation between labels. Finally, we effectively fuse the output of these two attention mechanisms to further improve the model performance. Extensive experiments are conducted on three public multi-label classification benchmarks. Our DA-GAT model achieves the mAPs of 87.1%, 96.6%, and 64.3% on MS-COCO 2014, PASCAL VOC 2007, and NUS-WIDE respectively, and obviously outperforms other existing state-of-the-art methods. In addition, visual analysis experiments demonstrate that each attention mechanism can capture the correlation between labels well, and significantly promote the model performance.
Image multi-label classification task is mainly to correctly predict multiple object categories in the images. To capture the correlation between labels, graph convolution network based methods have to manually count the label co-occurrence probability from training data to construct a pre-defined graph as the input of graph network, which is inflexible and may degrade model generalizability. Moreover, most of the current methods cannot effectively align the learned salient object features with the label concepts, so that the predicted results of model may not be consistent with the image content. Therefore, how to learn the salient semantic features of images and capture the correlation between labels, and then effectively align them is one of the key to improve the performance of image multi-label classification task. To this end, we propose a novel image multi-label classification framework which aims to align I mage S emantics with L abel C oncepts ( ISLC ). Specifically, we propose a residual encoder to learn salient object features in the images, and exploit the self-attention layer in aligned decoder to automatically capture the correlation between labels. Then, we leverage the cross-attention layers in aligned decoder to align image semantic features with label concepts, so as to make the labels predicted by model more consistent with image content. Finally, the output features of the last layer of residual encoder and aligned decoder are fused to obtain the final output feature for classification. The proposed ISLC model achieves good performance on various prevalent multi-label image datasets such as MS-COCO 2014, PASCAL VOC 2007, VG-500 and NUS-WIDE with 87.2%, 96.9%, 39.4% and 64.2% respectively.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.