With the rapid development of deep learning techniques, the performance of object detection has improved significantly. Recently, several approaches to the joint learning of object detection and semantic segmentation have been proposed to exploit the complementary benefits of these two highly correlated tasks. In this work, we propose a weakly-annotated auxiliary multi-label segmentation network that boosts object detection performance without additional computational cost at inference. The proposed auxiliary segmentation network is trained on a weakly-annotated dataset and therefore does not require expensive pixel-level annotations. Unlike previous approaches, we use multi-label segmentation to jointly supervise the auxiliary segmentation and object detection tasks for better occlusion handling. The proposed method can be integrated with any one-stage object detector, such as RetinaNet, YOLOv3, YOLOv4, or SSD. Our experimental results on the MS COCO dataset show that the proposed method improves the performance of popular one-stage object detectors without slowing down inference, even under sub-optimal training sample selection schemes.
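A minimal PyTorch sketch of the training-only auxiliary-head pattern the abstract describes is given below. The backbone, head shapes, and loss target are illustrative assumptions, not the paper's actual architecture; the multi-label head emits one logit map per class, so pixels covered by overlapping (occluded) objects can be active in several class maps at once.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DetectorWithAuxSeg(nn.Module):
    def __init__(self, num_classes=80):
        super().__init__()
        # Tiny stand-in backbone for a one-stage detector (e.g. RetinaNet/SSD).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Placeholder detection head: per-location class logits.
        self.det_head = nn.Conv2d(128, num_classes, 3, padding=1)
        # Auxiliary multi-label segmentation head: one logit map per class.
        self.aux_seg_head = nn.Conv2d(128, num_classes, 1)

    def forward(self, x):
        feats = self.backbone(x)
        det_out = self.det_head(feats)
        if self.training:
            # The auxiliary branch is evaluated only during training.
            return det_out, self.aux_seg_head(feats)
        return det_out  # inference: identical cost to the plain detector

model = DetectorWithAuxSeg()
model.train()
det_out, seg_out = model(torch.randn(2, 3, 256, 256))
# Per-class sigmoid (multi-label) loss against weak, e.g. box-derived, masks;
# the zero tensor is a dummy target just to make the sketch runnable.
seg_loss = F.binary_cross_entropy_with_logits(seg_out, torch.zeros_like(seg_out))

Because the branch is gated on self.training, calling model.eval() drops it from the forward pass, which is how the zero inference-time overhead in this sketch is realized.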
INDEX TERMS: Deep learning, Multi-task learning, Object detection, Semantic segmentation

…detection framework and train the model with multi-task loss functions [9], [10], [44]. RON [9] utilizes the output of an auxiliary task as an attention map to enhance the feature maps for object detection. In [10], a segmentation infusion network is proposed to enable joint supervision of semantic segmentation and pedestrian detection. [44] proposes a set of auxiliary tasks to help improve the accuracy of object detection. In these approaches, the auxiliary branches are removed at the inference stage and therefore the detection speed is not affected. However, there are several drawbacks in the existing methods. Typically, the approaches using joint training of detection and semantic segmentation require expensive pixel-level image annotations. In addition, as in [10], [44],
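For concreteness, the attention-style fusion attributed to RON [9] above can be sketched as an elementwise reweighting of the detection features by the auxiliary output; the shapes and tensor names below are assumptions for illustration, not RON's actual implementation.

import torch

feats = torch.randn(1, 128, 32, 32)         # detection feature maps
aux_out = torch.randn(1, 1, 32, 32)         # auxiliary-task output at the same locations
attended = feats * torch.sigmoid(aux_out)   # soft spatial attention, broadcast over channels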
Transformer-based semantic segmentation methods have achieved excellent performance in recent years. Mask2Former is a well-known transformer-based method that unifies common image segmentation tasks into a universal model. However, because it relies heavily on transformers, it performs relatively poorly at capturing local features and segmenting small objects. To this end, we propose a simple yet effective architecture that introduces auxiliary branches to Mask2Former during training to capture dense local features on the encoder side. The obtained features help the model learn local information and segment small objects. Since the proposed auxiliary convolution layers are required only for training and can be removed at inference, the performance gain comes with no additional computation at inference time. Experimental results show that our model achieves state-of-the-art performance of 57.6% mIoU on ADE20K and 84.8% mIoU on Cityscapes.
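The following PyTorch sketch illustrates training-only auxiliary convolution branches on the encoder side. The toy convolutional stages stand in for Mask2Former's actual backbone and pixel decoder, and all channel counts, class counts, and shapes are assumptions made for illustration.

import torch
import torch.nn as nn

class EncoderWithAuxBranches(nn.Module):
    def __init__(self, channels=(64, 128, 256), num_classes=150):
        super().__init__()
        # Toy multi-stage encoder standing in for the transformer backbone.
        self.stages = nn.ModuleList()
        in_c = 3
        for c in channels:
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_c, c, 3, stride=2, padding=1), nn.ReLU()))
            in_c = c
        # Training-only auxiliary conv branches, one per stage; their coarse
        # per-pixel logits add dense local supervision to the encoder features.
        self.aux_heads = nn.ModuleList(
            nn.Conv2d(c, num_classes, 1) for c in channels)

    def forward(self, x):
        feats, aux_logits = [], []
        for stage, aux in zip(self.stages, self.aux_heads):
            x = stage(x)
            feats.append(x)
            if self.training:
                aux_logits.append(aux(x))  # branch is skipped at inference
        return (feats, aux_logits) if self.training else feats

enc = EncoderWithAuxBranches()
enc.eval()  # after training, eval mode leaves only the plain encoder path
feats = enc(torch.randn(1, 3, 128, 128))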