Deep High-Resolution Representation Learning for Visual Recognition

Wang, Jingdong; Sun, Ke; Cheng, Tianheng; Jiang, Borui; Deng, Chaorui; Zhao, Yang; Liu, Dong; Mu, Yadong; Tan, Mingkui; Wang, Xinggang; Liu, Wenyu; Xiao, Bin

doi:10.1109/tpami.2020.2983686

Cited by 2,163 publications (1,267 citation statements)

References 135 publications

Citation statements classified as mentioning: 1,263

“…Object detection has attracted a great deal of attention in recent years [4,13,14,16,19,20,27,28,30,38,39,43,47,48,56]. One popular direction for recent object detection is proposal-based object detectors (a.k.a.…”

Section: Related Work (mentioning, confidence: 99%)

PCL: Proposal Cluster Learning for Weakly Supervised Object Detection

Tang, Wang, Bai, et al. 2020. IEEE Trans. Pattern Anal. Mach. Intell. (self-citation)

In this paper, we focus on semi-supervised object detection to boost accuracies of proposal-based object detectors (a.k.a. two-stage object detectors) by training on both labeled and unlabeled data. However, it is non-trivial to train object detectors on unlabeled data due to the unavailability of ground truth labels. To address this problem, we present a proposal learning approach to learn proposal features and predictions from both labeled and unlabeled data. The approach consists of a self-supervised proposal learning module and a consistency-based proposal learning module. In the self-supervised proposal learning module, we present a proposal location loss and a contrastive loss to learn context-aware and noise-robust proposal features respectively. In the consistency-based proposal learning module, we apply consistency losses to both bounding box classification and regression predictions of proposals to learn noise-robust proposal features and predictions. Experiments are conducted on the COCO dataset with all available labeled and unlabeled data. Results show that our approach consistently improves the accuracies of fully-supervised baselines. In particular, after combining with data distillation [37], our approach improves AP by about 2.0% and 0.9% on average compared with fully-supervised baselines and data distillation baselines respectively.
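The consistency-based module in the abstract above penalizes disagreement between predictions on a proposal and on a perturbed copy of it, for both classification and box regression. A minimal sketch follows, assuming KL divergence for the class probabilities and smooth L1 for the regression offsets; the abstract does not state the exact loss forms, so these choices, the function names, and the example values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def smooth_l1(a, b, beta=1.0):
    """Smooth L1 distance summed over box-regression offsets."""
    total = 0.0
    for x, y in zip(a, b):
        d = abs(x - y)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total


def consistency_loss(cls_clean, cls_noisy, reg_clean, reg_noisy):
    """Hypothetical per-proposal consistency loss.

    Predictions on the original proposal act as targets for the predictions
    on the perturbed proposal: a classification term plus a regression term.
    """
    p = softmax(cls_clean)
    q = softmax(cls_noisy)
    return kl_divergence(p, q) + smooth_l1(reg_noisy, reg_clean)


# Identical predictions give zero loss; disagreement gives a positive loss.
identical = consistency_loss([2.0, 0.5, -1.0], [2.0, 0.5, -1.0],
                             [0.1, 0.2, 0.0, 0.0], [0.1, 0.2, 0.0, 0.0])
perturbed = consistency_loss([2.0, 0.5, -1.0], [0.5, 2.0, -1.0],
                             [0.1, 0.2, 0.0, 0.0], [0.3, 0.2, 0.0, 0.0])
print(identical, perturbed)
```

In a real detector these terms would be computed on tensors and averaged over proposals; the list-based version only shows the shape of the objective.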


“…The network depth is of crucial importance for challenging image classification problems [30]. Many deep neural network architectures (which use hundreds of layers) such as ResNet [31], ResNext [57] or HRnet [58], have provided an outstanding performance in varied image datasets with multitude of different objects [59]. These complex architectures are usually exploited in a pre-trained fashion [60], which saves computational efforts and allows different domains to take advantage of their prediction capabilities when the scarcity of annotated examples invalidates the training of such models from scratch [61]- [63].…”

Section: Convolutional Neural Network (mentioning, confidence: 99%)

Galaxy Image Classification Based on Citizen Science Data: A Comparative Study

et al. 2020


Many research fields are now faced with huge volumes of data automatically generated by specialised equipment. Astronomy is a discipline that deals with large collections of images difficult to handle by experts alone. As a consequence, astronomers have been relying on the power of the crowds, as a form of citizen science, for the classification of galaxy images by amateur people. However, the new generation of telescopes that will produce images at a higher rate highlights the limitations of this approach, and the use of machine learning methods for automatic classification is considered essential. The goal of this paper is to shed light on the automated classification of galaxy images exploring two distinct machine learning strategies. First, following the classical approach consisting of feature extraction together with a classifier, we compare the state-of-the-art feature extractor for this problem, the WND-CHARM, with our proposal based on autoencoders for feature extraction on galaxy images. We then compare these results with an end-to-end classification using convolutional neural networks. To better leverage the available citizen science data, we also investigate a pre-training scheme that exploits both amateur- and expert-labelled data. Our experiments reveal that autoencoders greatly speed up feature extraction in comparison with WND-CHARM and both classification strategies, either using convolutional neural networks or feature extraction, reach comparable accuracy. The use of pre-training in convolutional neural networks, however, has allowed us to provide even better results.


“…Inspired by High-Resolution Network [22], we develop a carefully modified HRNet containing three stages as the shared backbone network, which can be end-to-end trained. The input is first fed into a stem consisting of two 3 × 3 convolutions with stride 2 for resolution reduction, and subsequently transmitted the main body that includes parallel multi-branch convolutions with different resolutions.…”

Section: High-resolution Network (mentioning, confidence: 99%)
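The stem in the excerpt above reduces resolution with two stride-2 3 × 3 convolutions. Assuming padding 1 (a common choice that the excerpt does not state), the standard convolution output-size formula, floor((size + 2·padding − kernel) / stride) + 1, shows that the stem hands the parallel branches a feature map at 1/4 of the input resolution. The function names below are illustrative only.

```python
def conv_out_size(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def stem_out_size(size: int, num_convs: int = 2) -> int:
    """Apply `num_convs` stride-2 3x3 convolutions (padding 1 assumed)."""
    for _ in range(num_convs):
        size = conv_out_size(size)
    return size


for s in (256, 384, 512):
    print(s, "->", stem_out_size(s))
# 256 -> 64
# 384 -> 96
# 512 -> 128
```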

“…C in each residual unit is the number of channels. Following the design of HRNet [22], we gradually append high-to low-resolution streams, forming the new stage consisting of the previous resolution and an extra lower one, and connect the multi-resolution branches in parallel. The advantage is that the resulting representation is more precise spatial location and richer semantic information.…”

Section: High-resolution Network (mentioning, confidence: 99%)
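The stage-wise construction in the excerpt above (each new stage keeps the existing resolutions and appends one lower-resolution stream in parallel) can be tabulated. The sketch assumes the usual HRNet pattern: stage k has k branches, and each added branch halves the spatial resolution and doubles the channel width (C, 2C, 4C, ...). The three-stage setting mirrors the excerpt's modified backbone, but the base width `c = 32` and the 64 × 64 input to the branches are hypothetical values chosen for illustration.

```python
from typing import List, Tuple


def branch_layout(num_stages: int, base_res: int, c: int) -> List[List[Tuple[int, int]]]:
    """Return, per stage, the (resolution, channels) of each parallel branch.

    Assumptions (hypothetical, following the usual HRNet pattern):
      - stage k (1-indexed) has k parallel branches;
      - branch b (0-indexed) runs at base_res / 2**b with c * 2**b channels.
    """
    stages = []
    for k in range(1, num_stages + 1):
        stages.append([(base_res // (2 ** b), c * (2 ** b)) for b in range(k)])
    return stages


for k, stage in enumerate(branch_layout(num_stages=3, base_res=64, c=32), start=1):
    print(f"stage {k}: " + ", ".join(f"{r}x{r}@{ch}ch" for r, ch in stage))
# stage 1: 64x64@32ch
# stage 2: 64x64@32ch, 32x32@64ch
# stage 3: 64x64@32ch, 32x32@64ch, 16x16@128ch
```

Keeping the high-resolution branch alive in every stage is what lets the final representation retain precise spatial detail while the lower branches add semantic context.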

HROM: Learning High-Resolution Representation and Object-Aware Masks for Visual Object Tracking

Zhang, Zheng, Wang, et al. 2020. Sensors

Siamese network-based trackers consider tracking as features cross-correlation between the target template and the search region. Therefore, feature representation plays an important role for constructing a high-performance tracker. However, all existing Siamese networks extract the deep but low-resolution features of the entire patch, which is not robust enough to estimate the target bounding box accurately. In this work, to address this issue, we propose a novel high-resolution Siamese network, which connects the high-to-low resolution convolution streams in parallel as well as repeatedly exchanges the information across resolutions to maintain high-resolution representations. The resulting representation is semantically richer and spatially more precise by a simple yet effective multi-scale feature fusion strategy. Moreover, we exploit attention mechanisms to learn object-aware masks for adaptive feature refinement, and use deformable convolution to handle complex geometric transformations. This makes the target more discriminative against distractors and background. Without bells and whistles, extensive experiments on popular tracking benchmarks containing OTB100, UAV123, VOT2018 and LaSOT demonstrate that the proposed tracker achieves state-of-the-art performance and runs in real time, confirming its efficiency and effectiveness.
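The repeated exchange of information across resolutions mentioned in this abstract follows the HRNet fusion pattern, in which each output branch sums the contributions of all input branches after resampling them to its own resolution. The toy sketch below works on single-channel 2D lists and, as a loudly labeled simplification, replaces HRNet's learned resampling (strided 3 × 3 convolutions to go down, a 1 × 1 convolution plus upsampling to go up) with plain stride-based subsampling and nearest-neighbor upsampling. All function names are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def upsample_nearest(x: Grid, factor: int) -> Grid:
    """Nearest-neighbor upsampling: repeat each value factor x factor times."""
    return [[v for v in row for _ in range(factor)] for row in x for _ in range(factor)]


def subsample(x: Grid, factor: int) -> Grid:
    """Keep every `factor`-th row and column (stand-in for a strided convolution)."""
    return [row[::factor] for row in x[::factor]]


def add(a: Grid, b: Grid) -> Grid:
    """Element-wise sum of two equally sized grids."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def fuse(branches: List[Grid]) -> List[Grid]:
    """Exchange information across parallel branches.

    Branch i is assumed to have 1/2**i the resolution of branch 0. Output
    branch j is the sum of every input branch resampled to branch j's size.
    """
    outputs = []
    for j, target in enumerate(branches):
        acc = [[0.0] * len(target[0]) for _ in target]
        for i, src in enumerate(branches):
            if i < j:  # higher-resolution input: downsample
                src = subsample(src, 2 ** (j - i))
            elif i > j:  # lower-resolution input: upsample
                src = upsample_nearest(src, 2 ** (i - j))
            acc = add(acc, src)
        outputs.append(acc)
    return outputs


high = [[1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0],
        [9.0, 10.0, 11.0, 12.0],
        [13.0, 14.0, 15.0, 16.0]]
low = [[100.0, 200.0],
       [300.0, 400.0]]
fused_high, fused_low = fuse([high, low])
print(fused_high[0])  # [101.0, 102.0, 203.0, 204.0]
print(fused_low)      # [[101.0, 203.0], [309.0, 411.0]]
```

After fusion, the high-resolution branch carries coarse context from the low branch and the low branch carries detail sampled from the high branch; repeating this step across stages is what the abstract calls repeatedly exchanging information across resolutions.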
