Humans recognize the visual world at multiple levels: we effortlessly categorize scenes and detect the objects within them, while also identifying the textures and surfaces of those objects along with their compositional parts. In this paper, we study a new task called Unified Perceptual Parsing, which requires a machine vision system to recognize as many visual concepts as possible from a given image. A multi-task framework called UPerNet and an accompanying training strategy are developed to learn from heterogeneous image annotations. We benchmark our framework on Unified Perceptual Parsing and show that it is able to effectively segment a wide range of concepts from images. The trained networks are further applied to discover visual knowledge in natural scenes.
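Since different data sources annotate different concept levels (scene labels, object masks, parts, materials, textures), no single image carries supervision for every task. Below is a minimal sketch of how a multi-task parser can train on such heterogeneous annotations, assuming a shared backbone with one head per concept level and computing losses only for the label types each batch provides; all module names, channel counts, and batch keys are illustrative, not UPerNet's actual API.

```python
import torch
import torch.nn as nn

class MultiHeadParser(nn.Module):
    def __init__(self, backbone, heads):
        super().__init__()
        self.backbone = backbone           # shared feature extractor
        self.heads = nn.ModuleDict(heads)  # one head per concept level

    def forward(self, images):
        feats = self.backbone(images)
        return {name: head(feats) for name, head in self.heads.items()}

# toy instantiation: a conv backbone with per-pixel heads for two tasks
backbone = nn.Conv2d(3, 16, 3, padding=1)
model = MultiHeadParser(backbone, {
    "object": nn.Conv2d(16, 150, 1),    # object segmentation logits
    "material": nn.Conv2d(16, 26, 1),   # material segmentation logits
})

def training_step(model, batch):
    # batch["labels"] maps task name -> target mask, and omits tasks
    # that the current data source does not annotate
    preds = model(batch["images"])
    loss = 0.0
    for task, target in batch["labels"].items():
        loss = loss + nn.functional.cross_entropy(preds[task], target)
    return loss

batch = {"images": torch.randn(2, 3, 64, 64),
         "labels": {"object": torch.randint(0, 150, (2, 64, 64))}}
print(training_step(model, batch))
```

The key point is that gradients flow through the shared backbone from whichever heads happen to be supervised, so a single network accumulates all concept levels over the course of training.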
Modern CNN-based object detectors rely on bounding box regression and non-maximum suppression (NMS) to localize objects. While the probabilities for class labels naturally reflect classification confidence, a corresponding measure of localization confidence is absent. This causes properly localized bounding boxes to degenerate during iterative regression or even to be suppressed during NMS. In this paper, we propose IoU-Net, which learns to predict the IoU between each detected bounding box and its matched ground truth. The network thereby acquires localization confidence, which improves the NMS procedure by preserving accurately localized bounding boxes. Furthermore, an optimization-based bounding box refinement method is proposed, in which the predicted IoU is formulated as the objective. Extensive experiments on the MS-COCO dataset show the effectiveness of IoU-Net, as well as its compatibility with and adaptivity to several state-of-the-art object detectors.
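A minimal sketch of IoU-guided NMS in this spirit: boxes are ranked by the predicted localization confidence rather than the classification score, and a kept box inherits the highest class score among the boxes it suppresses. The tensor layout and the exact score-propagation rule shown here are illustrative assumptions, not the paper's implementation.

```python
import torch
from torchvision.ops import box_iou

def iou_guided_nms(boxes, cls_scores, loc_scores, iou_thresh=0.5):
    # boxes: (N, 4) in (x1, y1, x2, y2); cls_scores/loc_scores: (N,)
    order = loc_scores.argsort(descending=True)   # rank by predicted IoU
    boxes, cls_scores = boxes[order], cls_scores[order].clone()
    keep = []
    active = torch.ones(len(boxes), dtype=torch.bool)
    for i in range(len(boxes)):
        if not active[i]:
            continue
        overlap = box_iou(boxes[i:i+1], boxes)[0]
        cluster = active & (overlap > iou_thresh)
        # the kept box takes the best classification score in its cluster
        cls_scores[i] = cls_scores[cluster].max()
        active &= ~cluster
        keep.append(i)
    return boxes[keep], cls_scores[keep]

boxes = torch.tensor([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]],
                     dtype=torch.float)
cls = torch.tensor([0.9, 0.6, 0.8])
loc = torch.tensor([0.5, 0.95, 0.7])   # second box is better localized
print(iou_guided_nms(boxes, cls, loc))
```

In the toy example, the better-localized second box survives suppression and inherits the 0.9 class score of the box it suppresses, which standard score-ranked NMS would instead have kept.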
Detecting individual pedestrians in a crowd remains a challenging problem, since pedestrians often gather together and occlude each other in real-world scenarios. In this paper, we first explore how a state-of-the-art pedestrian detector is harmed by crowd occlusion via experimentation, providing insights into the crowd occlusion problem. Then, we propose a novel bounding box regression loss specifically designed for crowd scenes, termed repulsion loss. This loss is driven by two motivations: attraction by the target, and repulsion by other surrounding objects. The repulsion term prevents the proposal from shifting to surrounding objects, thus leading to more crowd-robust localization. Our detector trained with repulsion loss outperforms the state-of-the-art methods with a significant improvement in occlusion cases.
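One plausible rendering of the two terms, assuming (x1, y1, x2, y2) boxes: an attraction term pulls the proposal toward its assigned ground truth, while a repulsion term penalizes overlap with other nearby ground truths, measured as intersection over the ground-truth area (IoG). The paper uses a smoothed penalty; this sketch substitutes a plain linear one, and the weight alpha is illustrative.

```python
import torch
import torch.nn.functional as F

def intersection_over_gt(pred, gts):
    # overlap between one predicted box and each GT, divided by GT area
    lt = torch.max(pred[:2], gts[:, :2])
    rb = torch.min(pred[2:], gts[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    gt_area = (gts[:, 2] - gts[:, 0]) * (gts[:, 3] - gts[:, 1])
    return inter / gt_area.clamp(min=1e-6)

def repulsion_loss(pred, target_gt, other_gts, alpha=0.5):
    attract = F.smooth_l1_loss(pred, target_gt)          # pull toward target
    if len(other_gts) == 0:
        return attract
    repel = intersection_over_gt(pred, other_gts).max()  # push from others
    return attract + alpha * repel

pred = torch.tensor([1., 1., 5., 5.])
tgt = torch.tensor([0., 0., 4., 4.])
others = torch.tensor([[3., 3., 8., 8.]])
print(repulsion_loss(pred, tgt, others))
```

Using IoG rather than IoU for the repulsion term means the penalty grows as the proposal covers more of a neighboring pedestrian, regardless of how large the proposal itself is.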
We present FoveaBox, an accurate, flexible, and completely anchor-free framework for object detection. While almost all state-of-the-art object detectors utilize predefined anchors to enumerate possible locations, scales, and aspect ratios in the search for objects, their performance and generalization ability are also limited by the anchor design. Instead, FoveaBox directly learns the possibility that an object exists and the bounding box coordinates, without anchor references. This is achieved by: (a) predicting category-sensitive semantic maps for the possibility that an object exists, and (b) producing a category-agnostic bounding box for each position that potentially contains an object. The scales of target boxes are naturally associated with feature pyramid representations for each input image. Without bells and whistles, FoveaBox achieves state-of-the-art single-model performance of 42.1 AP on the standard COCO detection benchmark. Especially for objects with arbitrary aspect ratios, FoveaBox brings significant improvement compared to anchor-based detectors. More surprisingly, when challenged by stretched testing images, FoveaBox shows great robustness and generalization ability under the changed distribution of bounding box shapes. The code will be made publicly available.
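A minimal sketch of such an anchor-free head, assuming one feature-pyramid level as input: one convolutional branch predicts a per-position, category-sensitive score map, and another predicts a class-agnostic 4-vector box at every position. The channel counts, kernel sizes, and (l, t, r, b) decoding convention below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    def __init__(self, in_ch=256, num_classes=80):
        super().__init__()
        self.cls = nn.Conv2d(in_ch, num_classes, 3, padding=1)
        self.box = nn.Conv2d(in_ch, 4, 3, padding=1)

    def forward(self, feat):
        # feat: (B, C, H, W) from one pyramid level
        scores = self.cls(feat).sigmoid()   # object-existence per class
        offsets = self.box(feat)            # (l, t, r, b) per position
        return scores, offsets

head = AnchorFreeHead()
scores, offsets = head(torch.randn(1, 256, 32, 32))
print(scores.shape, offsets.shape)  # (1, 80, 32, 32), (1, 4, 32, 32)
```

Because every spatial position predicts its own box directly, there is no enumeration of anchor scales or aspect ratios; the pyramid level a target is assigned to determines the scale range it learns.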
The development of object detection in the era of deep learning, from R-CNN [11] and Fast/Faster R-CNN [10, 31] to the recent Mask R-CNN [14] and RetinaNet [24], has mainly come from novel networks, new frameworks, and loss designs. However, mini-batch size, a key factor in the training of deep neural networks, has not been well studied for object detection. In this paper, we propose a Large Mini-Batch Object Detector (MegDet) to enable training with a large mini-batch size of up to 256, so that we can effectively utilize as many as 128 GPUs to significantly shorten the training time. Technically, we suggest a warmup learning rate policy and Cross-GPU Batch Normalization, which together allow us to successfully train a large mini-batch detector in much less time (e.g., from 33 hours to 4 hours) and achieve even better accuracy. MegDet is the backbone of our submission (mmAP 52.5%) to the COCO 2017 Challenge, where we won 1st place in the Detection task.
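A minimal sketch of a linear warmup policy of the kind mentioned above: the learning rate ramps from a fraction of the target value up to the full, mini-batch-scaled rate over the first iterations, which stabilizes early large-batch training. The warmup length, base rate, and starting factor here are illustrative, not the paper's settings.

```python
def warmup_lr(step, base_lr=0.02, warmup_steps=500, warmup_factor=1/3):
    """Linearly interpolate from warmup_factor * base_lr to base_lr."""
    if step >= warmup_steps:
        return base_lr
    alpha = step / warmup_steps
    return base_lr * (warmup_factor * (1 - alpha) + alpha)

# applied once per iteration, e.g.:
# for group in optimizer.param_groups:
#     group["lr"] = warmup_lr(iteration)
```

Cross-GPU Batch Normalization complements this by computing BN statistics over all GPUs rather than per device, so the effective statistics match the large mini-batch even when each GPU holds only a few images.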
We present a new approach to localize extensive facial landmarks with a coarse-to-fine convolutional network cascade. Deep convolutional neural networks (DCNNs) have been successfully utilized in facial landmark localization for two-fold advantages: 1) geometric constraints among facial points are implicitly utilized; 2) huge amounts of training data can be leveraged. However, in the task of extensive facial landmark localization, a large number of facial landmarks (more than 50 points) must be located by a unified system, which poses great difficulty for the structural design and training of traditional convolutional networks. In this paper, we design a four-level convolutional network cascade that tackles the problem in a coarse-to-fine manner. In our system, each network level is trained to locally refine a subset of the facial landmarks generated by previous levels. In addition, each level predicts explicit geometric constraints (the position and rotation angle of a specific facial component) to rectify the inputs to the current level. The combination of the coarse-to-fine cascade and geometric refinement enables our system to accurately locate extensive facial landmarks (68 points) in the 300-W facial landmark localization challenge.
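A minimal sketch of the coarse-to-fine idea, with stand-in networks: each level receives the current landmark estimates and predicts a residual correction, so later (finer) levels only need to model small local adjustments. The patch extraction, network shapes, and the simplification that every level updates all 68 points are assumptions; the paper's design refines landmark subsets and rectifies inputs with predicted component poses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefineNet(nn.Module):
    # stand-in for one cascade level: maps an image patch to offsets
    def __init__(self, num_points=68, size=64):
        super().__init__()
        self.num_points = num_points
        self.fc = nn.Linear(3 * size * size, num_points * 2)

    def forward(self, patch):
        out = self.fc(patch.flatten(1))
        return out.view(-1, self.num_points, 2)

def cascade_predict(image, levels, init_landmarks):
    # each level predicts a residual correction to the current estimate
    landmarks = init_landmarks
    for net in levels:                               # coarse -> fine
        patch = F.interpolate(image, size=(64, 64))  # crop stand-in
        landmarks = landmarks + net(patch)
    return landmarks

levels = [RefineNet() for _ in range(4)]  # four-level cascade
pts = cascade_predict(torch.randn(1, 3, 128, 128), levels,
                      torch.zeros(1, 68, 2))
print(pts.shape)  # (1, 68, 2)
```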