PP-PicoDet: A Better Real-Time Object Detector on Mobile Devices

Yu, Guanghua; Chang, Qinyao; Lv, Wenyu; Xu, Chang; Cui, Cheng; Ji, Wei; Dang, Qingqing; Deng, Kaipeng; Wang, Guanzhong; Du, Yuning; Lai, Baohua; Liu, Qiwen; Hu, Xiaoguang; Yu, Dianhai; Ma, Yanjun

doi:10.48550/arxiv.2111.00902

Cited by 37 publications

(42 citation statements)

References 27 publications

(47 reference statements)


“…Compared with state-of-the-art CNN-based object detectors (e.g., YoloX [58], EfficientDet [38], PP-PicoDet-L [61]), EfficientViT also provides significant improvements. Specifically, EfficientViT-Det-r608 provides 1.7 AP improvement over PP-PicoDet-L and requires slightly fewer MACs.…”

Section: COCO Object Detection

mentioning

confidence: 99%


EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition

Han, Gan, Han. 2022. Preprint.


Vision Transformer (ViT) has achieved remarkable performance in many vision tasks. However, ViT is inferior to convolutional neural networks (CNNs) when targeting high-resolution mobile vision applications. The key computational bottleneck of ViT is the softmax attention module which has quadratic computational complexity with the input resolution. It is essential to reduce the cost of ViT to deploy it on edge devices. Existing methods (e.g., Swin, PVT) restrict the softmax attention within local windows or reduce the resolution of key/value tensors to reduce the cost, which sacrifices ViT's core advantages on global feature extractions. In this work, we present EfficientViT, an efficient ViT architecture for high-resolution low-computation visual recognition. Instead of restricting the softmax attention, we propose to replace softmax attention with linear attention while enhancing its local feature extraction ability with depthwise convolution. EfficientViT maintains global and local feature extraction capability while enjoying linear computational complexity. Extensive experiments on COCO object detection and Cityscapes semantic segmentation demonstrate the effectiveness of our method. On the COCO dataset, EfficientViT achieves 42.6 AP with 4.4G MACs, surpassing EfficientDet-D1 by 2.4 AP while having 27.9% fewer MACs. On Cityscapes, EfficientViT reaches 78.7 mIoU with 19.1G MACs, outperforming SegFormer by 2.5 mIoU while requiring less than 1/3 the computational cost. On Qualcomm Snapdragon 855 CPU, EfficientViT is 3× faster than EfficientNet while achieving higher ImageNet accuracy. Preprint. Under review.


“…† denotes the best result we find for CNN-based mobile object detection, which is achieved with a bunch of additional techniques (e.g., neural architecture search, ghost module, CSP, Cycle-EMA, etc.). Compared with this strong baseline (PP-PicoDet-L[61]), EfficientViT provides 1.7 higher AP with slightly lower MACs.…”

mentioning

confidence: 97%

EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition

Han, Gan, Han. 2022. Preprint.


“…The main idea of knowledge distillation is to distill knowledge from a large model to a small model. Nowadays, lightweight networks have become a popular research direction in object detection, such as PP-PicoDet [31], Nanodet [32], and YOLO-Fastest [33]. They have significantly reduced the number of model parameters and improved the detection speed, but the accuracy is comparatively low.…”

Section: Related Work

mentioning

confidence: 99%

A Lightweight Sea Surface Object Detection Network for Unmanned Surface Vehicles

Yang, Li, Wang et al. 2022. JMSE.


For unmanned surface vehicles (USVs), perception and control are commonly performed in embedded devices with limited computing power. Sea surface object detection can provide sufficient information for USVs, while most algorithms have poor real-time performance on embedded devices. To achieve real-time object detection on the USV platform, this paper designs a lightweight object detection network based on YOLO v5. In our work, an improved ShuffleNet v2 based on the attention mechanism was adopted as a backbone network to extract features. The depth-wise separable convolution module was introduced to rebuild the neck network. Additionally, the fusion method was changed from Concat to Add to optimize the feature fusion module. Experiments show that the proposed method reached 32.64 frames per second (FPS) on the Nvidia Jetson AGX Xavier and achieved a mean average precision (mAP) of 93.1% and 93.9% on our dataset and Singapore Maritime Dataset, respectively. Moreover, the number of model parameters of the proposed network was only 25% of that of YOLO v5n. The proposed network achieves a better balance between speed and accuracy, which is more suitable for detecting sea surface objects for USVs.


“…Alternatively, ViLBERT [26] and LXMERT [27] introduced the two-stream architecture, where two transformers are applied to images and text independently, which is fused by a third transformer in a later stage. These models typically rely on region-based image features extracted a pre-trained object detectors based on commonly used two-staged detectors (typically Faster R-CNN model [28] or its extension Mask-RCNN [29]), or single-stage detectors (typically SSD and YOLO V3 [30]) or anchor-free detectors(e.g., [31]). Another directions are patch embedding [32,33,34,35,36].…”

Section: Related Work

mentioning

confidence: 99%

Logically at Factify 2022: Multimodal Fact Verification

Gao, Hoffmann, Oikonomou et al. 2021. Preprint.


This paper describes our participant system for the multi-modal fact verification (Factify) challenge at AAAI 2022. Despite the recent advance in text based verification techniques and large pre-trained multimodal models cross vision and language, very limited work has been done in applying multimodal techniques to automate fact checking process, particularly considering the increasing prevalence of claims and fake news about images and videos on social media. In our work, the challenge is treated as multimodal entailment task and framed as multi-class classification. Two baseline approaches are proposed and explored including an ensemble model (combining two uni-modal models) and a multimodal attention network (modeling the interaction between image and text pair from claim and evidence document). We conduct several experiments investigating and benchmarking different SoTA pre-trained transformers and vision models in this work. Our best model is ranked first in leaderboard which obtains a weighted average F-measure of 0.77 on both validation and test set. Exploratory analysis of dataset is also carried out on the Factify data set and uncovers salient patterns and issues (e.g., word overlapping, visual entailment correlation, source bias) that motivates our hypothesis. Finally, we highlight challenges of the task and multimodal dataset for future research.
