Can Transformer perform 2D object-level recognition from a pure sequence-to-sequence perspective with minimal knowledge about the 2D spatial structure? To answer this question, we present You Only Look at One Sequence (YOLOS), a series of object detection models based on the naïve Vision Transformer with the fewest possible modifications and inductive biases. We find that YOLOS pre-trained only on the mid-sized ImageNet-1k dataset can already achieve competitive object detection performance on COCO; e.g., YOLOS-Base, directly adopted from BERT-Base, achieves 42.0 box AP. We also discuss the impacts and limitations of current pre-training schemes and model scaling strategies for Transformers in vision through the lens of object detection. Code and model weights are available at https://github.com/hustvl/YOLOS. (Various sophisticated or hybrid architectures have recently been termed "Vision Transformer"; for disambiguation, "Vision Transformer" and "ViT" here refer to the vanilla architecture proposed by Dosovitskiy et al. [20] unless otherwise specified.)
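To make the sequence-to-sequence recipe concrete, below is a minimal sketch of a YOLOS-style detector: a vanilla ViT encoder whose input is one flat sequence of patch tokens plus a fixed set of learnable detection tokens, with each output detection token decoded by small heads into class logits and a box. Names such as `YOLOSLikeDetector`, `num_det_tokens`, and the passed-in `vit_encoder` are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch: a plain ViT body consumes [patch tokens; DET tokens] as one
# sequence; only the DET token outputs are decoded into detections.
import torch
import torch.nn as nn

class YOLOSLikeDetector(nn.Module):
    def __init__(self, vit_encoder: nn.Module, embed_dim=768,
                 num_det_tokens=100, num_classes=80):
        super().__init__()
        self.encoder = vit_encoder  # assumed: any vanilla ViT body mapping tokens to tokens
        self.det_tokens = nn.Parameter(torch.zeros(1, num_det_tokens, embed_dim))
        # class head: +1 for the "no object" class, as in DETR-style set prediction
        self.class_head = nn.Linear(embed_dim, num_classes + 1)
        # box head: 4 normalized coordinates (cx, cy, w, h)
        self.box_head = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, 4),
        )

    def forward(self, patch_tokens):  # (B, num_patches, embed_dim)
        B = patch_tokens.size(0)
        det = self.det_tokens.expand(B, -1, -1)
        x = torch.cat([patch_tokens, det], dim=1)  # one flat sequence, no 2D structure used
        x = self.encoder(x)
        det_out = x[:, -det.size(1):]              # keep only the detection tokens
        return self.class_head(det_out), self.box_head(det_out).sigmoid()
```

The point of the sketch is that no detection-specific 2D machinery (anchors, region proposals, feature pyramids) appears anywhere; the only additions to a plain ViT are the detection tokens and two lightweight heads.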
Data augmentation is a powerful technique for increasing the diversity of training data, which can effectively improve the generalization ability of neural networks in image recognition tasks. Recent data-mixing augmentation strategies have achieved great success. In particular, CutMix improves classifiers with a simple but effective method: randomly cropping a patch from one image and pasting it onto another. To further improve on CutMix, a series of works explores using the saliency information of the image to guide the mixing. We systematically study the importance of saliency information for mixing data and find that it is not necessary for improving augmentation performance. Furthermore, we find that cutting-based data-mixing methods suffer from two problems, label misallocation and missing object information, which cannot be resolved simultaneously. We propose a more effective yet easily implemented method, namely ResizeMix: we mix the data by directly resizing the source image to a small patch and pasting it onto another image. The obtained patch preserves more substantial object information than conventional cut-based methods. ResizeMix shows clear advantages over CutMix and the saliency-guided methods on both image classification and object detection tasks without additional computational cost, and it even outperforms most costly search-based automatic augmentation methods.
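The operation itself is short enough to show directly. Below is a hedged batch-level sketch of the resize-and-paste idea: shrink a whole source image to a random patch, paste it at a random location in the target image, and weight the labels by the patch's area ratio. The scale range and the area-ratio label mixing are illustrative assumptions, not the paper's exact settings.

```python
# Hedged sketch of a ResizeMix-style augmentation for a batch of images.
import torch
import torch.nn.functional as F

def resizemix(images, labels, scale=(0.1, 0.8)):
    """images: (B, C, H, W); labels: (B, num_classes) one-hot or soft labels."""
    B, _, H, W = images.shape
    perm = torch.randperm(B)                      # pair each target with a source image
    tau = torch.empty(1).uniform_(*scale).item()  # patch side ratio (assumed range)
    ph, pw = max(1, int(H * tau)), max(1, int(W * tau))
    # resize the whole source image down to the patch size:
    # unlike random cropping, this keeps the full object visible
    patches = F.interpolate(images[perm], size=(ph, pw),
                            mode='bilinear', align_corners=False)
    # paste at a random location fully inside the target image
    y = torch.randint(0, H - ph + 1, (1,)).item()
    x = torch.randint(0, W - pw + 1, (1,)).item()
    mixed = images.clone()
    mixed[:, :, y:y + ph, x:x + pw] = patches
    lam = (ph * pw) / (H * W)                     # label weight = pasted area ratio
    mixed_labels = (1 - lam) * labels + lam * labels[perm]
    return mixed, mixed_labels
```

Because the pasted patch always contains the entire source image, the source label can never be misallocated to background, which is precisely the failure mode of cut-based mixing that the abstract describes.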
Transformers have offered a new methodology for designing neural networks for visual recognition. Compared to convolutional networks, Transformers enjoy the ability to refer to global features at each stage, yet the attention module incurs higher computational overhead that obstructs applying Transformers to high-resolution visual data. This paper aims to alleviate the conflict between efficiency and flexibility, for which we propose a specialized token for each region that serves as a messenger (MSG). By manipulating these MSG tokens, one can flexibly exchange visual information across regions while reducing the computational complexity. We then integrate the MSG token into a multi-scale architecture named MSG-Transformer. On standard image classification and object detection, MSG-Transformer achieves competitive performance while accelerating inference on both GPU and CPU. The code will be available at https://github.com/hustvl/MSG-Transformer.
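A minimal sketch of the messenger-token idea follows: each region attends only within itself plus one extra MSG token, and cross-region communication happens by manipulating the MSG tokens alone. Here that manipulation is a simple channel shuffle across regions; the paper's exact exchange operator, dimensions, and block layout may differ, so treat every name and hyperparameter below as an assumption.

```python
# Hedged sketch of one MSG-style block: local attention per region with a
# prepended MSG token, then information exchange among MSG tokens only.
import torch
import torch.nn as nn

class MSGBlock(nn.Module):
    def __init__(self, dim=96, num_heads=3, shuffle_groups=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.groups = shuffle_groups

    def forward(self, regions, msg):
        # regions: (B, R, T, C) patch tokens per region; msg: (B, R, C)
        B, R, T, C = regions.shape
        x = torch.cat([msg.unsqueeze(2), regions], dim=2)   # prepend MSG token
        x = x.reshape(B * R, T + 1, C)
        h = self.norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # attention stays local
        x = x.reshape(B, R, T + 1, C)
        msg, regions = x[:, :, 0], x[:, :, 1:]
        # exchange information across regions through the MSG tokens only:
        # split channels into groups and shuffle them across the region axis
        g = self.groups
        msg = (msg.reshape(B, R, g, C // g)
                  .transpose(1, 2)
                  .reshape(B, R, C))
        return regions, msg
```

The efficiency argument is visible in the shapes: attention cost scales with the region size T + 1 rather than the full token count R * T, while the cheap MSG shuffle is what restores a path for global information flow.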