Yixuan Wei scite author profile

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (86.4 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The code and models will be made publicly available at https://github.com/microsoft/Swin-Transformer.

* Equal contribution. † Interns at MSRA. ‡ Contact person.
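
The windowing idea described in the abstract above can be illustrated with a small NumPy sketch. It only shows how a feature map might be split into non-overlapping windows and how shifting the grid lets a window span several of the original windows. The function names, the 8×8 toy feature map, the window size of 4, and the use of a cyclic roll to realize the shift are all assumptions made for illustration; the abstract does not specify them, and attention itself (including any masking) is left out.

```python
import numpy as np

def window_partition(x, window_size):
    """Split an (H, W, C) array into non-overlapping square windows.

    Returns an array of shape (num_windows, window_size, window_size, C).
    Hypothetical helper for illustration only.
    """
    H, W, C = x.shape
    if H % window_size or W % window_size:
        raise ValueError("H and W must be divisible by window_size")
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    # Bring the two window-grid axes together, then flatten them.
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size, window_size, C)

def shifted_window_partition(x, window_size, shift):
    """Cyclically shift the map up and left by `shift`, then partition.

    The cyclic roll is an assumed way to realize the shifted grid.
    """
    rolled = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(rolled, window_size)

# Toy 8x8 single-channel map whose values encode position (row * 8 + col).
feature_map = np.arange(8 * 8).reshape(8, 8, 1)

regular = window_partition(feature_map, window_size=4)
shifted = shifted_window_partition(feature_map, window_size=4, shift=2)

# The first shifted window covers rows 2..5 and cols 2..5 of the original
# map, so it contains pixels from all four regular windows.
first_shifted = set(shifted[0].ravel().tolist())
overlaps = [len(first_shifted & set(w.ravel().tolist())) for w in regular]
```

With a fixed window size, the number of windows grows in proportion to the number of pixels while the cost per window stays constant, which is consistent with the linear complexity the abstract mentions.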

DeepHuman: 3D Human Reconstruction From a Single Image

Zheng, Wei, et al. 2019

315

289

We propose DeepHuman, an image-guided volume-to-volume translation CNN for 3D human reconstruction from a single RGB image. To reduce the ambiguities associated with surface geometry reconstruction, even for invisible areas, we propose and leverage a dense semantic representation generated from the SMPL model as an additional input. One key feature of our network is that it fuses different scales of image features into the 3D space through volumetric feature transformation, which helps to recover accurate surface geometry. The visible surface details are further refined through a normal refinement network, which can be concatenated with the volume generation network using our proposed volumetric normal projection layer. We also contribute THuman, a 3D real-world human model dataset containing about 7000 models. The network is trained using training data generated from the dataset. Overall, due to the specific design of our network and the diversity in our dataset, our method enables 3D human model estimation given only a single image and outperforms state-of-the-art approaches.

Histone Modifications Regulate Chromatin Compartmentalization by Contributing to a Phase Separation Mechanism

et al. 2019

A review of data-driven approaches for prediction and classification of building energy consumption

Wei, Zhang, Shi, et al. 2018

Renewable and Sustainable Energy Reviews

525

215

Swin Transformer V2: Scaling Up Capacity and Resolution

et al. 2022

Video Swin Transformer

et al. 2022

Swin Transformer V2: Scaling Up Capacity and Resolution

Liu¹, Hu², Lin³, et al. 2021

Preprint

We present techniques for scaling Swin Transformer [35] up to 3 billion parameters and making it capable of training with images of up to 1,536×1,536 resolution. By scaling up capacity and resolution, Swin Transformer sets new records on four representative vision benchmarks: 84.0% top-1 accuracy on ImageNet-V2 image classification, 63.1 / 54.4 box / mask mAP on COCO object detection, 59.9 mIoU on ADE20K semantic segmentation, and 86.8% top-1 accuracy on Kinetics-400 video action classification. Our techniques are generally applicable for scaling up vision models, which [...]

* Equal. † Project lead. Ze, Yutong, Zhuliang, Zhenda, Yixuan, and Jia are long-term interns at MSRA.

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
