2020
DOI: 10.48550/arxiv.2012.00364
Preprint

Pre-Trained Image Processing Transformer

Abstract: As the computing power of modern hardware increases dramatically, pre-trained deep learning models (e.g., BERT, GPT-3) learned on large-scale datasets have shown their effectiveness over conventional methods. This significant progress is mainly attributed to the representation ability of the transformer and its variant architectures. In this paper, we study the low-level computer vision tasks (e.g., denoising, super-resolution and deraining) and develop a new pre-trained model, namely, the image processing transformer (IPT). To…

Cited by 80 publications (100 citation statements)
References 84 publications (58 reference statements)
“…Inspired by a series of recent vision transformer (ViT) works [6,7,10,50], we decide to use the ViT architecture, which has two advantages for the body reconstruction refinement task. First, ViT follows a sequence-prediction format by regarding the input image as a sequence of local patches.…”
Section: Mesh Refinement Transformer (mentioning)
confidence: 99%
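The patch-sequence view of an image that this passage relies on can be sketched in a few lines; the patch size (16) and embedding dimension (768) below are illustrative assumptions, not values taken from the cited work.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and embed each patch as a token."""
    def __init__(self, patch=16, in_ch=3, dim=768):
        super().__init__()
        # a stride-`patch` convolution is equivalent to flattening each patch and projecting it
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                         # x: (B, C, H, W)
        tokens = self.proj(x)                     # (B, dim, H/patch, W/patch)
        return tokens.flatten(2).transpose(1, 2)  # (B, num_patches, dim): a sequence of patch tokens

# usage: a 224x224 image becomes a sequence of 14*14 = 196 patch tokens
seq = PatchEmbed()(torch.rand(1, 3, 224, 224))
print(seq.shape)  # torch.Size([1, 196, 768])
```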
“…To construct our transformer network, we first use a backbone network (e.g., ResNet) to extract image features [7]. Three deconvolution layers are then added on top of the backbone to upsample the feature map and recover more spatial information.…”
Section: Mesh Refinement Transformer (mentioning)
confidence: 99%
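A minimal sketch of the feature-extraction head this citation describes, assuming a torchvision ResNet-50 backbone; the channel counts and the three stride-2 deconvolution sizes are illustrative choices, not the cited paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class DeconvHead(nn.Module):
    """Backbone features upsampled by three deconvolution layers (illustrative sizes)."""
    def __init__(self, in_channels=2048, mid_channels=256):
        super().__init__()
        # ResNet-50 without its average-pool / fully-connected classifier head
        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        # three stride-2 deconvolutions recover spatial resolution (8x upsampling total)
        layers, c = [], in_channels
        for _ in range(3):
            layers += [
                nn.ConvTranspose2d(c, mid_channels, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(mid_channels),
                nn.ReLU(inplace=True),
            ]
            c = mid_channels
        self.deconv = nn.Sequential(*layers)

    def forward(self, x):
        feat = self.backbone(x)   # e.g. (B, 2048, H/32, W/32)
        return self.deconv(feat)  # e.g. (B, 256, H/4, W/4)
```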
“…Since the results are not particularly satisfactory, its follow-up UP-DETR (Dai et al. 2020) puts forward a random query patch detection method and boosts the performance of DETR with faster convergence and higher precision. IPT (Chen et al. 2020a) generates corrupted image pairs from ImageNet (Deng et al. 2009) and pretrains a transformer on them. By fine-tuning the model on low-level CV tasks such as denoising, super-resolution and deraining, IPT outperforms contemporaneous approaches.…”
Section: Related Work (mentioning)
confidence: 99%
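To make that pre-training recipe concrete, here is a rough sketch of synthesizing (corrupted, clean) pairs from clean images; the noise level and the x2 scale factor are illustrative and not IPT's actual degradation settings.

```python
import torch
import torch.nn.functional as F

def make_corrupted_pair(clean: torch.Tensor, task: str = "denoise"):
    """Synthesize a (corrupted, clean) pair from a clean image tensor of shape (B, C, H, W)."""
    if task == "denoise":
        # additive Gaussian noise (sigma = 30/255, chosen for illustration)
        noisy = clean + torch.randn_like(clean) * (30.0 / 255.0)
        return noisy.clamp(0.0, 1.0), clean
    if task == "super_resolution":
        # bicubic downsampling by an illustrative x2 factor
        lr = F.interpolate(clean, scale_factor=0.5, mode="bicubic", align_corners=False)
        return lr, clean
    raise ValueError(f"unknown task: {task}")

# usage: build a batch of degraded inputs and clean targets for one task
batch = torch.rand(4, 3, 48, 48)
inputs, targets = make_corrupted_pair(batch, task="super_resolution")
```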
“…Among them, pre-training and meta-learning are two representative technologies, which have also been explored in image restoration. For instance, IPT [3] introduces a large-scale pre-training dataset to improve restoration performance w.r.t. the target distortion. Soh et al. [31] propose a meta-learning-based method that enables fast adaptation for the zero-shot super-resolution task, achieving SOTA performance.…”
Section: Corresponding Author (mentioning)
confidence: 99%
“…Pre-training based transfer learning. As the basic technique of transfer learning, pre-training has been widely applied to different vision tasks [3,15]. Pre-training based transfer learning can be divided into two processes, namely pre-training and fine-tuning.…”
Section: Knowledge Preliminary (mentioning)
confidence: 99%
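A minimal sketch of the two-stage workflow (pre-training then fine-tuning) mentioned above, using a generic torchvision backbone; the checkpoint path, number of target classes, and learning rate are hypothetical placeholders.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Stage 1 (pre-training) is assumed to have produced a checkpoint of backbone weights.
model = resnet18(weights=None)
state = torch.load("pretrained_backbone.pt", map_location="cpu")  # hypothetical checkpoint path
model.load_state_dict(state, strict=False)

# Stage 2 (fine-tuning): replace the task head and train with a small learning rate.
model.fc = nn.Linear(model.fc.in_features, 10)  # 10 target classes, for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def fine_tune_step(images, labels):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```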