2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01212

Pre-Trained Image Processing Transformer

Cited by 1,167 publications (541 citation statements)
References 50 publications
“…For image restoration, IPT [44] jointly trains standard Transformer blocks with multi-tails and multiheads on multiple low-level vision tasks. However, their framework still needs pre-training on a large-scale synthesized dataset and requires multi-task learning for good performance.…”
Section: Related Work (mentioning)
confidence: 99%
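For readers who want a concrete picture of the multi-head/multi-tail layout described in the excerpt above, here is a minimal PyTorch sketch: task-specific heads extract features, a shared transformer body processes patch tokens, and task-specific tails reconstruct the image. The task names, dimensions, and patch handling are illustrative assumptions, not IPT's actual configuration.

```python
import torch
import torch.nn as nn

class MultiTaskIPTSketch(nn.Module):
    """Shared transformer body with per-task heads (feature extraction)
    and tails (reconstruction), in the spirit of IPT's multi-head/multi-tail
    design. All sizes and task names are illustrative assumptions."""

    def __init__(self, tasks=("denoise", "derain", "deblur"), dim=64, patch=4):
        super().__init__()
        self.patch = patch
        # One lightweight convolutional head per task maps RGB to features.
        self.heads = nn.ModuleDict(
            {t: nn.Conv2d(3, dim, kernel_size=3, padding=1) for t in tasks}
        )
        # Shared transformer body operates on flattened patch tokens.
        layer = nn.TransformerEncoderLayer(d_model=dim * patch * patch,
                                           nhead=8, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=4)
        # One tail per task maps features back to an RGB estimate.
        self.tails = nn.ModuleDict(
            {t: nn.Conv2d(dim, 3, kernel_size=3, padding=1) for t in tasks}
        )

    def forward(self, x, task):
        # x: (B, 3, H, W) with H and W divisible by the patch size.
        b, _, h, w = x.shape
        feat = self.heads[task](x)                        # (B, dim, H, W)
        tokens = nn.functional.unfold(feat, self.patch,   # split into patches
                                      stride=self.patch).transpose(1, 2)
        tokens = self.body(tokens)                        # shared body
        feat = nn.functional.fold(tokens.transpose(1, 2), (h, w),
                                  self.patch, stride=self.patch)
        return self.tails[task](feat)                     # (B, 3, H, W)
```

During joint training, each batch would be routed through the head and tail of its own task while the body's weights are updated by all tasks, which is the multi-task aspect the excerpt refers to.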
“…Recently, several attempts use Transformer to solve SR. For example, TTSR [35] proposes a texture Transformer by transferring HR textures from the reference image to the LR image. IPT [5] develops a new pre-trained model to study the low-level computer vision task, including SR. However, it is non-trivial and difficult to directly extend these Transformer-based image SR methods to VSR.…”
Section: Related Work (mentioning)
confidence: 99%
“…First, while the locality is well-known to be crucial for VSR, the fully connected self-attention (FCSA) layer neglects to leverage such information in a video sequence. Typically, most existing vision Transformer methods (e.g., ViT [8] and IPT [5]) split an image into several patches or tokens, which may damage the local spatial information [17] to some extent since the contents (e.g., lines, edges, shapes, and even objects) are divided into different tokens. In addition, this layer focuses on global interaction between the token embeddings by using several fully connected layers to compute attention maps which are irrelevant to local information.…”
Section: Introduction (mentioning)
confidence: 99%
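The locality concern raised above stems from how ViT-style models (IPT included) tokenize the input: the image is cut into fixed-size patches and each patch is flattened into one token, so a structure that crosses a patch border is spread over several tokens. A small, hypothetical PyTorch helper illustrating that splitting:

```python
import torch

def image_to_patch_tokens(img, patch_size=16):
    """Split an image into non-overlapping patch tokens, as ViT-style models do.
    img: (B, C, H, W) with H and W divisible by patch_size.
    Returns (B, num_patches, C * patch_size * patch_size).
    Pixels of a line, edge, or object that straddles a patch border end up
    in different tokens, which is the locality issue described above."""
    tokens = torch.nn.functional.unfold(img, patch_size, stride=patch_size)
    return tokens.transpose(1, 2)

# Example: a 224x224 RGB image becomes 196 tokens of length 768.
x = torch.randn(1, 3, 224, 224)
print(image_to_patch_tokens(x).shape)  # torch.Size([1, 196, 768])
```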
“…: We gain long-range dependencies of image sequences by using transformer structure to learn sequential representation for input images. Different from most vision tasks [38], [40], [47], the inputs of the style transfer task belong to two very different domains, usually including artistic paintings and natural images. Therefore, StyTr^2 has two transformer encoders to encode domain-specific features, which are used to translate a sequence from one domain to another domain in the next stage.…”
Section: B. Style Transfer Transformer (mentioning)
confidence: 99%
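The StyTr^2 excerpt above describes two transformer encoders, one per input domain, whose outputs are then combined to translate a sequence from the content domain toward the style domain. The sketch below shows one plausible way to wire that up in PyTorch; the layer counts, dimensions, and the decoder-style cross-attention fusion are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DualEncoderStyleSketch(nn.Module):
    """Two-encoder layout in the spirit of the StyTr^2 description:
    separate encoders for the content (natural image) and style (painting)
    domains, followed by a decoder whose cross-attention lets content
    tokens attend to style tokens. Sizes are illustrative assumptions."""

    def __init__(self, dim=256, nhead=8, num_layers=3):
        super().__init__()
        content_layer = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        style_layer = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.content_encoder = nn.TransformerEncoder(content_layer, num_layers)
        self.style_encoder = nn.TransformerEncoder(style_layer, num_layers)
        dec_layer = nn.TransformerDecoderLayer(dim, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)

    def forward(self, content_tokens, style_tokens):
        # Both inputs: (B, N, dim) sequences of patch embeddings.
        c = self.content_encoder(content_tokens)   # domain-specific encoding
        s = self.style_encoder(style_tokens)       # domain-specific encoding
        # Cross-attention translates content tokens toward the style domain;
        # a CNN tail (omitted here) would map the tokens back to pixels.
        return self.decoder(tgt=c, memory=s)
```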
“…Therefore, instead of recurrent neural networks [31], transformer is widely used in various NLP tasks [32]-[37]. Inspired by breakthrough of transformer in NLP, many researchers put forward vision transformers for various image tasks, including object detection [38]-[40], segmentation [41], [42], image classification [43]-[46], image processing and generation [19], [45], [47]. Traditionally CNNs are designed to learn the local correlations within images containing inductive biases.…”
Section: Introduction (mentioning)
confidence: 99%
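The passage above contrasts a CNN's built-in locality with the transformer's attention, in which every token interacts with every other token. For reference, a minimal scaled dot-product attention in PyTorch (tensor sizes are arbitrary and chosen only to match the patch-token example earlier):

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """Plain self-attention: every token attends to every other token,
    which is the global interaction contrasted with a CNN's local
    receptive field in the passage above. q, k, v: (B, N, D)."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # (B, N, N)
    weights = scores.softmax(dim=-1)   # each token's weights over all N tokens
    return weights @ v                 # (B, N, D)

tokens = torch.randn(1, 196, 768)      # e.g. patch tokens from a ViT-style split
out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape)                        # torch.Size([1, 196, 768])
```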