2018
DOI: 10.48550/arxiv.1802.05751
Preprint

Image Transformer

Abstract: Image generation has been successfully cast as an autoregressive sequence generation or transformation problem. Recent work has shown that self-attention is an effective way of modeling textual sequences. In this work, we generalize a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood. By restricting the self-attention mechanism to attend to local neighborhoods we significantly increase the size of ima…
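The core mechanism the abstract describes, restricting masked self-attention to a local neighborhood of previously generated pixels, can be sketched as follows. This is an illustrative NumPy sketch of that idea, not the authors' implementation; the function and parameter names (local_causal_attention, block_len, mem_len) are assumptions.

import numpy as np

def local_causal_attention(x, block_len=64, mem_len=192):
    """x: (seq_len, d_model) pixel embeddings in raster-scan order.
    Each query block attends only to itself (under a causal mask) and to
    the mem_len positions immediately preceding it, not the full sequence."""
    seq_len, d = x.shape
    out = np.zeros_like(x)
    for start in range(0, seq_len, block_len):
        end = min(start + block_len, seq_len)
        ctx_start = max(0, start - mem_len)
        q = x[start:end]                    # queries for this block
        k = x[ctx_start:end]                # local memory plus the block itself
        scores = q @ k.T / np.sqrt(d)       # scaled dot-product attention logits
        # causal mask: a query at position i may only attend to positions <= i
        q_pos = np.arange(start, end)[:, None]
        k_pos = np.arange(ctx_start, end)[None, :]
        scores = np.where(k_pos <= q_pos, scores, -1e9)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)  # softmax over the local window
        out[start:end] = w @ k              # values taken equal to keys in this sketch
    return out

# e.g. a 32x32 image with 16-dimensional per-pixel embeddings
y = local_causal_attention(np.random.rand(32 * 32, 16))

Because each query block only attends to on the order of block_len + mem_len positions, the cost grows roughly linearly in the number of pixels rather than quadratically, which is what lets the model handle larger images than full self-attention would.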

Cited by 92 publications (120 citation statements)
References 6 publications (11 reference statements)
“…The method proposed in this paper follows the line of visual synthesis research based on auto-regressive models. Earlier visual auto-regressive models [5,28,39,41,44] performed visual synthesis in a "pixel-by-pixel" manner. However, due to the high computational cost when modeling high-dimensional data, such methods can be applied to low-resolution images or videos only, and are hard to scale up.…”
Section: Visual Auto-regressive Models
confidence: 99%
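For context, the "pixel-by-pixel" formulation referred to above is the standard autoregressive factorization (notation ours, not from the cited text): an image is flattened into a sequence of pixel intensities $x_1, \dots, x_n$ and the model learns

$$p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1}),$$

so each pixel is predicted from all previously generated ones. Since $n$ scales with height × width × channels, sampling and training costs grow quickly with resolution, which is the scalability limitation the citing paper points to.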
“…However, the quality of generated visual contents could be harmed due to the limited contexts used in self-attention. [6,28,32] proposed to use local-wise sparse attention in visual synthesis tasks, which allows the models to see more contexts. But these works were for images only.…”
Section: Visual Sparse Self-attention
confidence: 99%
“…Wang et al. [37] formalized self-attention as a non-local operation to explore the effectiveness of spatial-temporal dependencies in video and image sequences. Parmar et al. [38] introduced Image Transformer, applying the self-attention model to an autoregressive model for image generation. Zhang et al. [39] proposed SAGAN, which allowed the self-attention-driven and long-range dependency model for learning better image generation.…”
Section: Self-attention and Transformer
confidence: 99%
“…This mechanism allows more computation parallelization with higher performance. In the computer vision domain, some research has leveraged the transformer architecture and shown its effectiveness on some problems [4], [5]. Inspired by the transformer network, in this paper, we propose a self-attention based scene text recognizer with focal loss, named SAFL. Moreover, to tackle irregular shapes of scene texts, we also exploit a text rectification network named Spatial Transformer Network (STN) to enhance the quality of text before passing it to the recognition network.…”
Section: Introduction
confidence: 99%