2018
DOI: 10.48550/arxiv.1802.05751
Preprint

Image Transformer

Abstract: Image generation has been successfully cast as an autoregressive sequence generation or transformation problem. Recent work has shown that self-attention is an effective way of modeling textual sequences. In this work, we generalize a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood. By restricting the self-attention mechanism to attend to local neighborhoods we significantly increase the size of ima…
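The core mechanism the abstract describes, restricting masked self-attention to a local neighborhood of previously generated pixels, can be sketched as follows. This is an illustrative NumPy sketch of that idea, not the authors' implementation; the function and parameter names (local_causal_attention, block_len, mem_len) are assumptions.

import numpy as np

def local_causal_attention(x, block_len=64, mem_len=192):
    """x: (seq_len, d_model) pixel embeddings in raster-scan order.
    Each query block attends only to itself (under a causal mask) and to
    the mem_len positions immediately preceding it, not the full sequence."""
    seq_len, d = x.shape
    out = np.zeros_like(x)
    for start in range(0, seq_len, block_len):
        end = min(start + block_len, seq_len)
        ctx_start = max(0, start - mem_len)
        q = x[start:end]                    # queries for this block
        k = x[ctx_start:end]                # local memory plus the block itself
        scores = q @ k.T / np.sqrt(d)       # scaled dot-product attention logits
        # causal mask: a query at position i may only attend to positions <= i
        q_pos = np.arange(start, end)[:, None]
        k_pos = np.arange(ctx_start, end)[None, :]
        scores = np.where(k_pos <= q_pos, scores, -1e9)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)  # softmax over the local window
        out[start:end] = w @ k              # values taken equal to keys in this sketch
    return out

# e.g. a 32x32 image with 16-dimensional per-pixel embeddings
y = local_causal_attention(np.random.rand(32 * 32, 16))

Because each query block only attends to on the order of block_len + mem_len positions, the cost grows roughly linearly in the number of pixels rather than quadratically, which is what lets the model handle larger images than full self-attention would.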

Cited by 92 publications (120 citation statements)
References 6 publications (11 reference statements)
“…The method proposed in this paper follows the line of visual synthesis research based on auto-regressive models. Earlier visual auto-regressive models [5,28,39,41,44] performed visual synthesis in a "pixel-by-pixel" manner. However, due to the high computational cost when modeling high-dimensional data, such methods can be applied to low-resolution images or videos only, and are hard to scale up.…”
Section: Visual Auto-regressive Models
confidence: 99%
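For context, the "pixel-by-pixel" formulation referred to above is the standard autoregressive factorization (notation ours, not from the cited text): an image is flattened into a sequence of pixel intensities $x_1, \dots, x_n$ and the model learns

$$p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1}),$$

so each pixel is predicted from all previously generated ones. Since $n$ scales with height × width × channels, sampling and training costs grow quickly with resolution, which is the scalability limitation the citing paper points to.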
“…However, the quality of generated visual contents could be harmed due to the limited contexts used in self-attention. [6,28,32] proposed to use local-wise sparse attention in visual synthesis tasks, which allows the models to see more contexts. But these works were for images only.…”
Section: Visual Sparse Self-attention
confidence: 99%
“…Wang et al. [37] formalized self-attention as a non-local operation to explore the effectiveness of spatial-temporal dependencies in video and image sequences. Parmar et al. [38] introduced Image Transformer, applying the self-attention model to an autoregressive model for image generation. Zhang et al. [39] proposed SAGAN, which allowed the self-attention-driven and long-range dependency model for learning better image generation.…”
Section: Self-attention and Transformer
confidence: 99%
“…This mechanism allows more computation parallelization with higher performance. In the computer vision domain, some research has leveraged the transformer architecture and shown its effectiveness on some problems [4], [5]. Inspired by the transformer network, in this paper, we propose a self-attention based scene text recognizer with focal loss, named SAFL. Moreover, to tackle irregular shapes of scene texts, we also exploit a text rectification network named Spatial Transformer Network (STN) to enhance the quality of text before passing it to the recognition network.…”
Section: Introduction
confidence: 99%