2021
DOI: 10.48550/arxiv.2105.14576
Preprint
StyTr$^2$: Image Style Transfer with Transformers

Abstract: The goal of image style transfer is to render an image with artistic features guided by a style reference while maintaining the original content. Due to the locality and spatial invariance of CNNs, it is difficult to extract and maintain the global information of input images. Traditional neural style transfer methods are therefore biased, and content leak can be observed by running the style transfer process several times with the same reference style image. To address this critical issue, we take …
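The transformer-based approach the abstract alludes to rests on self-attention, in which every image patch aggregates information from every other patch, in contrast to a convolution's local receptive field. A minimal numpy sketch of that global mixing (learned projection matrices omitted; all names here are illustrative, not taken from the paper):

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention without learned projections, for
    illustration only: every output row is a softmax-weighted mix of
    ALL input rows, i.e. a global receptive field."""
    scores = x @ x.T / np.sqrt(x.shape[1])          # (n, n) pairwise similarities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ x                              # global aggregation

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))  # 16 image patches, 8-dim features each
out = self_attention(tokens)
print(out.shape)  # (16, 8)
```

Because each output row depends on all 16 input rows, no spatial locality bias is imposed, which is the property the abstract contrasts against CNNs.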

Cited by 8 publications (10 citation statements)
References 36 publications
“…We demonstrated the effectiveness of our method on different types of conditional image generation tasks. SOTA pre-trained deterministic models, including LaMa (Suvorov et al 2022) for inpainting and StyTr 2 (Deng et al 2021) for style transfer, are used to qualitatively and quantitatively validate the feasibility and effectiveness of our proposed method. In addition, we also conducted experiments on tasks such as super-resolution, dehazing, and probabilistic generation to further discuss the generalizability and limitations of our proposed method.…”
Section: Methods (mentioning); confidence: 99%
“…(Sanakoyeu et al 2018) introduced GAN structure for style transfer. Subsequent works improve the performance of neural style transfer in many aspects, including quality (Deng et al 2021) and generalization (Chiu 2019). (Dong et al 2015) took the lead in introducing learning-based method into well-posed vision tasks, e.g., super-resolution, denoising, and JPEG compression artifact reduction.…”
Section: Introduction (mentioning); confidence: 99%
“…Token adoptation in vision tasks: At the moment, tokenbased models are widely applied in almost all domains in vision, including classification [22,46,65], object detection [6,16,90], segmentation [23,74], image generation [5,20,24,38,43], video understanding [1,2,4,9,25,28,41,45,47,49,56,85], dense prediction [54,75], point clouds processing [30,88], reinforcement learning [10,37] and tracking [60].…”
Section: Related Work (mentioning); confidence: 99%
“…Multiple-Style-Per-Model NST methods have included Dumoulin et al [29], Li et al [30] and Zhang and Dana [31]. Finally GANs [32], CycleGANs [33] and image transformers [34] have been recently used for NST. Although there have been many advances in this field, the Gatys method is still considered to be the gold standard by most researchers in terms of the quality of its results [20].…”
Section: Feature Map Fusion Using Image Optimisation (mentioning); confidence: 99%
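The Gatys method named in the last statement matches feature statistics via Gram matrices. A minimal numpy sketch of the per-layer Gram-based style loss (the function and variable names are illustrative; real pipelines use deep CNN feature maps, not random arrays):

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a (channels, height*width) feature map:
    channel-wise correlations that discard spatial layout."""
    c, n = features.shape
    return features @ features.T / (c * n)

def style_loss(feat_a, feat_b):
    """Mean squared difference between Gram matrices -- the core
    per-layer term of the Gatys et al. style objective."""
    ga, gb = gram_matrix(feat_a), gram_matrix(feat_b)
    return float(np.mean((ga - gb) ** 2))

# toy feature maps: 4 channels over an 8x8 spatial grid, flattened
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 64))
b = rng.standard_normal((4, 64))
print(style_loss(a, a))  # identical features -> 0.0
```

Because the Gram matrix discards spatial positions, this loss constrains texture statistics rather than layout, which is why content preservation must come from a separate content term.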