Frequency Domain Image Translation: More Photo-realistic, Better Identity-preserving

Cai, Mu; Zhang, Hong; Huang, Huijuan; Geng, Qichuan; Huang, Gao

doi:10.1109/iccv48922.2021.01367

Cited by 47 publications

(33 citation statements)

References 50 publications


“…More recently, Gal et al. [9] propose a wavelet-based image generation method. Jiang et al. [15] introduce the focal frequency loss, which focuses on hard frequencies. Meanwhile, there are some other works [4,17] on image restoration in the frequency domain.…”

Section: Loss Functions (mentioning)

confidence: 99%


GLaMa: Joint Spatial and Frequency Loss for General Image Inpainting

Lu, Jiang, Huang, et al. 2022

Preprint


The purpose of image inpainting is to recover scratches and damaged areas using context information from the remaining parts. In recent years, thanks to the resurgence of convolutional neural networks (CNNs), the image inpainting task has seen great breakthroughs. However, most existing work considers too few types of mask, and performance drops dramatically when encountering unseen masks. To combat these challenges, we propose a simple yet general method based on the LaMa image inpainting framework [35], dubbed GLaMa. Our proposed GLaMa can better capture different types of missing information by using more types of masks. By incorporating more degraded images in the training phase, we can expect to enhance the robustness of the model with respect to various masks. In order to yield more reasonable results, we further introduce a frequency-based loss in addition to the traditional spatial reconstruction loss and adversarial loss. In particular, we introduce an effective reconstruction loss in both the spatial and frequency domains to reduce the chessboard effect and ripples in the reconstructed image. Extensive experiments demonstrate that our method can boost performance over the original LaMa method for each type of mask on the FFHQ [18], ImageNet [7], Places2 [42] and WikiArt [28] datasets. The proposed GLaMa was ranked first in terms of PSNR, LPIPS [39] and SSIM [34] in the NTIRE 2022 Image Inpainting Challenge Track 1 Unsupervised [27].
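The idea of combining a spatial reconstruction loss with a frequency-domain one can be illustrated with a minimal NumPy sketch. This is not GLaMa's actual loss: the L1 form of both terms, the orthonormal FFT scaling, and the names `spatial_l1`, `frequency_l1`, `joint_loss` and `freq_weight` are all assumptions made for illustration.

```python
import numpy as np

def spatial_l1(pred, target):
    """Mean absolute pixel difference (spatial reconstruction term)."""
    return float(np.mean(np.abs(pred - target)))

def frequency_l1(pred, target):
    """Mean magnitude of the difference between the 2-D spectra.

    Assumption: orthonormal FFT scaling and an L1-style penalty on the
    complex difference; the abstract does not specify the exact form.
    """
    diff = np.fft.fft2(pred, norm="ortho") - np.fft.fft2(target, norm="ortho")
    return float(np.mean(np.abs(diff)))

def joint_loss(pred, target, freq_weight=1.0):
    """Hypothetical weighted sum of the spatial and frequency terms."""
    return spatial_l1(pred, target) + freq_weight * frequency_l1(pred, target)

# Toy example: a clean image versus one with a checkerboard artifact.
rng = np.random.default_rng(0)
target = rng.random((64, 64))
rows, cols = np.indices((64, 64))
checker = 0.05 * (((rows + cols) % 2) * 2 - 1)  # alternating +0.05 / -0.05
noisy = target + checker

clean_loss = joint_loss(target, target)
artifact_loss = joint_loss(noisy, target)
```

In this toy case the checkerboard pattern concentrates its energy in a single high-frequency coefficient, which is the kind of structure a frequency-domain term can target directly.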


“…After a lot of experiments, we find that the policy of mask generation noticeably influences the performance of the inpainting model as shown in Tab. 4.…”

Section: Training With General Mask (mentioning)

confidence: 99%

GLaMa: Joint Spatial and Frequency Loss for General Image Inpainting

Lu, Jiang, Huang, et al. 2022

Preprint


“…First, to obtain the latent code $w_1 \in W^+$ of the first frame, we make use of the existing GAN inversion model called FDIT [28], which aims to invert an image back to the latent space of a pre-trained generator. Then $w_1$ is used for the initial cell $\mathrm{LSTM}_i$ to get the initialization state of motion prediction:…”

Section: B Motion Inference (mentioning)

confidence: 99%

Text-driven Video Prediction

Xue, Chen, Zhu, et al. 2022

Preprint


Current video generation models usually convert signals indicating appearance and motion received from inputs (e.g., image, text) or latent spaces (e.g., noise vectors) into consecutive frames, fulfilling a stochastic generation process for the uncertainty introduced by latent code sampling. However, this generation pattern lacks deterministic constraints for both appearance and motion, leading to uncontrollable and undesirable outcomes. To this end, we propose a new task called Text-driven Video Prediction (TVP). Taking the first frame and text caption as inputs, this task aims to synthesize the following frames. Specifically, appearance and motion components are provided by the image and caption separately. The key to addressing the TVP task depends on fully exploring the underlying motion information in text descriptions, thus facilitating plausible video generation. In fact, this task is intrinsically a cause-and-effect problem, as the text content directly influences the motion changes of frames. To investigate the capability of text in causal inference for progressive motion information, our TVP framework contains a Text Inference Module (TIM), producing step-wise embeddings to regulate motion inference for subsequent frames. In particular, a refinement mechanism incorporating global motion semantics guarantees coherent generation. Extensive experiments are conducted on Something-Something V2 and Single Moving MNIST datasets. Experimental results demonstrate that our model achieves better results over other baselines, verifying the effectiveness of the proposed framework.


“…The image-to-image translation task aims to match the style of images from a source domain into a target domain, while retaining the original structure of the source images [3,23,37]. For current unpaired image-to-image translation datasets, the generative adversarial nets (GAN) [10,31,43,41,24] are able to generate images that match the style, but the adversarial loss suffers from the collapse of the structure.…”

Section: Introduction (mentioning)

confidence: 99%

“…To alleviate this problem, a family of methods based on cycle-consistency learning [44,3,12,9,22] have been proposed. Cycle-consistency assumes that there exists a reversible relationship between the source image and the target image.…”

Section: Introduction (mentioning)

confidence: 99%
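The reversibility assumption behind cycle-consistency, as described in the statement above, can be sketched as follows. This is a toy illustration rather than any cited method's implementation: the mappings `G` and `F` are hypothetical stand-ins for the two translation networks, and the L1 form of the penalty is an assumption.

```python
import numpy as np

def cycle_consistency_loss(G, F, x, y):
    """L1 cycle loss: x -> G -> F should return x, and y -> F -> G should return y."""
    forward = np.mean(np.abs(F(G(x)) - x))   # source -> target -> source
    backward = np.mean(np.abs(G(F(y)) - y))  # target -> source -> target
    return float(forward + backward)

# Hypothetical toy "translators": G maps source to target, F maps back.
G = lambda x: 2.0 * x + 1.0
F_exact = lambda y: (y - 1.0) / 2.0         # true inverse of G
F_biased = lambda y: (y - 1.0) / 2.0 + 0.1  # inverse with a constant error

rng = np.random.default_rng(0)
x = rng.random((8, 8))
y = rng.random((8, 8))

loss_exact = cycle_consistency_loss(G, F_exact, x, y)
loss_biased = cycle_consistency_loss(G, F_biased, x, y)
```

When `F` exactly inverts `G`, the loss is zero up to floating-point error; any mismatch between the two mappings shows up as a positive penalty.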

Exploring Negatives in Contrastive Learning for Unpaired Image-to-Image Translation

Lin, Zhang, Chen, et al. 2022

Preprint


Figure 1: Visualization of the learned similarity by the feature extractor. Given an input and output image, we extract the features of these images through a feature extractor. We compute the learned similarities between the feature vectors of $[(v, v^-_1), \ldots, (v, v^-_N)]$ by using $\exp(v \cdot v^- / \tau)$. Specifically, $v$ is a query element (the highlighted red dot in the output) and $[v^-_1, \ldots, v^-_N]$ are all the candidate patches in the input. Compared with other I2I translation method [28], the feature extractor of our model learns the cross-domain correspondence with a better saliency effect.
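The similarity in the caption above, the exponential of a query-candidate dot product divided by a temperature, can be computed as in the following sketch. The L2 normalization of the features, the temperature value 0.07, and the function name `patch_similarities` are assumptions for illustration and are not stated in the caption.

```python
import numpy as np

def patch_similarities(query, candidates, tau=0.07):
    """Return exp(v . v_i / tau) for a query v and each candidate patch v_i.

    Assumption: features are L2-normalized first, so each dot product is
    a cosine similarity in [-1, 1].
    """
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.exp(c @ q / tau)

# Toy features: 5 candidate patches of dimension 16; candidate 2 equals the query.
rng = np.random.default_rng(0)
candidates = rng.standard_normal((5, 16))
query = candidates[2].copy()

sims = patch_similarities(query, candidates)
best = int(np.argmax(sims))
```

A small temperature sharpens the contrast between candidates, so the patch matching the query dominates the similarity map.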

