Abstract: Adversarial purification refers to a class of defense methods that remove adversarial perturbations using a generative model. These methods make no assumptions about the form of the attack or the classification model, and thus can defend pre-existing classifiers against unseen threats. However, their performance currently falls behind adversarial training methods. In this work, we propose DiffPure, which uses diffusion models for adversarial purification: given an adversarial example, we first diffuse it with a small amount of noise following a forward diffusion process, and then recover the clean image through a reverse generative process.
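The two-step recipe in the abstract (forward-diffuse, then reverse-denoise) is compact enough to sketch. Below is a minimal, hypothetical PyTorch sketch, assuming a pretrained noise-prediction network `eps_model(x_t, t)` and a standard linear beta schedule; the paper itself solves a reverse SDE and uses the adjoint method for gradients, whereas this sketch uses plain DDPM ancestral sampling. `t_star`, the amount of noise added before purification, is a tunable hyperparameter.

```python
import torch

def purify(x_adv, eps_model, t_star=100, num_steps=1000):
    # Linear beta schedule (assumed; must match the pretrained model's schedule).
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Forward diffusion: perturb the adversarial input with a small amount of
    # noise in one shot, up to timestep t_star.
    noise = torch.randn_like(x_adv)
    x_t = alpha_bars[t_star].sqrt() * x_adv + (1 - alpha_bars[t_star]).sqrt() * noise

    # Reverse generative process: DDPM ancestral sampling from t_star down to 0.
    for t in range(t_star, -1, -1):
        t_batch = torch.full((x_t.shape[0],), t, dtype=torch.long)
        eps = eps_model(x_t, t_batch)
        mean = (x_t - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        x_t = mean + betas[t].sqrt() * torch.randn_like(x_t) if t > 0 else mean
    return x_t  # purified image; feed to the unchanged downstream classifier
```

The key design choice is `t_star`: large enough that the added noise washes out the adversarial perturbation, small enough that the reverse process preserves the image's semantics.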
“…The work of [15] first studied this problem; however, they do not obtain significant accuracy improvements, likely because the diffusion models available at the time were not good enough. Separately, [18] suggest that diffusion models might be able to provide strong empirical robustness to adversarial examples, as evaluated by robustness under adversarial attacks computed using existing attack algorithms; this is orthogonal to our results.…”
In this paper we show how to achieve state-of-the-art certified adversarial robustness to ℓ2-norm bounded perturbations by relying exclusively on off-the-shelf pretrained models. To do so, we instantiate the denoised smoothing approach of Salman et al. by combining a pretrained denoising diffusion probabilistic model and a standard high-accuracy classifier. This allows us to certify 71% accuracy on ImageNet under adversarial perturbations constrained to be within an ℓ2 norm of ε = 0.5, an improvement of 14 percentage points over the prior certified SoTA using any approach, or an improvement of 30 percentage points over denoised smoothing. We obtain these results using only pretrained diffusion models and image classifiers, without requiring any fine-tuning or retraining of model parameters.
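A hedged sketch of the pipeline this abstract describes: match the randomized-smoothing noise level σ to a diffusion timestep, denoise each noisy sample in one shot with the pretrained model, and let the frozen classifier vote. `eps_model` and `clf` are placeholder names, not the authors' code; the σ-to-timestep mapping follows from equating the smoothing noise N(0, σ²) with the diffusion marginal x_t = √ᾱ_t·x + √(1−ᾱ_t)·ε.

```python
import torch

def find_timestep(sigma, alpha_bars):
    # Diffusion at step t has x_t = sqrt(a_t) x + sqrt(1 - a_t) eps,
    # i.e. an effective noise variance of (1 - a_t) / a_t after rescaling.
    ratios = (1 - alpha_bars) / alpha_bars
    return int(torch.argmin((ratios - sigma ** 2).abs()))

def denoised_prediction(x, sigma, eps_model, clf, alpha_bars):
    t = find_timestep(sigma, alpha_bars)
    a_t = alpha_bars[t]
    # One smoothing noise draw, rescaled into the diffusion convention.
    x_noisy = x + sigma * torch.randn_like(x)
    x_t = a_t.sqrt() * x_noisy
    # One-shot denoising: predict the clean image directly from x_t.
    t_batch = torch.full((x.shape[0],), t, dtype=torch.long)
    x0_hat = (x_t - (1 - a_t).sqrt() * eps_model(x_t, t_batch)) / a_t.sqrt()
    # The frozen classifier votes on the denoised image; repeating this many
    # times and aggregating votes yields the randomized-smoothing certificate.
    return clf(x0_hat).argmax(dim=-1)
```

Note that nothing here is trained: the diffusion model and the classifier are both used exactly as released, which is the "for free" part of the result.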
“…The robustness continues to scale with model capacity, and RobArch-L achieves the new SOTA AA accuracy on the RobustBench leaderboard. It is important to note that ResNet-50+DiffPure [39] designed a novel AT method by using diffusion models [22] for adversarial purification. Although that method improves the AA accuracy by 5.97 percentage points, our architecture modifications show stronger robustness even without fine-tuning the Standard-AT method.…”
Adversarial Training is the most effective approach for improving the robustness of Deep Neural Networks (DNNs). However, compared to the large body of research in optimizing the adversarial training process, there are few investigations into how architecture components affect robustness, and they rarely constrain model capacity. Thus, it is unclear where robustness precisely comes from. In this work, we present the first large-scale systematic study on the robustness of DNN architecture components under fixed parameter budgets. Through our investigation, we distill 18 actionable robust network design guidelines that empower model developers to gain deep insights. We demonstrate these guidelines' effectiveness by introducing the novel Robust Architecture (RobArch) model that instantiates the guidelines to build a family of top-performing models across parameter capacities against strong adversarial attacks. RobArch achieves the new state-of-the-art AutoAttack accuracy on the RobustBench ImageNet leaderboard. The code is available at https://github.com/ShengYun-Peng/RobArch.
“…Existing research has explored adversarial examples for different generative models, yet no proper framework has been formulated. Diffusion models are used to improve the adversarial robustness of classifiers (Nie et al., 2022). Kos et al. (2018)…”
Section: Related Work
confidence: 99%
“…Furthermore, the training objective of diffusion models is optimized indirectly through a variational bound and thus is not directly applicable to the optimization of adversarial examples. For these reasons, existing research only considers diffusion models as aids to improve the robustness of classifiers (Nie et al., 2022), leaving a gap in the formulation of adversarial examples for diffusion models.…”
Diffusion Models (DMs) achieve state-of-the-art performance in generative tasks, boosting a wave in AI for Art. Despite their commercial success, DMs meanwhile provide tools for copyright violations, where infringers benefit from illegally using paintings created by human artists to train DMs and generate novel paintings in a similar style. In this paper, we show that it is possible to create an image x′ that is similar to an image x for human vision but unrecognizable for DMs. We build a framework to define and evaluate this adversarial example for diffusion models. Based on the framework, we further propose AdvDM, an algorithm to generate adversarial examples for DMs. By optimizing upon different latent variables sampled from the reverse process of DMs, AdvDM conducts a Monte-Carlo estimation of adversarial examples for DMs. Extensive experiments show that the estimated adversarial examples can effectively hinder DMs from extracting their features. Our method can be a powerful tool for human artists to protect their copyright against infringers with DM-based AI-for-Art applications.
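Read literally, AdvDM's core loop is projected gradient ascent on the diffusion training loss, Monte-Carlo sampled over timesteps and noise draws. The sketch below is an illustrative reconstruction under that reading, not the authors' implementation; `eps_model`, the ℓ∞ budget, and the step size are assumed placeholders.

```python
import torch
import torch.nn.functional as F

def advdm_attack(x, eps_model, alpha_bars, budget=8/255, step=1/255, iters=40):
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(iters):
        # Monte-Carlo sample: one random timestep and noise draw per iteration.
        t = torch.randint(0, len(alpha_bars), (x.shape[0],))
        noise = torch.randn_like(x)
        a_t = alpha_bars[t].view(-1, 1, 1, 1)
        x_t = a_t.sqrt() * (x + delta) + (1 - a_t).sqrt() * noise
        # Diffusion training objective (noise-prediction MSE); ascending it
        # pushes (x + delta) off the model's learned data manifold.
        loss = F.mse_loss(eps_model(x_t, t), noise)
        loss.backward()
        with torch.no_grad():
            delta += step * delta.grad.sign()
            delta.clamp_(-budget, budget)  # l_inf projection onto the budget
            delta.grad.zero_()
    # A real attack would also clamp x + delta to the valid pixel range.
    return (x + delta).detach()
```

The design mirrors standard PGD, except the loss being ascended is the generative model's own training objective rather than a classifier's cross-entropy, which is why no labels are needed.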