2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.01617
Black-box Detection of Backdoor Attacks with Limited Information and Data

Cited by 49 publications (27 citation statements)
References 19 publications
“…Model manipulations require an adversary who can influence the training process or data, or even control the model itself. This is enabled by poisoning attacks [43,77,78], or can be mounted with query-based access only [24,34,57]; for instance, when models are deployed in embedded systems or on MLaaS platforms. More practically, it can also be achieved by replacing the entire model as part of an intrusion, breaching the integrity of existing deployments.…”
Section: B. Model Manipulation (mentioning)
confidence: 99%
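The poisoning route mentioned above can be illustrated with a minimal, BadNets-style sketch: a small patch is stamped onto a fraction of the training images, which are then relabeled to an attacker-chosen target class. The helper names (stamp_trigger, poison_dataset), the patch location, and the 10% poisoning rate are illustrative assumptions, not details taken from the cited works.

```python
import torch

def stamp_trigger(images, patch_size=3, value=1.0):
    """Stamp a small square patch (the trigger) into the bottom-right corner.

    images: float tensor of shape (N, C, H, W) with values in [0, 1].
    """
    poisoned = images.clone()
    poisoned[:, :, -patch_size:, -patch_size:] = value
    return poisoned

def poison_dataset(images, labels, target_class, poison_rate=0.1):
    """Stamp the trigger onto a random fraction of samples and relabel them
    to the attacker's target class (BadNets-style dirty-label poisoning)."""
    n = images.size(0)
    idx = torch.randperm(n)[: int(poison_rate * n)]
    images, labels = images.clone(), labels.clone()
    images[idx] = stamp_trigger(images[idx])
    labels[idx] = target_class
    return images, labels
```

Training an otherwise standard classifier on the returned data yields a model that behaves normally on clean inputs but predicts the target class whenever the patch is present.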
“…Moreover, training the shadow networks requires a relatively large number of clean samples and is computationally expensive. REDs (reverse-engineering defenses), another family of post-training (PT) defenses, attempt to reverse-engineer the backdoor pattern (BP) for each putative target class [42,14,25,9,43,45]. Such reverse-engineering is performed using a small set of clean samples possessed by the defender [42], or using simulated samples obtained by model inversion [3,11].…”
Section: Backdoor Defenses (mentioning)
confidence: 99%
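The reverse-engineering step described above can be sketched as a small optimization in the style of Neural Cleanse: for each putative target class, search for a (mask, pattern) pair that flips clean samples to that class while keeping the mask's L1 norm small. The PyTorch sketch below is an assumption-laden illustration (the sigmoid parameterization, hyperparameters, and (image, label) loader format are mine), not the exact procedure of any cited defense.

```python
import torch
import torch.nn.functional as F

def reverse_engineer_trigger(model, clean_loader, target_class, image_shape,
                             epochs=10, lr=0.1, lam=0.01, device="cpu"):
    """Optimize a (mask, pattern) pair that drives clean samples to `target_class`
    while penalizing the mask's L1 norm (Neural-Cleanse-style RED objective).

    clean_loader is assumed to yield (images, labels) batches with images in [0, 1].
    """
    c, h, w = image_shape
    mask_logit = torch.zeros(1, 1, h, w, device=device, requires_grad=True)
    pattern_logit = torch.zeros(1, c, h, w, device=device, requires_grad=True)
    opt = torch.optim.Adam([mask_logit, pattern_logit], lr=lr)
    model.eval()
    for _ in range(epochs):
        for x, _ in clean_loader:
            x = x.to(device)
            mask = torch.sigmoid(mask_logit)          # blending mask in [0, 1]
            pattern = torch.sigmoid(pattern_logit)    # trigger pattern in [0, 1]
            x_trig = (1 - mask) * x + mask * pattern  # stamp the candidate trigger
            target = torch.full((x.size(0),), target_class,
                                dtype=torch.long, device=device)
            loss = F.cross_entropy(model(x_trig), target) + lam * mask.abs().sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return torch.sigmoid(mask_logit).detach(), torch.sigmoid(pattern_logit).detach()
```

Running this once per putative target class gives one reversed trigger per class, which is the input to the anomaly-detection step discussed in the next citation statement.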
“…Existing PT defenses typically assume that the defender independently possesses a small set of clean, legitimate samples from every class. These samples may be used: i) to reverse-engineer putative BPs, which form the basis for anomaly detection [42,14,48,25,9,43,45,34,49]; or ii) to train shadow neural networks with and without (known) backdoor attacks (BAs), based on which a binary "meta-classifier" is trained to predict whether the classifier under inspection has been backdoor-attacked [18,51,40]. However, these methods assume that the BP type (the mechanism for embedding a BP) used by the attacker is known.…”
Section: Introduction (mentioning)
confidence: 99%
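For route i), the anomaly-detection step is often a simple robust statistic over the per-class reversed-trigger sizes: a class whose reversed mask is abnormally small is flagged as the likely backdoor target. Below is a minimal MAD-based sketch under that assumption; the 1.4826 consistency constant and the customary threshold of about 2 are conventions, and the function name is hypothetical.

```python
import torch

def anomaly_indices(mask_l1_norms):
    """MAD-based anomaly index over per-class reversed-trigger L1 norms.

    mask_l1_norms: 1-D tensor, one entry per putative target class.
    Returns a score per class; a score above roughly 2 is commonly read as
    evidence that the class is a backdoor target (its trigger is unusually small).
    """
    norms = mask_l1_norms.float()
    median = norms.median()
    mad = (norms - median).abs().median() * 1.4826  # consistency constant for Gaussian data
    return (median - norms) / (mad + 1e-12)         # large when a class's trigger is unusually small
```

A typical use would be to call reverse_engineer_trigger for every class, stack the resulting mask L1 norms, and flag classes whose index exceeds the threshold.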
“…These methods [48], [49], [50] detect poisoned images by reversing potential triggers contained in given suspicious DNNs. They rest on the implicit assumption that the trigger is sample-agnostic and the attack is targeted.…”
Section: Resistance to Trigger Synthesis Based Detections (mentioning)
confidence: 99%
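A quick way to see why the sample-agnostic, targeted-attack assumption matters: a single reversed (mask, pattern) pair is stamped uniformly onto clean inputs, and detection hinges on most of them flipping to one target class. The sketch below measures that flip rate; it is an illustrative check under the stated assumptions, not a procedure from the cited works.

```python
import torch

@torch.no_grad()
def trigger_flip_rate(model, clean_loader, mask, pattern, target_class, device="cpu"):
    """Fraction of clean inputs that a single reversed trigger flips to `target_class`.

    A high, input-independent flip rate is exactly what the sample-agnostic,
    targeted-attack assumption predicts; sample-specific triggers break it.
    """
    model.eval()
    flipped, total = 0, 0
    for x, _ in clean_loader:
        x = x.to(device)
        x_trig = (1 - mask) * x + mask * pattern   # stamp the reversed trigger
        preds = model(x_trig).argmax(dim=1)
        flipped += (preds == target_class).sum().item()
        total += x.size(0)
    return flipped / max(total, 1)
```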