In Machine Learning as a Service, a provider trains a deep neural network and gives many users access to it. However, the hosted (source) model is susceptible to model stealing attacks, in which an adversary derives a surrogate model from API access to the source model. For post hoc detection of such attacks, the provider needs a robust method to determine whether a suspect model is a surrogate of their model. We propose a fingerprinting method for deep neural networks that extracts a set of inputs from the source model so that only surrogates agree with the source model on the classification of these inputs. These inputs are a specifically crafted subclass of targeted transferable adversarial examples, which we call conferrable adversarial examples, that transfer exclusively from a source model to its surrogates. We propose new methods to generate these conferrable adversarial examples and use them as our fingerprint. Our fingerprint is the first to be shown robust against distillation attacks, and our experiments show that this robustness extends to weaker removal attacks such as fine-tuning, ensemble attacks, and adversarial training, as well as to stronger adaptive attacks specifically designed against our fingerprint. We even protect against a powerful adversary with white-box access to the source model, whereas the defender only needs black-box access to the surrogate model. We conduct our experiments on the CINIC dataset, which is a superset of CIFAR-10, and on a 100-class subset of ImageNet32. Our experiments show that our fingerprint perfectly separates surrogate and reference models. We measure a fingerprint retention of 100% in all evaluated attacks for surrogate models whose test accuracy differs from that of the source model by at most five percentage points.
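The verification step described in this abstract can be illustrated with a minimal sketch. It assumes that fingerprint retention is the fraction of fingerprint inputs on which a suspect model, queried as a black box, returns the label the source model assigns. The names `fingerprint_retention` and `is_surrogate`, the decision threshold of 0.75, and the toy models are hypothetical choices for illustration and are not taken from the paper.

```python
from typing import Callable, Sequence


def fingerprint_retention(
    suspect_predict: Callable[[int], int],
    fingerprint_inputs: Sequence[int],
    target_labels: Sequence[int],
) -> float:
    """Fraction of fingerprint inputs on which the suspect model
    outputs the label assigned by the source model."""
    if len(fingerprint_inputs) != len(target_labels):
        raise ValueError("inputs and labels must have the same length")
    if len(fingerprint_inputs) == 0:
        raise ValueError("fingerprint must contain at least one input")
    matches = sum(
        1
        for x, y in zip(fingerprint_inputs, target_labels)
        if suspect_predict(x) == y
    )
    return matches / len(fingerprint_inputs)


def is_surrogate(retention: float, threshold: float = 0.75) -> bool:
    """Flag the suspect as a surrogate when retention reaches the
    threshold. The value 0.75 is an assumed placeholder."""
    return retention >= threshold


# Toy stand-ins: integers play the role of fingerprint inputs, and
# plain functions play the role of black-box prediction APIs.
fingerprint_inputs = [0, 1, 2, 3]
target_labels = [1, 0, 1, 0]  # labels the source model assigns


def surrogate_model(x: int) -> int:
    # Agrees with the source model on every fingerprint input.
    return (x + 1) % 2


def reference_model(x: int) -> int:
    # Independently trained stand-in that always predicts class 0.
    return 0


surrogate_retention = fingerprint_retention(
    surrogate_model, fingerprint_inputs, target_labels
)
reference_retention = fingerprint_retention(
    reference_model, fingerprint_inputs, target_labels
)
print(surrogate_retention, is_surrogate(surrogate_retention))
print(reference_retention, is_surrogate(reference_retention))
```

Only label queries on the suspect model are needed here, which matches the black-box access assumed for the defender. How the threshold is chosen in practice would depend on how well retention separates surrogate from reference models.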
Deep image classification models trained on large amounts of web-scraped data are vulnerable to data poisoning, a mechanism for backdooring models. Even a few poisoned samples seen during training can entirely undermine the model's integrity during inference. While it is known that poisoning more samples enhances an attack's effectiveness and robustness, it is unknown whether poisoning too many samples weakens an attack by making it more detectable. We observe a fundamental detectability/robustness trade-off in data poisoning attacks: poisoning too few samples renders an attack ineffective and not robust, but poisoning too many samples makes it detectable. This raises the bar for data poisoning attackers, who have to balance this trade-off to remain robust and undetectable. Our work proposes two defenses designed to (i) detect and (ii) repair poisoned models as a post-processing step after training, using a limited amount of trusted image-label pairs. We show that our defenses mitigate all surveyed attacks and outperform existing defenses while requiring a smaller amount of trusted data to repair a model. Our defense scales to joint vision-language models, such as CLIP, and interestingly, we find that attacks on larger models are more easily detectable but also more robust than those on smaller models. Lastly, we propose two adaptive attacks demonstrating that while our work raises the bar for data poisoning attacks, it cannot mitigate all forms of backdooring.
CCS CONCEPTS • Security and privacy → Software and application security; • Computing methodologies → Machine learning.