“…For instance, Wang et al. (2021) train a recurrent neural network that captures the difference in the logits distribution of manipulated samples. Aigrain and Detyniecki (2019), instead, achieve good detection performance by feeding a simple three-layer neural network directly with the logit activations.…”
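As a concrete illustration of the second approach, the sketch below implements a small three-layer network that scores a classifier's logits as clean or adversarial. This is a minimal PyTorch sketch, not the authors' code: the hidden width, the 10-class assumption, and the training objective are illustrative choices.

import torch
import torch.nn as nn

# Minimal sketch of a logits-based detector in the spirit of Aigrain and
# Detyniecki (2019): three fully connected layers mapping the target
# classifier's logit vector to a single clean-vs-adversarial score.
# num_classes=10 and hidden=32 are illustrative, not the paper's values.
class LogitsDetector(nn.Module):
    def __init__(self, num_classes: int = 10, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_classes, hidden),  # layer 1: logits -> hidden
            nn.ReLU(),
            nn.Linear(hidden, hidden),       # layer 2
            nn.ReLU(),
            nn.Linear(hidden, 1),            # layer 3: hidden -> score
        )

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        # Raw score; apply a sigmoid and threshold at inference time.
        return self.net(logits)

# Trained with binary labels: 0 for clean inputs, 1 for adversarial ones.
detector = LogitsDetector()
loss_fn = nn.BCEWithLogitsLoss()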
“…Previous research showed that analyzing the model's logits leads to promising results in discriminating manipulated inputs (Wang et al., 2021; Aigrain and Detyniecki, 2019; Hendrycks and Gimpel, 2016). However, logits-based adversarial detectors have only been studied on computer vision applications.…”
Adversarial attacks are a major challenge faced by current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in computer vision has been carried out to develop reliable defense strategies. However, the same issue remains less explored in natural language processing. Our work presents a model-agnostic detector of adversarial text examples. The approach identifies patterns in the logits of the target classifier when perturbing the input text. The proposed detector improves the current state-of-the-art performance in recognizing adversarial inputs and exhibits strong generalization capabilities across different NLP models, datasets, and word-level attacks.
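The core mechanism of the abstract above, probing the target classifier with perturbed copies of the input text and recording how its logits react, can be sketched as follows. This is a minimal sketch, not the paper's implementation: target_model is a hypothetical callable returning a logit vector for a string, and word deletion stands in for whatever perturbation scheme the detector actually uses.

import numpy as np

def logit_reactions(text: str, target_model) -> np.ndarray:
    """Perturb the text one word at a time and record the logit shifts."""
    # target_model: hypothetical callable, returns an np.ndarray of logits.
    words = text.split()
    base = target_model(text)  # logits on the unperturbed text
    reactions = []
    for i in range(len(words)):
        perturbed = " ".join(words[:i] + words[i + 1:])   # drop word i
        reactions.append(target_model(perturbed) - base)  # reaction to the edit
    return np.stack(reactions)  # shape: (num_words, num_classes)

The resulting reaction matrix then serves as the feature input to a downstream detector such as the one sketched earlier.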
“…In supervised detection, the defender considers AEs generated by one or more adversarial attack algorithms when designing and training the detector D. It is believed that AEs have distinguishable features that set them apart from clean inputs [26]; defenders exploit this to build a robust detector D. To accomplish this, many approaches have been presented in the literature.…”
[Table fragment: softmax-based detectors and the attacks they cover ([80]: BIM, DF; [87]: FGSM, BIM, DF; [88]: FGSM, BIM, JSMA, DF); additional columns, including "Circumventable [25]", were lost in extraction.]
Section: Supervised Detection
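The supervised setup described above can be made concrete with a short sketch: attack a pool of clean samples, label both populations, and fit a binary detector D on some feature representation. build_detection_set, features, and attack are hypothetical names; any attack algorithm (e.g. FGSM or BIM) and any binary classifier can fill these roles.

import numpy as np
from sklearn.linear_model import LogisticRegression

def build_detection_set(clean_inputs, features, attack):
    """Label clean samples 0 and their attacked counterparts 1."""
    # features and attack are hypothetical callables, not a fixed API.
    X, y = [], []
    for x in clean_inputs:
        X.append(features(x))          # clean sample
        y.append(0)
        X.append(features(attack(x)))  # adversarial counterpart
        y.append(1)
    return np.array(X), np.array(y)

# D can be any binary classifier; logistic regression keeps the sketch simple.
# X, y = build_detection_set(clean_inputs, softmax_features, fgsm_attack)
# D = LogisticRegression().fit(X, y)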
“…The detector D considers an input an AE if there is no match between the baseline classifier and the retrained classifier. Aigrain et al. [87] built a simple NN detector D that takes the baseline model's logits for clean inputs and AEs as input to build a binary classifier. Finally, following the hypothesis that different models make different mistakes when presented with the same attack inputs, Monteiro et al. [88] proposed a bi-model mismatch detection.…”
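A minimal sketch of the mismatch rule summarized above, assuming both models return logits for the same input: an input is flagged as adversarial whenever the two classifiers disagree on its predicted label.

import torch

def mismatch_detect(x: torch.Tensor, model_a, model_b) -> bool:
    """Flag x as an AE when the two models' predicted labels differ."""
    # model_a / model_b: e.g. the baseline and retrained classifiers,
    # or the two models of a bi-model setup; both assumed to return logits.
    pred_a = model_a(x).argmax(dim=-1)
    pred_b = model_b(x).argmax(dim=-1)
    return bool((pred_a != pred_b).any())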
Fig. 1: The average detection rate of eight detectors assessed against white-, black-, and grey-box attack scenarios. The green points represent the average over all scenarios.
“…These two families of approaches have their own limitations: the former is more computationally expensive, while the latter provides a weaker defense, albeit at a lower computational overhead. Instead of making the model robust, there are also approaches that detect these attacks [26,16,48,3,47,29,43,23]. These methods often require retraining of the network [16,3,23].…”
Deep models are highly susceptible to adversarial attacks. Such attacks are carefully crafted, imperceptible perturbations that can fool the network and cause severe consequences when the model is deployed. To counter them, the model requires training data for adversarial training or explicit regularization-based techniques. However, privacy has become an important concern, restricting access to the trained model only, not the training data (e.g. biometric data). Also, data curation is expensive and companies may have proprietary rights over it. To handle such situations, we propose a completely novel problem of 'test-time adversarial defense in the absence of training data and even their statistics'. We solve it in two stages: (a) detection and (b) correction of adversarial samples. Our adversarial sample detection framework is initially trained on arbitrary data and is subsequently adapted to the unlabelled test data through unsupervised domain adaptation. We further correct the predictions on detected adversarial samples by transforming them in the Fourier domain and obtaining their low-frequency component at our proposed suitable radius for model prediction. We demonstrate the efficacy of our proposed technique via extensive experiments against several adversarial attacks and for different model architectures and datasets. For a non-robust ResNet-18 model pretrained on CIFAR-10, our detection method correctly identifies 91.42% of adversaries. Also, we significantly improve the adversarial accuracy from 0% to 37.37%, with a minimal drop of 0.02% in clean accuracy, on the state-of-the-art 'AutoAttack' without having to retrain the model.
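The correction step lends itself to a short sketch: mask the 2D Fourier spectrum of a detected adversarial image outside a radius r and inverse-transform, keeping only the low-frequency component. This is a minimal NumPy sketch under stated assumptions, a single-channel image and a free radius parameter, not the paper's code, which also prescribes how the suitable radius is chosen.

import numpy as np

def low_frequency_component(image: np.ndarray, r: float) -> np.ndarray:
    """Keep only spectral content within radius r of the zero frequency."""
    # image: (H, W) single channel; apply per channel for RGB.
    # r is a free parameter here; the paper proposes how to select it.
    H, W = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))  # center the zero frequency
    yy, xx = np.ogrid[:H, :W]
    dist = np.sqrt((yy - H / 2) ** 2 + (xx - W / 2) ** 2)
    spectrum[dist > r] = 0  # discard high-frequency components
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))

The corrected image is then passed to the unmodified pretrained model for prediction.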