“…For instance, Wang et al. (2021) train a recurrent neural network that captures the difference in the logits distribution of manipulated samples. Aigrain and Detyniecki (2019), instead, achieve good detection performance by feeding a simple three-layer neural network directly with the logit activations.…”
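As a concrete illustration of the second approach, the sketch below implements a small three-layer network that scores a classifier's logits as clean or adversarial. This is a minimal PyTorch sketch, not the authors' code: the hidden width, the 10-class assumption, and the training objective are illustrative choices.

import torch
import torch.nn as nn

# Minimal sketch of a logits-based detector in the spirit of Aigrain and
# Detyniecki (2019): three fully connected layers mapping the target
# classifier's logit vector to a single clean-vs-adversarial score.
# num_classes=10 and hidden=32 are illustrative, not the paper's values.
class LogitsDetector(nn.Module):
    def __init__(self, num_classes: int = 10, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_classes, hidden),  # layer 1: logits -> hidden
            nn.ReLU(),
            nn.Linear(hidden, hidden),       # layer 2
            nn.ReLU(),
            nn.Linear(hidden, 1),            # layer 3: hidden -> score
        )

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        # Raw score; apply a sigmoid and threshold at inference time.
        return self.net(logits)

# Trained with binary labels: 0 for clean inputs, 1 for adversarial ones.
detector = LogitsDetector()
loss_fn = nn.BCEWithLogitsLoss()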
“…Previous research showed that analyzing the model's logits leads to promising results in discriminating manipulated inputs (Wang et al., 2021; Aigrain and Detyniecki, 2019; Hendrycks and Gimpel, 2016). However, logits-based adversarial detectors have only been studied on computer vision applications.…”
Adversarial attacks are a major challenge faced by current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in computer vision has been carried out to develop reliable defense strategies. However, the same issue remains less explored in natural language processing. Our work presents a model-agnostic detector of adversarial text examples. The approach identifies patterns in the logits of the target classifier when perturbing the input text. The proposed detector improves the current state-of-the-art performance in recognizing adversarial inputs and exhibits strong generalization capabilities across different NLP models, datasets, and word-level attacks.
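The core mechanism of the abstract above, probing the target classifier with perturbed copies of the input text and recording how its logits react, can be sketched as follows. This is a minimal sketch, not the paper's implementation: target_model is a hypothetical callable returning a logit vector for a string, and word deletion stands in for whatever perturbation scheme the detector actually uses.

import numpy as np

def logit_reactions(text: str, target_model) -> np.ndarray:
    """Perturb the text one word at a time and record the logit shifts."""
    # target_model: hypothetical callable, returns an np.ndarray of logits.
    words = text.split()
    base = target_model(text)  # logits on the unperturbed text
    reactions = []
    for i in range(len(words)):
        perturbed = " ".join(words[:i] + words[i + 1:])   # drop word i
        reactions.append(target_model(perturbed) - base)  # reaction to the edit
    return np.stack(reactions)  # shape: (num_words, num_classes)

The resulting reaction matrix then serves as the feature input to a downstream detector such as the one sketched earlier.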
“…In supervised detection, the defender considers AEs generated by one or more adversarial attack algorithms when designing and training the detector D. It is believed that AEs have distinguishable features that set them apart from clean inputs [26]; defenders exploit this to build a robust detector D. To accomplish this, many approaches have been presented in the literature.…”
[Table fragment: softmax-based detectors and the attacks they cover ([80]: BIM, DF; [87]: FGSM, BIM, DF; [88]: FGSM, BIM, JSMA, DF); additional columns, including "Circumventable [25]", were lost in extraction.]
Section: Supervised Detection
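The supervised setup described above can be made concrete with a short sketch: attack a pool of clean samples, label both populations, and fit a binary detector D on some feature representation. build_detection_set, features, and attack are hypothetical names; any attack algorithm (e.g. FGSM or BIM) and any binary classifier can fill these roles.

import numpy as np
from sklearn.linear_model import LogisticRegression

def build_detection_set(clean_inputs, features, attack):
    """Label clean samples 0 and their attacked counterparts 1."""
    # features and attack are hypothetical callables, not a fixed API.
    X, y = [], []
    for x in clean_inputs:
        X.append(features(x))          # clean sample
        y.append(0)
        X.append(features(attack(x)))  # adversarial counterpart
        y.append(1)
    return np.array(X), np.array(y)

# D can be any binary classifier; logistic regression keeps the sketch simple.
# X, y = build_detection_set(clean_inputs, softmax_features, fgsm_attack)
# D = LogisticRegression().fit(X, y)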
“…The detector D considers an input an AE if there is no match between the baseline classifier and the retrained classifier. Aigrain et al. [87] built a simple NN detector D that takes the baseline model's logits for clean inputs and AEs as input to build a binary classifier. Finally, following the hypothesis that different models make different mistakes when presented with the same attack inputs, Monteiro et al. [88] proposed a bi-model mismatch detection.…”
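A minimal sketch of the mismatch rule summarized above, assuming both models return logits for the same input: an input is flagged as adversarial whenever the two classifiers disagree on its predicted label.

import torch

def mismatch_detect(x: torch.Tensor, model_a, model_b) -> bool:
    """Flag x as an AE when the two models' predicted labels differ."""
    # model_a / model_b: e.g. the baseline and retrained classifiers,
    # or the two models of a bi-model setup; both assumed to return logits.
    pred_a = model_a(x).argmax(dim=-1)
    pred_b = model_b(x).argmax(dim=-1)
    return bool((pred_a != pred_b).any())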
Fig. 1: The average detection rate of eight detectors assessed against white-, black-, and grey-box attack scenarios. The green points represent the average over all scenarios.
“…These two families of approaches have their own limitations: the former is more computationally expensive, while the latter provides a weaker defense, albeit at a lower computational overhead. Instead of making the model robust, there are also approaches that detect these attacks [26,16,48,3,47,29,43,23]. These methods often require retraining of the network [16,3,23].…”
Deep models are highly susceptible to adversarial attacks. Such attacks are carefully crafted, imperceptible perturbations that can fool the network and cause severe consequences when the model is deployed. To counter them, the model requires training data for adversarial training or explicit regularization-based techniques. However, privacy has become an important concern, restricting access to the trained model only, not the training data (e.g. biometric data). Also, data curation is expensive and companies may have proprietary rights over it. To handle such situations, we propose a completely novel problem of 'test-time adversarial defense in the absence of training data and even their statistics'. We solve it in two stages: (a) detection and (b) correction of adversarial samples. Our adversarial sample detection framework is initially trained on arbitrary data and is subsequently adapted to the unlabelled test data through unsupervised domain adaptation. We further correct the predictions on detected adversarial samples by transforming them in the Fourier domain and obtaining their low-frequency component at our proposed suitable radius for model prediction. We demonstrate the efficacy of our proposed technique via extensive experiments against several adversarial attacks and for different model architectures and datasets. For a non-robust ResNet-18 model pretrained on CIFAR-10, our detection method correctly identifies 91.42% of adversaries. Also, we significantly improve the adversarial accuracy from 0% to 37.37%, with a minimal drop of 0.02% in clean accuracy, on the state-of-the-art 'AutoAttack' without having to retrain the model.
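The correction step lends itself to a short sketch: mask the 2D Fourier spectrum of a detected adversarial image outside a radius r and inverse-transform, keeping only the low-frequency component. This is a minimal NumPy sketch under stated assumptions, a single-channel image and a free radius parameter, not the paper's code, which also prescribes how the suitable radius is chosen.

import numpy as np

def low_frequency_component(image: np.ndarray, r: float) -> np.ndarray:
    """Keep only spectral content within radius r of the zero frequency."""
    # image: (H, W) single channel; apply per channel for RGB.
    # r is a free parameter here; the paper proposes how to select it.
    H, W = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))  # center the zero frequency
    yy, xx = np.ogrid[:H, :W]
    dist = np.sqrt((yy - H / 2) ** 2 + (xx - W / 2) ** 2)
    spectrum[dist > r] = 0  # discard high-frequency components
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))

The corrected image is then passed to the unmodified pretrained model for prediction.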