2020
DOI: 10.48550/arxiv.2001.02438
Preprint

To Transfer or Not to Transfer: Misclassification Attacks Against Transfer Learned Text Classifiers

Abstract: Transfer learning, i.e., transferring learned knowledge, has brought a paradigm shift in the way models are trained. The lucrative benefits of improved accuracy and reduced training time have shown promise in training models with constrained computational resources and fewer training samples. Specifically, publicly available text-based models such as GloVe and BERT, trained on large corpora, have seen ubiquitous adoption in practice. However, the risks involved in using these public models for vari…

Cited by 5 publications (5 citation statements)
References 18 publications
“…Moreover, Abdelkader et al. [39] found that using a known feature extractor (i.e., a pre-trained model) exposes a fine-tuned model to powerful attacks that can be executed without any knowledge of the classifier head. Recently, Pal and Tople [62] exploited unintended features learnt in the pre-trained model to generate adversarial examples for fine-tuned models, achieving a high attack success rate in the text prediction domain.…”
Section: A. Adversarial Attacks in Transfer Learning (mentioning)
confidence: 99%
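The threat model described in this statement (an adversary who can query the public feature extractor but never the fine-tuned classifier head) can be illustrated with a small, purely hypothetical sketch. The code below is not the cited authors' method: it uses a random embedding table as a stand-in for a public encoder such as GloVe, hides a toy linear head from the attacker, and greedily substitutes words to maximize the representation shift measured in encoder space only. Whether the prediction actually flips depends on the unseen head; the point is only that the search never touches it.

```python
# Hypothetical sketch of a "known feature extractor, unknown head" attack.
# Vocabulary, embeddings, and the victim head are all made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["good", "great", "bad", "awful", "movie", "plot", "boring", "fun"]
EMB_DIM = 16
# Stand-in for a public pre-trained embedding table (e.g., GloVe); random here.
EMB = {w: rng.normal(size=EMB_DIM) for w in VOCAB}

def encode(tokens):
    """Public feature extractor: mean of token embeddings."""
    return np.mean([EMB[t] for t in tokens], axis=0)

# "Victim" fine-tuned head the attacker never sees: an arbitrary linear classifier
# applied on top of the frozen encoder output.
w_head = rng.normal(size=EMB_DIM)

def victim_predict(tokens):
    return int(encode(tokens) @ w_head > 0)

def attack(tokens, budget=2):
    """Greedy word substitution that maximizes the shift in encoder space only,
    mimicking attacks that need no knowledge of the classifier head."""
    orig = encode(tokens)
    adv = list(tokens)
    for _ in range(budget):
        best = None
        for i, tok in enumerate(adv):
            for cand in VOCAB:
                if cand == tok:
                    continue
                trial = adv[:i] + [cand] + adv[i + 1:]
                shift = np.linalg.norm(encode(trial) - orig)
                if best is None or shift > best[0]:
                    best = (shift, trial)
        adv = best[1]
    return adv

x = ["good", "fun", "movie"]
x_adv = attack(x)
print("victim prediction on original:", victim_predict(x))
print("adversarial tokens:", x_adv, "-> prediction:", victim_predict(x_adv))
```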
“…Connection to Adversarial Examples: Adversarial examples are minimally edited inputs that cause models to incorrectly change their predictions despite no change in true label (Jia and Liang, 2017; Ebrahimi et al., 2018; Pal and Tople, 2020). Recent methods for generating adversarial examples also preserve fluency (Li et al., 2020b; Song et al., 2020); however, approaches based on paraphrasing (Iyyer et al., 2018) or word replacement (Alzantot et al., 2018; Ren et al., 2019; Garg and Ramakrishnan, 2020) cannot be used to generate contrastive edits.…”
Section: Counterfactuals Beyond Explanations; Concurrent Work (mentioning)
confidence: 99%
“…Adversarial examples are minimally edited inputs that cause models to incorrectly change their predictions (Jia and Liang, 2017; Ebrahimi et al., 2018; Pal and Tople, 2020). While recent work on generating adversarial examples has also focused on preserving semantic coherence and meaning (Ribeiro et al., 2018; Ren et al., 2019; Garg and Ramakrishnan, 2020; Li et al., 2020; Song et al., 2020), the goal of adversarial examples differs from that of contrastive edits: adversarial examples are expected not to change the true label, so that a changed prediction indicates erroneous model behavior, whereas contrastive edits place no such constraint on the correctness of model output.…”
Section: Adversarial Examples (mentioning)
confidence: 99%