2022
DOI: 10.48550/arxiv.2202.07183
Preprint

A Survey of Neural Trojan Attacks and Defenses in Deep Learning

Abstract: Artificial Intelligence (AI) relies heavily on deep learning, a technology that is becoming increasingly popular in real-life applications of AI, even in safety-critical and high-risk domains. However, it has recently been discovered that deep learning can be manipulated by embedding Trojans inside it. Unfortunately, pragmatic solutions to circumvent the computational requirements of deep learning, e.g. outsourcing model training or data annotation to third parties, further add to model susceptibility to the Tro…

Cited by 4 publications (6 citation statements)
References 86 publications
“…(4) Adversarial examples can be used as interpretability tools [43], [67], [117], [241]. (5) Finally, adversarial trojan detection methods can also be used as interpretability/debugging tools [90], [98], [156], [252], [253]. 1 The works referenced in this paragraph are not limited only to inner interpretability methods.…”
Section: Discussion
confidence: 99%
“…The authors suggested a defense mechanism that works by fine-tuning the model on a variety of clean datasets, and they showed that it is effective on numerous benchmark datasets. Wang et al. [81] examined the field of neural trojan attacks. The authors provided an overview of current attack techniques and defense tactics and discussed the significance of creating reliable models to prevent trojan attacks.…”
Section: Susceptibility Of Deep Learning Systems To Backdoor Attacks ...
confidence: 99%
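The fine-tuning defense mentioned in the statement above can be sketched minimally: continue training a possibly poisoned model on clean data so its weights drift back toward clean behavior. The linear model, learning rate, and toy data below are illustrative assumptions, not the cited defense's exact procedure.

```python
import numpy as np

def fine_tune(weights, clean_X, clean_y, lr=0.1, epochs=50):
    """Gradient-descent fine-tuning of a linear model on clean data.

    Illustrative sketch only: the linear setup and hyperparameters are
    assumptions, not the configuration used in the cited work.
    """
    w = weights.copy()
    for _ in range(epochs):
        preds = clean_X @ w                                   # forward pass
        grad = clean_X.T @ (preds - clean_y) / len(clean_y)   # MSE gradient
        w -= lr * grad                                        # step toward clean behavior
    return w

# Toy example: weights perturbed by a spurious (poisoned) component.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w
poisoned_w = true_w + np.array([0.0, 0.0, 5.0])
recovered_w = fine_tune(poisoned_w, X, y)
```

After fine-tuning, the spurious component shrinks and the weights approach the clean solution, which is the intuition behind fine-tuning-based defenses.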
“…Backdoor (a.k.a. Trojan) attacks manipulate visual models by forcing them to misbehave when exposed to a 'trigger' in the input (Wang, Hassan, and Akhtar 2022). These attacks are stealthy because the model behaves normally for clean inputs, and the model user is unaware of the trigger pattern.…”
Section: Backdoor Detection
confidence: 99%
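The trigger mechanism described above can be illustrated with a minimal sketch: a backdoored model behaves normally on clean inputs but misbehaves once a small trigger pattern is stamped into the input. The function name `stamp_trigger`, the patch shape, and its position are assumptions for illustration, not the trigger patterns used in the cited works.

```python
import numpy as np

def stamp_trigger(image, trigger, top_left=(0, 0)):
    """Return a copy of `image` with `trigger` pasted at `top_left`.

    Illustrative only: a square patch in a corner is one common style of
    trigger; real attacks use many patterns.
    """
    stamped = image.copy()
    r, c = top_left
    h, w = trigger.shape[:2]
    stamped[r:r + h, c:c + w] = trigger  # overwrite pixels with the trigger
    return stamped

# A clean 8x8 grayscale "image" and a 2x2 white-square trigger.
clean = np.zeros((8, 8))
trigger = np.ones((2, 2))
poisoned = stamp_trigger(clean, trigger, top_left=(6, 6))
```

The attack is stealthy precisely because `clean` and `poisoned` differ only in this small patch: a model trained to associate the patch with a target label behaves normally on `clean`.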
“…In Fig. 4, we show the trigger patterns used in our experiments, which are chosen at random based on the literature (Wang, Hassan, and Akhtar 2022). We apply the proposed input-agnostic saliency mapping to the compromised models.…”
Section: Backdoor Detection
confidence: 99%