2018 IEEE Security and Privacy Workshops (SPW)
DOI: 10.1109/spw.2018.00009
Audio Adversarial Examples: Targeted Attacks on Speech-to-Text

Abstract: We construct targeted audio adversarial examples on automatic speech recognition. Given any audio waveform, we can produce another that is over 99.9% similar but transcribes as any phrase we choose (recognizing up to 50 characters per second of audio). We apply our white-box iterative optimization-based attack to Mozilla's implementation of the end-to-end DeepSpeech model and show that it has a 100% success rate. The feasibility of this attack introduces a new domain in which to study adversarial examples.
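The white-box iterative optimization described in the abstract can be illustrated with a minimal sketch: gradient descent on a perturbation delta under an L-infinity bound. This is not the paper's actual code — the real attack optimizes DeepSpeech's CTC loss over raw audio; here a toy linear projection stands in for the model, and all names (`targeted_attack`, `loss_grad`, `W`) are illustrative assumptions.

```python
import numpy as np

def targeted_attack(x, target, loss_grad, steps=500, lr=0.05, eps=0.1):
    """Iteratively optimize a small perturbation delta so that x + delta
    drives a differentiable loss toward a chosen target, while keeping
    delta inside an L-infinity ball of radius eps."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        grad = loss_grad(x + delta, target)
        delta -= lr * grad                 # gradient step on the loss
        delta = np.clip(delta, -eps, eps)  # keep the perturbation small
    return x + delta

# Toy stand-in for a speech model: a fixed linear projection.
# (The paper instead attacks DeepSpeech via its CTC loss.)
W = np.array([[1.0, -0.5],
              [0.3,  2.0]])

def loss_grad(x, target):
    # Gradient of 0.5 * ||W @ x - target||^2 with respect to x.
    return W.T @ (W @ x - target)

x = np.array([0.2, -0.1])
target = W @ x + np.array([0.08, -0.04])  # a nearby target output
x_adv = targeted_attack(x, target, loss_grad)
print(np.linalg.norm(W @ x_adv - target))  # residual is near zero
```

The structure mirrors the paper's setting: the attacker controls only a bounded perturbation, and success is measured by whether the model's output matches the attacker-chosen target.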

Cited by 817 publications (862 citation statements). References 41 publications.
“…One of the most successful white-box attacking methods is C&W [9]. This method uses Connectionist Temporal Classification (CTC) loss function [26] for perturbation optimization.…”
Section: Attacking Methods
confidence: 99%
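The CTC loss mentioned in the citation above scores how likely a label sequence is under a model's frame-wise output distributions, summing over all valid alignments. A minimal sketch of the standard CTC forward (alpha) recursion, assuming plain probabilities rather than the log-space arithmetic a production implementation would use:

```python
import numpy as np

def ctc_loss(log_probs, labels, blank=0):
    """Negative log-likelihood of `labels` given frame-wise
    log-probabilities `log_probs` of shape (T, V), computed with the
    CTC forward (alpha) recursion over the blank-extended sequence."""
    T, V = log_probs.shape
    # Extended label sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S = len(ext)
    probs = np.exp(log_probs)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # Skip transition: allowed only between distinct non-blanks.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    # Total probability ends on the final label or the final blank.
    p = alpha[T - 1, S - 1] + (alpha[T - 1, S - 2] if S > 1 else 0.0)
    return -np.log(p)
```

Because this loss is differentiable with respect to the model's outputs, an attacker with white-box access can minimize it toward an arbitrary target transcription, which is what makes it usable as the perturbation-optimization objective.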
“…Earlier adversarial attacks were applied to machine learning models in the image domain [4,5,6,7,8], and these attacking methods have since spread to other domains, e.g. speech signals [9,10,11,12,13]. The adversary adds a very small optimized perturbation, undetectable by humans, to a legitimate input, generating an adversarial example that causes the learning model to return a wrong output.…”
Section: Introduction
confidence: 99%
“…Thus, simply playing the pre-generated universal perturbation near the victim speaker becomes a feasible way to launch adversarial attacks. To show the possibility of launching real-time attacks, we compare the attack launching time of the conventional individual targeted attack method [6] and our proposed universal attack for a given audio signal. In particular, the conventional targeted attack requires at least 15s to deploy, measured on a Tesla V100 GPU with 32GB memory, while our proposed universal method takes an average of only 0.015s, a 100× speedup.…”
Section: Attack Evaluation
confidence: 99%
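The speedup reported in the citation above comes from a change in where the cost is paid: a universal perturbation is optimized once offline, so applying it to a new clip is a single bounded vector addition rather than a fresh optimization loop. A hypothetical sketch (the perturbation here is random noise, standing in for an actually optimized one; `apply_universal` is an illustrative name):

```python
import numpy as np

# Stand-in for a universal perturbation optimized once offline.
universal_delta = np.random.default_rng(0).uniform(-0.01, 0.01, 16000)

def apply_universal(audio, delta=universal_delta):
    """Apply a precomputed universal perturbation to any audio clip.
    Per-clip cost is one addition plus a clip to the valid sample range,
    independent of how expensive the offline optimization was."""
    return np.clip(audio + delta[:len(audio)], -1.0, 1.0)
```

This contrasts with the per-input attack sketched earlier, whose cost scales with the number of optimization steps for every new waveform.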
“…Carlini & Wagner, 2018). While we would not expect this to be a widespread problem in typical online experimental settings, researchers are also beginning to devise strategies for counteracting adversarial examples (Madry et al, 2017).…”
Section: Speech-to-text Engines As a Driver For Scalable Online Verba
confidence: 99%