Adversarial Attacks on GMM I-Vector Based Speaker Verification Systems

Xu, Li; Zhong, Jinghua; Wu, Xixin; Yu, Jianwei; Liu, Xunying; Meng, Helen

doi:10.1109/icassp40776.2020.9053076

Cited by 75 publications

(88 citation statements)

References 25 publications


“…To illustrate the importance of live human volunteer, we performed audio replay detection using the model in [16]. We collected 45 audio adversarial examples from the previous studies [5,7,8] and 120 our physical adversarial examples. When performing physical attack, their adversarial examples had to be played by a speaker device, but our attack can be conducted with a live human adversary.…”

Section: Evaluation Of Physical Attacks (mentioning)

confidence: 99%

“…DNN-based ASV models [2,3,4] tend to have excellent performance, but many studies have shown that audio adversarial examples can make the ASV process give wrong decisions [5,6] or let adversary pass verification [7,8]. The transferability of audio adversarial examples across different models was also revealed in [5,6]. Audio adversarial examples could still remain effective after being played over the air in [9].…”

Section: Introduction (mentioning)

confidence: 99%

“…We call it the practical speaker verification (PSV) system. Previous studies [5,6,7,8,9] only consider attacking the speaker identity check module to let it break. But their adversarial examples will be rejected in the PSV system for audio replay or different speech content.…”

Section: Introduction (mentioning)

confidence: 99%


Attack on Practical Speaker Verification System Using Universal Adversarial Perturbations

Zhang, Zhao, Liu et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


In authentication scenarios, applications of practical speaker verification systems usually require a person to read a dynamic authentication text. Previous studies played an audio adversarial example as a digital signal to perform physical attacks, which would be easily rejected by audio replay detection modules. This work shows that by playing our crafted adversarial perturbation as a separate source when the adversary is speaking, the practical speaker verification system will misjudge the adversary as a target speaker. A two-step algorithm is proposed to optimize the universal adversarial perturbation to be text-independent and has little effect on the authentication text recognition. We also estimated room impulse response (RIR) in the algorithm which allowed the perturbation to be effective after being played over the air. In the physical experiment, we achieved targeted attacks with success rate of 100%, while the word error rate (WER) on speech recognition was only increased by 3.55%. And recorded audios could pass replay detection for the live person speaking.
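The over-the-air setup in this abstract, where the perturbation is played as a separate source while the adversary speaks, can be illustrated with a small signal model. This is a minimal sketch and not the authors' algorithm: it assumes the microphone receives the live speech unchanged plus the perturbation convolved with a room impulse response, and the names `convolve`, `received_mixture`, and the toy sample values are hypothetical.

```python
def convolve(signal, kernel):
    """Full discrete convolution of two sequences of floats."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def received_mixture(speech, perturbation, rir):
    """Microphone signal under the assumed model:
    live speech + (perturbation filtered by the room impulse response)."""
    played = convolve(perturbation, rir)
    n = max(len(speech), len(played))
    # Zero-pad both signals to a common length before summing.
    speech = list(speech) + [0.0] * (n - len(speech))
    played = played + [0.0] * (n - len(played))
    return [s + p for s, p in zip(speech, played)]


# Toy values only; a real RIR would be estimated from the room.
speech = [0.5, -0.25, 0.125, 0.0]
perturbation = [0.01, -0.01]
rir = [1.0, 0.5]
mixed = received_mixture(speech, perturbation, rir)
```

In an optimization loop, a mixture of this kind would be what the verification model scores, which is why an RIR estimate matters for the perturbation to survive playback.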


“…Most of the existing adversarial attacks [5] against speaker identification models exploit state-of-the-art methods originally developed for image classification, such as the Fast Gradient Sign Method (FGSM) [6] and its iterative version, Basic Iterative Method (BIM) [7]. Kreuk et al [8] and Li et al [9] explored the vulnerabilities of x-vector and i-vector based speaker verification models to FGSM adversarial attacks. Li et al [10] further integrated an estimate of room impulse responses with FGSM to generate adversarial audio files that may still be effective when played over-the-air against an x-vector based speaker recognition model.…”

mentioning

confidence: 99%

FoolHD: Fooling Speaker Identification by Highly Imperceptible Adversarial Disturbances

Shamsabadi, Teixeira, Abad et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Speaker identification models are vulnerable to carefully designed adversarial perturbations of their input signals that induce misclassification. In this work, we propose a white-box steganography-inspired adversarial attack that generates imperceptible adversarial perturbations against a speaker identification model. Our approach, FoolHD, uses a Gated Convolutional Autoencoder that operates in the DCT domain and is trained with a multi-objective loss function, to generate and conceal the adversarial perturbation within the original audio files. In addition to hindering speaker identification performance, this multi-objective loss accounts for human perception through a frame-wise cosine similarity between MFCC feature vectors extracted from the original and adversarial audio files. We validate the effectiveness of FoolHD with a 250-speaker identification x-vector network, trained using VoxCeleb, in terms of accuracy, success rate, and imperceptibility. Our results show that FoolHD generates highly imperceptible adversarial audio files (average PESQ scores above 4.30), while achieving a success rate of 99.6% and 99.2% in misleading the speaker identification model, for untargeted and targeted settings, respectively.
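The perceptual term mentioned in this abstract, a frame-wise cosine similarity between feature vectors of the original and adversarial audio, can be sketched as follows. This is only an illustration under assumptions: the MFCC frames are taken as already extracted lists of floats, the per-frame similarities are averaged (the abstract does not say how frames are aggregated), and `cosine` and `framewise_cosine` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def framewise_cosine(orig_frames, adv_frames):
    """Mean cosine similarity over aligned frames (averaging is an assumption)."""
    if not orig_frames or len(orig_frames) != len(adv_frames):
        raise ValueError("need the same non-zero number of frames")
    sims = [cosine(u, v) for u, v in zip(orig_frames, adv_frames)]
    return sum(sims) / len(sims)


# Toy 2-dimensional "MFCC" frames: first frame unchanged, second orthogonal.
orig = [[1.0, 0.0], [0.0, 2.0]]
adv = [[1.0, 0.0], [2.0, 0.0]]
score = framewise_cosine(orig, adv)
```

A score near 1 means the adversarial features stay close in direction to the original ones frame by frame; a loss would typically penalize `1 - score`.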


“…Automatic speaker verification (ASV) systems aim at confirming a claimed speaker identity against a spoken utterance, which has been widely applied into commercial devices and authorization tools. However, it is also broadly noticed that malicious attacks can easily degrade a well-developed ASV system, and such attacks may be classified into impersonation [1], replay [1], voice conversion (VC) [2], text-to-speech [3] synthesis (TTS) and the recently emerged adversarial attacks [4,5].…”

Section: Introduction (mentioning)

confidence: 99%

Replay and Synthetic Speech Detection with Res2Net Architecture

Weng et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


Existing approaches for replay and synthetic speech detection still lack generalizability to unseen spoofing attacks. This work proposes to leverage a novel model structure, so-called Res2Net, to improve the anti-spoofing countermeasure's generalizability. Res2Net mainly modifies the ResNet block to enable multiple feature scales. Specifically, it splits the feature maps within one block into multiple channel groups and designs a residual-like connection across different channel groups. Such connection increases the possible receptive fields, resulting in multiple feature scales. This multiple scaling mechanism significantly improves the countermeasure's generalizability to unseen spoofing attacks. It also decreases the model size compared to ResNet-based models. Experimental results show that the Res2Net model consistently outperforms ResNet34 and ResNet50 by a large margin in both physical access (PA) and logical access (LA) of the ASVspoof 2019 corpus. Moreover, integration with the squeeze-and-excitation (SE) block can further enhance performance. For feature engineering, we investigate the generalizability of Res2Net combined with different acoustic features, and observe that the constant-Q transform (CQT) achieves the most promising performance in both PA and LA scenarios. Our best single system outperforms other state-of-the-art single systems in both PA and LA of the ASVspoof 2019 corpus.
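The channel-group splitting with residual-like connections described in this abstract can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: each "channel" is a single number standing in for a feature map, `transform` is a hypothetical placeholder for the per-group convolution, and the wiring (first group passed through, each later group added to the previous group's output before its transform) follows the way the Res2Net block is commonly described, which the abstract itself does not spell out.

```python
def res2net_split(channels, scale, transform):
    """Split channels into `scale` groups and chain them hierarchically."""
    if scale < 1 or len(channels) % scale != 0:
        raise ValueError("channel count must be divisible by scale")
    width = len(channels) // scale
    groups = [channels[i * width:(i + 1) * width] for i in range(scale)]

    outputs = [groups[0]]  # first group is passed through unchanged
    prev = None
    for group in groups[1:]:
        # Residual-like connection: add the previous group's output, if any.
        inp = group if prev is None else [a + b for a, b in zip(group, prev)]
        prev = transform(inp)
        outputs.append(prev)
    # Concatenate the group outputs back into one channel list.
    return [c for grp in outputs for c in grp]


# Placeholder transform: doubling stands in for a learned convolution.
out = res2net_split([1, 2, 3, 4], scale=4, transform=lambda g: [2 * c for c in g])
```

Because each later group sees the accumulated output of the earlier ones, deeper groups have passed through more transforms, which is the source of the multiple receptive-field scales the abstract refers to.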

