Interspeech 2021
DOI: 10.21437/interspeech.2021-1712
Adversarial Disentanglement of Speaker Representation for Attribute-Driven Privacy Preservation

Abstract: In speech technologies, the speaker's voice representation is used in many applications, such as speech recognition, voice conversion, speech synthesis and, obviously, user authentication. Modern vocal representations of the speaker are based on neural embeddings. In addition to the targeted information, these representations usually contain sensitive information about the speaker, such as age, sex, physical state, education level or ethnicity. In order to allow the user to choose which information to protect, we …

Cited by 12 publications (7 citation statements)
References 31 publications
“…SplitDim-Adv models were only successful when training from scratch or when initialized from another SplitDim-Adv model (as was the case when fine-tuning SplitDim-Adv from VoxCeleb to SCOTUS). This could explain the findings of [21,20]. In Table 2, the speaker verification performance on VoxCeleb is shown for the baseline model alongside the SplitDim-Adv model. Firstly, we can see that disentangling the space has incurred a reduction in performance (4.22% to 6.68% EER), which is likely due to the addition of the four extra tasks of the SplitDim-Adv model (Gender, Gender-Adversary, Nationality, Nationality-Adversary).…”
Section: Results
confidence: 55%
“…The topic of disentangled speaker representations is also closely linked with the field of voice privacy [17,18,19], wherein certain attributes are desirable to obscure in speaker embeddings to protect against malicious attackers. Notably, the work of [20] used adversarial training to control the gender element of an auto-encoder architecture, seeking to be able to control that element and therefore provide gender-invariant representations. A follow-up paper [21] utilized normalizing flows to again obscure the gender information in speaker embeddings, finding this to be an improvement over the adversarial method.…”
Section: Related Work
confidence: 99%
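The adversarial training the excerpts describe (an encoder trained against a gender classifier so that gender becomes unrecoverable from the embedding) is commonly realized with a gradient reversal step: the adversary's gradient is flipped in sign before it reaches the encoder. A minimal numpy sketch of that sign flip, with all shapes, names, and the single-layer adversary being illustrative assumptions rather than the cited papers' actual architectures:

```python
import numpy as np

def grad_reverse(grad, lam=1.0):
    # Gradient reversal: identity in the forward pass,
    # -lam * grad in the backward pass.
    return -lam * grad

# Toy setup (hypothetical shapes): a batch of 4 speaker embeddings of
# dimension 8, and a linear adversary predicting a binary gender label.
rng = np.random.default_rng(0)
z = rng.normal(size=(4, 8))            # speaker embeddings
W = rng.normal(size=(8, 1)) * 0.1      # adversary weights
g = np.array([[0.0], [1.0], [0.0], [1.0]])  # gender labels

# Adversary forward pass: sigmoid + binary cross-entropy gradient.
logits = z @ W
p = 1.0 / (1.0 + np.exp(-logits))
grad_logits = (p - g) / len(g)         # dL/dlogits for BCE
grad_z_adv = grad_logits @ W.T         # gradient reaching the embedding

# The encoder receives the reversed gradient, so it is pushed to
# *increase* the adversary's loss, i.e. to hide gender information.
grad_z_encoder = grad_reverse(grad_z_adv)
```

In a full system the adversary and encoder are updated alternately (or jointly via the reversal layer), and `lam` trades off privacy against the primary speaker-verification objective.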
“…Later studies focused on removing specific speaker attributes instead of general speaker identity information. For example, Noé et al [18] used adversarial training to preserve the privacy of speakers' gender information in an automatic speaker verification system. While this study investigated a single attribute, in practice, there may be a need to conceal multiple types of information.…”
Section: B Adversarial Privacy-preserving Representations In Speech R...
confidence: 99%
“…Gender information is typically used to condition models preserving the identity of a speaker. However, only a handful of methods explicitly consider gender as a sensitive attribute to protect [7,8,9,10,11]. A hybrid model combining Variational Autoencoders and Generative Adversarial Networks (GANs) can be used to protect gender information through voice conversion with a disentanglement approach targeted for the speech recognition task [7].…”
Section: Introduction
confidence: 99%
“…Two encoders are trained to independently encode content and speaker identity information, which is then used to hide (or mask) gender information. Privacy methods that operate at the feature level have been used to disentangle gender information from x-vectors [12] with adversarial learning and an encoder-decoder based architecture [8]. Because this adversarial method removes the unwanted information at the level of the feature representation instead of the speech waveform, it is not useful for tasks such as speech recognition.…”
Section: Introduction
confidence: 99%