2022
DOI: 10.3390/math10213927

Residual Information in Deep Speaker Embedding Architectures

Abstract: Speaker embeddings represent a means to extract representative vectorial representations from a speech signal such that the representation pertains to the speaker identity alone. The embeddings are commonly used to classify and discriminate between different speakers. However, there is no objective measure to evaluate the ability of a speaker embedding to disentangle the speaker identity from the other speech characteristics. This means that the embeddings are far from ideal, highly dependent on the training c…

Cited by 2 publications
(2 citation statements)
References 53 publications
“…To the best of our knowledge, there are no studies which explore how, and if, the choice of speaker representations and their learning strategies affect the synthesised output in multi-speaker TTS. One study which performed a related task on understanding what the SV-derived embeddings encompass is [17]. The study introduced an evaluation over six freely available neural speaker embedding architectures and the extent to which they encompass residual information related to other speech factors, such as F0, duration, signal-to-noise ratio, speaker gender, and linguistic content.…”
Section: Related Work (mentioning)
confidence: 99%
“…The choice for the TitaNet architecture is based on the results of [17] where multiple speaker embedding models were jointly analysed in an effort to determine the amount of residual information present within them. The TitaNet-derived embeddings showed some of the best performances.…”
Section: Speaker Embedding Sets and Multi-speaker Models (mentioning)
confidence: 99%