Residual Information in Deep Speaker Embedding Architectures

Stan, Adriana

doi:10.3390/math10213927

Cited by 2 publications

(2 citation statements)

References 53 publications

“…To the best of our knowledge, there are no studies which explore how, and if, the choice of speaker representations and their learning strategies affect the synthesised output in multi-speaker TTS. One study which performed a related task on understanding what the SV-derived embeddings encompass is [17]. The study introduced an evaluation over six freely available neural speaker embedding architectures and the extent to which they encompass residual information related to other speech factors, such as F0, duration, signal-to-noise ratio, speaker gender, and linguistic content.…”

Section: Related Work (mentioning)

confidence: 99%

An analysis on the effects of speaker embedding choice in non auto-regressive TTS

Stan, O'Mahony

2023

12th ISCA Speech Synthesis Workshop (SSW2023)

In this paper we introduce a first attempt at understanding how a non-autoregressive factorised multi-speaker speech synthesis architecture exploits the information present in different speaker embedding sets. We analyse whether jointly learning the representations, and initialising them from pretrained models, yield any quality improvements for target speaker identities. In a separate analysis, we investigate how the different sets of embeddings impact the network's core speech abstraction (i.e. zero-conditioned) in terms of speaker identity and representation learning. We show that, regardless of the embedding set and learning strategy used, the network can handle various speaker identities equally well, with barely noticeable variations in speech output quality, and that speaker leakage within the core structure of the synthesis system is inevitable in the standard training procedures adopted thus far.

“…The choice for the TitaNet architecture is based on the results of [17] where multiple speaker embedding models were jointly analysed in an effort to determine the amount of residual information present within them. The TitaNet-derived embeddings showed some of the best performances.…”

Section: Speaker Embedding Sets and Multi-speaker Models (mentioning)

confidence: 99%