ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2019.8683282
ATTS2S-VC: Sequence-to-sequence Voice Conversion with Attention and Context Preservation Mechanisms

Abstract: This paper describes a method based on sequence-to-sequence (Seq2Seq) learning with attention and a context preservation mechanism for voice conversion (VC) tasks. Seq2Seq has been outstanding at numerous tasks involving sequence modeling, such as speech synthesis and recognition, machine translation, and image captioning. In contrast to current VC techniques, our method 1) stabilizes and accelerates the training procedure by considering guided attention and proposed context preservation losses, 2) allows not on…
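The abstract mentions a guided attention loss used to stabilize Seq2Seq training. The paper's exact formulation is not shown on this page; the sketch below follows the common guided-attention loss (a soft diagonal penalty on the attention matrix), with hypothetical function names and the usual width parameter g = 0.2 as assumptions.

```python
import numpy as np

def guided_attention_weight(N, T, g=0.2):
    # W[n, t] grows as attention strays from the diagonal n/N ≈ t/T.
    n = np.arange(N)[:, None] / N
    t = np.arange(T)[None, :] / T
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))

def guided_attention_loss(A, g=0.2):
    # A: attention matrix of shape (N, T); rows are attention distributions.
    # The loss is the mean attention mass placed far from the diagonal.
    N, T = A.shape
    return float(np.mean(A * guided_attention_weight(N, T, g)))
```

A monotonic (near-diagonal) alignment incurs almost no penalty, while diffuse attention is penalized, which is what encourages fast, stable alignment learning.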

Cited by 111 publications (97 citation statements); references 51 publications.
“…Note that here the output of our model is still of the same length as the input. Although sequence-to-sequence based models, which can generate output sequences of variable length, have been successfully applied to VC [20,21,22,23,24], we will show that considering only temporal dependencies can bring significant improvements to VAE-VC.…”
Section: Modeling Time Dependencies with the FCN Structure
confidence: 93%
“…In this paper we propose an end-to-end architecture that directly generates the target signal, synthesizing it from scratch. It is most similar to recent work on sequence-to-sequence voice conversion [16][17][18]. [16] uses a similar end-to-end model, conditioned on speaker identities, to transform word segments from multiple speakers into multiple target voices.…”
Section: Introduction
confidence: 98%
“…However, we do find it helpful to multitask train the model to predict source speech phonemes. Finally, in contrast to [18], we train the model without auxiliary alignment or auto-encoding losses.…”
Section: Introduction
confidence: 99%
“…For both parallel and non-parallel VC, the systems usually change the voice but are unable to change the duration of the utterance. Much recent research has focused on converting speaking rate along with voice by using sequence-to-sequence models [11,16,17,18], as speaking rate is also a speaker characteristic.…”
Section: Introduction
confidence: 99%