A comparative study of voice conversion techniques: A review

Ezzine, Kadria; Frikha, Mounir

doi:10.1109/atsip.2017.8075528

Cited by 3 publications

(5 citation statements)

References 33 publications

“…Voice conversion (VC) [1] aims to modify a speech signal uttered by a source speaker to sound as if it is uttered by a target speaker while retaining the linguistic information. Various approaches have been proposed [2] for voice conversion.…”

Section: Introduction (mentioning)

confidence: 99%

“…Various approaches have been proposed [2] for voice conversion. As parallel data [1] is expensive to collect, non-parallel methods [3,4,5] have received significant attention. Among them, phonetic posteriorgram (PPG) [5] based method is one of the most popular implementations.…”

Section: Introduction (mentioning)

confidence: 99%

“…Despite recent progress, modeling prosody from expressive speech [10] for style transfer with voice conversion framework is still a challenging task. Besides linguistic information, transferring the source prosody to the target is vital for many voice conversion tasks, including automatic dubbing for movies in which conversations are emotional in nature. Modeling prosody is not a trivial task; furthermore, it is necessary to remove speaker and content related information from prosody representation.…”

Section: Introduction (mentioning)

confidence: 99%

IQDUBBING: Prosody modeling based on discrete self-supervised speech representation for expressive voice conversion

Gan, Wen, Chen et al. 2022

Preprint

Prosody modeling is important but still challenging in expressive voice conversion. Prosody is difficult to model, and other factors that are entangled with prosody in speech, e.g., speaker, environment, and content, should be removed in prosody modeling. In this paper, we present IQDubbing to solve this problem for expressive voice conversion. To model prosody, we leverage the recent advances in discrete self-supervised speech representation (DSSR). Specifically, a prosody vector is first extracted from a pre-trained VQ-Wav2Vec model, where rich prosody information is embedded while most speaker and environment information is removed effectively by quantization. To further filter out redundant information other than prosody, such as content and partial speaker information, we propose two kinds of prosody filters to sample prosody from the prosody vector. Experiments show that IQDubbing is superior to baseline and comparison systems in terms of speech quality while maintaining prosody consistency and speaker similarity.

“…Due to the extensive use of the esophageal voice by laryngectomees, this type of voice has been the subject of numerous studies in the last few years. To our knowledge, the existing approaches for ES quality improvements can be summarized into three categories: approaches based on the transformation of acoustic features, such as formant synthesis [4], comb filtering [5], and smoothing of acoustic parameters [6]; approaches based on statistical techniques, where [7][8][9] have been carried out, and approaches based on the VC technique, which allows for the transformation of the voice of a source speaker (laryngectomee) into that of a target speaker (laryngeal) [10][11][12][13][14][15][16]. Although these approaches have of course improved the estimation of the acoustic characteristics to reconstruct a converted signal with better quality, the improvements in intelligibility and naturalness are still insufficient.…”

Section: Introduction (mentioning)

confidence: 99%

Intelligibility Improvement of Esophageal Speech Using Sequence-to-Sequence Voice Conversion with Auditory Attention

2022

Self Cite

Laryngectomees are individuals whose larynx has been surgically removed, usually due to laryngeal cancer. The immediate consequence of this operation is that these individuals (laryngectomees) are unable to speak. Esophageal speech (ES) remains the preferred alternative speaking method for laryngectomees. However, compared to the laryngeal voice, ES is characterized by low intelligibility and poor quality due to chaotic fundamental frequency F0, specific noises, and low intensity. Our proposal to solve these problems is to take advantage of voice conversion as an effective way to improve speech quality and intelligibility. To this end, we propose in this work a novel esophageal–laryngeal voice conversion (VC) system based on a sequence-to-sequence (Seq2Seq) model combined with an auditory attention mechanism. The originality of the proposed framework is that it adopts an auditory attention technique in our model, which leads to more efficient and adaptive feature mapping. In addition, our VC system does not require the classical DTW alignment process during the learning phase, which avoids erroneous mappings and significantly reduces the computational time. Moreover, to preserve the identity of the target speaker, the excitation and phase coefficients are estimated by querying a binary search tree. In experiments, objective and subjective tests confirmed that the proposed approach performs better even in some difficult cases in terms of speech quality and intelligibility.

“…For decades, the voice has attracted considerable attention from researchers. In speech processing, several areas emerge, such as spoken language recognition [13], automatic speech recognition [14], speaker verification [3], emotion recognition [22], speech understanding [11], voice transformation [21] or conversion [5]. Research efforts in this quite diverse list of areas share one common trait, in terms of the raw material being worked on: most focus on natural voice recordings -spontaneous or read speech, telephone recordings, or speech resulting from human-machine dialogues (through, for example, voice assistants).…”

Section: Introduction (mentioning)

confidence: 99%

Assessing Speaker-Independent Character Information for Acted Voices

Quillot, Dufour, Bonastre 2021

Speech and Computer

While the natural voice is spontaneously generated by people, the acted voice is a controlled vocal interpretation, produced by professional actors and aimed at creating a desired effect on the listener. In this work, we pay attention to the aspects of the voice related to the character played. We particularly focus on actors playing the same video game role in different languages. This article is based on a recent work which proposes to build a neural-network-based voice representation dedicated to the character aspects, namely p-vector. This representation is learnt from recordings only labeled with the acted character. It showed its ability to associate two vocal examples related to the same character, even if the character is unknown during the training phase. However, there is still a possible confusion between speaker and character dimension. To tackle this problem, we propose a protocol to highlight the speaker-independent part of the character information (SICI). We compare the original voice representation with an alternative where the information relating to the characters is neutralised. This experiment shows that performance is not a sufficient metric to assess the quality of a character representation. It also offers the first evidence of the SICI in the voice.
