Transmutative voice conversion

Mohammadi, Seyed Hamidreza; Kain, Alexander

doi:10.1109/icassp.2013.6639003

Cited by 9 publications

(4 citation statements)

References 24 publications


“…We evaluate the two approaches by imposing the F0 contours generated by the two approaches onto recorded natural speech, thereby ensuring that the comparison strictly focused on the quality of the F0 contours and is not affected by other aspects of the synthesis process [27]. To ensure that the F0 contours are properly aligned with the phonetic segment boundaries of the natural utterance, the contours are time warped so that the predicted phonetic segment boundaries correspond to the segment boundaries of the natural utterance.…”

Section: Discussion (mentioning)

confidence: 99%

Foot-based intonation for text-to-speech synthesis using neural networks

Langarani

Santen

2016

Speech Prosody 2016


We propose a method ("FONN") for F0 contour generation for text-to-speech synthesis. Training speech is automatically segmented into left-headed feet, annotated with syllable start/end times, foot position in the sentence, and the number of syllables in the foot. During training, we fit a superpositional intonation model comprising accent curves associated with feet and phrase curves. We propose to use a neural network for model parameter estimation. We tested the method against the HMM-based Speech Synthesis System (HTS) as well as against a template based variant of FONN ("DRIFT") by imposing contours generated by the methods onto natural speech and obtaining quality ratings. Test sets varied in degree of coverage by training data. Contours generated by DRIFT and FONN were strongly preferred over HTS-generated contours, especially for poorly-covered test items, with DRIFT slightly preferred over FONN. We conclude that the new methods hold promise for high-quality F0 contour generation while making efficient use of training data.
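The citation statement above describes time warping generated F0 contours so that predicted phonetic segment boundaries coincide with those of the natural utterance. A minimal sketch of one way to do this follows, assuming a piecewise-linear warp between matching boundaries; the cited paper does not state which warping function it uses, and the names `warp_time` and `warp_contour` are hypothetical.

```python
from bisect import bisect_right


def warp_time(t, src_bounds, dst_bounds):
    """Map time t from the predicted segmentation onto the natural one.

    Assumption: the warp is linear inside each segment. src_bounds and
    dst_bounds hold the same number of boundaries, in seconds.
    """
    if len(src_bounds) != len(dst_bounds) or len(src_bounds) < 2:
        raise ValueError("need matching boundary lists with at least 2 entries")
    if any(b <= a for a, b in zip(src_bounds, src_bounds[1:])):
        raise ValueError("src_bounds must be strictly increasing")
    # Clamp to the covered range, then locate the segment containing t.
    t = min(max(t, src_bounds[0]), src_bounds[-1])
    i = min(bisect_right(src_bounds, t) - 1, len(src_bounds) - 2)
    s0, s1 = src_bounds[i], src_bounds[i + 1]
    d0, d1 = dst_bounds[i], dst_bounds[i + 1]
    frac = (t - s0) / (s1 - s0)  # relative position inside the segment
    return d0 + frac * (d1 - d0)


def warp_contour(times, f0, src_bounds, dst_bounds):
    """Return (warped_time, f0) pairs; F0 values themselves are unchanged."""
    return [(warp_time(t, src_bounds, dst_bounds), v) for t, v in zip(times, f0)]


# Illustrative values only: two predicted segments mapped onto natural ones.
predicted = [0.0, 0.1, 0.3]
natural = [0.0, 0.2, 0.4]
warped = warp_contour([0.05, 0.1, 0.2], [120.0, 130.0, 125.0], predicted, natural)
```

Only the time axis is modified here, which matches the stated goal of comparing F0 contours without altering other aspects of the speech.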



“…Therefore, the statistical averaging effect, which reflects the central tendency of speech features, could introduce oversmoothing [24,34,35]. Frequency warping methods take the physical principles into consideration and aim to warp the frequency axis of the amplitude spectrum to the source speaker to match that of the target speaker [36][37][38][39][40][41]. In this way, the frequency warping methods are able to keep more spectral details and produce high-quality converted speech.…”

Section: Spectral Mapping (mentioning)

confidence: 99%

Voice conversion versus speaker verification: an overview

2014

SIP


A speaker verification system automatically accepts or rejects a claimed identity of a speaker based on a speech sample. Recently, a major progress was made in speaker verification which leads to mass market adoption, such as in smartphone and in online commerce for user authentication. A major concern when deploying speaker verification technology is whether a system is robust against spoofing attacks. Speaker verification studies provided us a good insight into speaker characterization, which has contributed to the progress of voice conversion technology. Unfortunately, voice conversion has become one of the most easily accessible techniques to carry out spoofing attacks; therefore, presents a threat to speaker verification systems. In this paper, we will briefly introduce the fundamentals of voice conversion and speaker verification technologies. We then give an overview of recent spoofing attack studies under different conditions with a focus on voice conversion spoofing attack. We will also discuss anti-spoofing attack measures for speaker verification.


“…The second type of FW method defines the warping function by a sequence of aligned frequency axis pairs. Dynamic frequency warping (DFW) technique was proposed in [12,13,14] to minimize the spectral distance between the source and target spectra. This method operates on the high-dimensional spectral feature directly and is able to achieve low spectral distortion.…”

Section: Introduction (mentioning)

confidence: 99%

“…However, the conversion quality is moderate because the slopes of spectra are not considered. In [15,16,14], low-dimensional spectral features representing the formant positions, were used to train the FW functions. A combination of statistical method and FW method was proposed in [17].…”

Section: Introduction (mentioning)

confidence: 99%

Correlation-based frequency warping for voice conversion

Tian

Lee

et al. 2014

The 9th International Symposium on Chinese Spoken Language Processing


Frequency warping (FW) based voice conversion aims to modify the frequency axis of source spectra towards that of the target. In previous works, the optimal warping function was calculated by minimizing the spectral distance of converted and target spectra without considering the spectral shape. Nevertheless, speaker timbre and identity greatly depend on vocal tract peaks and valleys of spectrum. In this paper, we propose a method to define the warping function by maximizing the correlation between the converted and target spectra. Different from the conventional warping methods, the correlation-based optimization is not determined by the magnitude of the spectra. Instead, both spectral peaks and valleys are considered in the optimization process, which also improves the performance of amplitude scaling. Experiments were conducted on VOICES database, and the results show that after amplitude scaling our proposed method reduced the mel-spectral distortion from 5.85 dB to 5.60 dB. The subjective listening tests also confirmed the effectiveness of the proposed method.
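The abstract above describes choosing a warping function by maximizing the correlation between converted and target spectra. A minimal sketch of that idea follows, under strong simplifying assumptions that are not taken from the paper: the warp is a single linear factor `alpha`, it is picked by grid search over a hypothetical candidate list, and spectra are plain lists of magnitudes on a uniform frequency grid. The function names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation; invariant to scaling and offset of either input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a flat spectrum carries no shape information
    return cov / math.sqrt(vx * vy)


def warp_spectrum(spec, alpha):
    """Linear frequency warp (assumed form): output bin k reads the source
    at position k / alpha, so alpha > 1 moves spectral peaks upward."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    n = len(spec)
    out = []
    for k in range(n):
        pos = min(k / alpha, n - 1)  # clamp at the top of the band
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append((1 - frac) * spec[lo] + frac * spec[hi])  # linear interpolation
    return out


def best_alpha(source, target, candidates):
    """Pick the candidate factor whose warped source correlates best with target."""
    return max(candidates, key=lambda a: pearson(warp_spectrum(source, a), target))


# Illustrative data: one synthetic spectral peak, and a target that is a
# warped, rescaled copy of it.
source = [math.exp(-((k - 10) ** 2) / 18.0) for k in range(64)]
target = [2.0 * v + 0.5 for v in warp_spectrum(source, 1.2)]
chosen = best_alpha(source, target, [0.8, 0.9, 1.0, 1.1, 1.2, 1.3])
```

Because the correlation ignores overall gain and offset, the selected factor depends on where peaks and valleys fall rather than on spectral magnitude, which is the property the abstract highlights. Amplitude scaling would be a separate step after warping.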

