Augmented CycleGANs for Continuous Scale Normal-to-Lombard Speaking Style Conversion

Seshadri, Shreyas; Juvela, Lauri; Alku, Paavo; Räsänen, Okko

doi:10.21437/interspeech.2019-1681

Cited by 9 publications

(5 citation statements)

References 25 publications

“…Finally, generative adversarial networks (GANs), a special class of DNN architecture capable of learning a deterministic mapping from one style of stimulus to another (Goodfellow et al, 2014), are increasingly used to create visual transformations (e.g., smiles; W. Wang et al, 2018) and have also started to be applied to speech transformations. For example, GANs were recently used to transform a voice into its Lombard counterpart (a particular type of vocal effort which makes the voice more intelligible in background noise; Seshadri, Juvela, Alku, & Räsänen, 2019). All such advances open exciting new possibilities to create emotional voice and speech transformations, which will certainly find their way to the community in the upcoming years.…”

Section: A Prospective Note On Deep-learning Techniques (mentioning)

confidence: 99%

Beyond Correlation: Acoustic Transformation Methods for the Experimental Study of Emotional Voice and Speech

et al. 2020

While acoustic analysis methods have become a commodity in voice emotion research, experiments that attempt not only to describe but to computationally manipulate expressive cues in emotional voice and speech have remained relatively rare. We give here a nontechnical overview of voice-transformation techniques from the audio signal-processing community that we believe are ripe for adoption in this context. We provide sound examples of what they can achieve, examples of experimental questions for which they can be used, and links to open-source implementations. We point at a number of methodological properties of these algorithms, such as being specific, parametric, exhaustive, and real-time, and describe the new possibilities that these open for the experimental study of the emotional voice.

“…To overcome this, deep neural network approaches were implemented where the robustness of acoustic modeling is improved by efficient mapping between linguistic and acoustic features. Inspired by the success of adversarial generative models, Cycle-consistent adversarial networks (CycleGANs) showed promising results in terms of speech quality and the magnitude of the perceptual change between speech styles [11,12]. An extension to recurrent neural networks, particularly long short-term memory networks (LSTMs), was proposed that successfully adapted normal speaking style to Lombard style [13].…”

Section: Introduction (mentioning)

confidence: 99%

Enhancing Speech Intelligibility in Text-To-Speech Synthesis Using Speaking Style Conversion

Paul, Shifas, Pantazis et al. 2020

Interspeech 2020

The increased adoption of digital assistants makes text-to-speech (TTS) synthesis systems an indispensable feature of modern mobile devices. It is hence desirable to build a system capable of generating highly intelligible speech in the presence of noise. Past studies have investigated style conversion in TTS synthesis, yet degraded synthesized quality often leads to worse intelligibility. To overcome such limitations, we proposed a novel transfer learning approach using Tacotron and WaveRNN based TTS synthesis. The proposed speech system exploits two modification strategies: (a) Lombard speaking style data and (b) Spectral Shaping and Dynamic Range Compression (SSDRC) which has been shown to provide high intelligibility gains by redistributing the signal energy on the time-frequency domain. We refer to this extension as Lombard-SSDRC TTS system. Intelligibility enhancement as quantified by the Intelligibility in Bits (SIIB Gauss) measure shows that the proposed Lombard-SSDRC TTS system shows significant relative improvement between 110% and 130% in speech-shaped noise (SSN), and 47% to 140% in competing-speaker noise (CSN) against the state-of-the-art TTS approach. Additional subjective evaluation shows that Lombard-SSDRC TTS successfully increases the speech intelligibility with relative improvement of 455% for SSN and 104% for CSN in median keyword correction rate compared to the baseline TTS method.

“…Inspired by human speech production characteristics, some algorithms (e.g., [13], [14], [15]) aim to convert normal speech to Lombard speech [16], which is naturally produced by speakers with increased vocal effort for higher intelligibility. To achieve speaking style conversion, most algorithms rely on vocoder-based analysis-and-synthesis techniques, where vocoder features are transformed to fit in the Lombard style.…”

Section: Introduction (mentioning)

confidence: 99%

“…To achieve speaking style conversion, most algorithms rely on vocoder-based analysis-and-synthesis techniques, where vocoder features are transformed to fit in the Lombard style. For example, Seshadri et al [15] modified Mel-generalized cepstrum coefficients [17] of input speech to generate the Lombard-style speech by using log-domain pulse model vocoder [18]. However, using such a parametric vocoder inevitably degrades the converted speech quality.…”

Section: Introduction (mentioning)

confidence: 99%

Multi-Metric Optimization Using Generative Adversarial Networks for Near-End Speech Intelligibility Enhancement

Li, Yamagishi 2021

IEEE/ACM Trans. Audio Speech Lang. Process.

The intelligibility of speech severely degrades in the presence of environmental noise and reverberation. In this paper, we propose a novel deep learning based system for modifying the speech signal to increase its intelligibility under the equal-power constraint, i.e., signal power before and after modification must be the same. To achieve this, we use generative adversarial networks (GANs) to obtain time-frequency dependent amplification factors, which are then applied to the input raw speech to reallocate the speech energy. Instead of optimizing only a single, simple metric, we train a deep neural network (DNN) model to simultaneously optimize multiple advanced speech metrics, including both intelligibility- and quality-related ones, which results in notable improvements in performance and robustness. Our system can not only work in non-realtime mode for offline audio playback but also support practical real-time speech applications. Experimental results using both objective measurements and subjective listening tests indicate that the proposed system significantly outperforms state-of-the-art baseline systems under various noisy and reverberant listening conditions.
