Interspeech 2018
DOI: 10.21437/interspeech.2018-1487
Whispered Speech to Neutral Speech Conversion Using Bidirectional LSTMs

Abstract: We propose a bidirectional long short-term memory (BLSTM) based whispered speech to neutral speech conversion system that employs the STRAIGHT speech synthesizer. We use a BLSTM to map the spectral features of whispered speech to those of neutral speech. Three other BLSTMs are employed to predict the pitch, periodicity levels and the voiced/unvoiced phoneme decisions from the spectral features of whispered speech. We use objective measures to quantify the quality of the predicted spectral features and excitati…
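The abstract outlines a four-network arrangement: one BLSTM maps whispered spectral features to neutral ones, and three more predict pitch, periodicity levels and voiced/unvoiced (V/UV) decisions from the same whispered spectral input. The sketch below (PyTorch) is only an illustration of that arrangement, not the authors' code; the hidden sizes, layer counts and feature dimensions are assumptions, since the excerpt does not give them.

```python
# A minimal sketch (assumed configuration, not the paper's) of the
# four-BLSTM setup described in the abstract.
import torch
import torch.nn as nn

class BLSTMRegressor(nn.Module):
    """Frame-wise sequence regressor built on a bidirectional LSTM."""
    def __init__(self, in_dim, out_dim, hidden=256, layers=2):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden, num_layers=layers,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, out_dim)  # 2x: both directions

    def forward(self, x):            # x: (batch, frames, in_dim)
        h, _ = self.blstm(x)
        return self.proj(h)          # (batch, frames, out_dim)

SPEC_DIM = 40  # assumed spectral feature dimension

spectral_mapper = BLSTMRegressor(SPEC_DIM, SPEC_DIM)  # whisper -> neutral spectra
f0_predictor    = BLSTMRegressor(SPEC_DIM, 1)         # pitch contour
ap_predictor    = BLSTMRegressor(SPEC_DIM, 5)         # periodicity levels (assumed 5 bands)
vuv_predictor   = BLSTMRegressor(SPEC_DIM, 1)         # V/UV decision logits

whisper_feats = torch.randn(8, 200, SPEC_DIM)         # dummy batch: 8 utterances, 200 frames
neutral_spec  = spectral_mapper(whisper_feats)
f0_hat        = f0_predictor(whisper_feats)
vuv_prob      = torch.sigmoid(vuv_predictor(whisper_feats))
```

In the paper's pipeline, the four predicted streams would then drive the STRAIGHT synthesizer to reconstruct the neutral-speech waveform, as the abstract describes.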

Cited by 15 publications (14 citation statements). References 17 publications.
“…Speech and whisper spectral envelopes can be mapped via Restricted Boltzmann Machines (RBM) [13], or converted to Mel Frequency Cepstrum Coefficients (MFCC) for regression with Gaussian Mixture Models (GMM) [11], [12]. Deep Neural Networks (DNN) [14], [15] and Bidirectional Long Short-Term Memory Networks (Bi-LSTM) [16] have also been used. The f0 and V/UV decisions are sometimes combined (where f0 = 0 means 'unvoiced') [11], [15], although performance improves when they are predicted separately using DNN [12], support vector machine (SVM), support vector regression (SVR) [13], or Bi-LSTM [16].…”
Section: A. Whisper-to-Speech Systems (citation type: mentioning, confidence: 99%)
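As an aside on the two target encodings this excerpt contrasts, the toy snippet below converts between a combined f0 track (f0 = 0 marking unvoiced frames) and separate continuous-f0 plus V/UV targets. The interpolation used to fill unvoiced frames is a common choice assumed here for illustration, not something prescribed by the cited papers.

```python
import numpy as np

def split_targets(f0_combined):
    """Combined f0 track -> (continuous f0 contour, binary V/UV mask)."""
    vuv = (f0_combined > 0).astype(np.float32)
    voiced = np.flatnonzero(vuv)
    if voiced.size == 0:
        return np.zeros_like(f0_combined, dtype=float), vuv
    # Interpolate over voiced frames so the regression target is defined
    # everywhere (an assumed, commonly used convention).
    f0_cont = np.interp(np.arange(len(f0_combined)), voiced, f0_combined[voiced])
    return f0_cont, vuv

def merge_targets(f0_continuous, vuv, threshold=0.5):
    """(continuous f0, V/UV probability) -> combined f0 track with f0 = 0 as unvoiced."""
    return np.where(vuv > threshold, f0_continuous, 0.0)

f0_track = np.array([0, 0, 120.0, 123.0, 0, 118.0, 0, 0])
f0_cont, vuv_mask = split_targets(f0_track)
```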
“…Deep Neural Networks (DNN) [14], [15] and Bidirectional Long Short-Term Memory Networks (Bi-LSTM) [16] have also been used. The f0 and V/UV decisions are sometimes combined (where f0 = 0 means 'unvoiced') [11], [15], although performance improves when they are predicted separately using DNN [12], support vector machine (SVM), support vector regression (SVR) [13], or Bi-LSTM [16]. Finally, the STRAIGHT vocoder [28] has been used to generate mixed-excitation when aperiodicity components are available [11], [12], [16], but pulse trains are used when no aperiodicity components exist.…”
Section: A. Whisper-to-Speech Systems (citation type: mentioning, confidence: 99%)
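The last sentence of this excerpt mentions the pulse-train fallback used when no aperiodicity components are available. The toy function below generates such an excitation from per-frame f0 and V/UV values; it is a simplified stand-in for illustration, not STRAIGHT's actual mixed-excitation model, and the frame period and noise gain are arbitrary assumptions.

```python
import numpy as np

def pulse_train_excitation(f0, vuv, frame_period_ms=5.0, fs=16000):
    """Pulse train at the predicted f0 for voiced frames, weak noise otherwise."""
    hop = int(fs * frame_period_ms / 1000)        # samples per frame
    excitation = np.zeros(len(f0) * hop)
    phase = 0.0
    for i, (pitch, voiced) in enumerate(zip(f0, vuv)):
        if voiced and pitch > 0:
            # place a pulse each time the accumulated phase passes one period
            for n in range(hop):
                phase += pitch / fs
                if phase >= 1.0:
                    excitation[i * hop + n] = 1.0
                    phase -= 1.0
        else:
            excitation[i * hop:(i + 1) * hop] = 0.01 * np.random.randn(hop)
            phase = 0.0
    return excitation

exc = pulse_train_excitation(f0=np.array([0, 120.0, 122.0, 0]),
                             vuv=np.array([0, 1, 1, 0]))
```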
“…SSC can also be used for generation of context-dependent speech samples from a limited set of original recordings for recreational applications such as gaming and virtual reality. While there has already been work in whispered-to-normal speech conversion (e.g., [4][5][6][7][8]), SSC for other aspects of vocal effort has only been studied in a small number of previous works [9][10][11][12][13] that only focus on direct signal manipulation or parallel data training.…”
Section: Introduction (citation type: mentioning, confidence: 99%)
“…SSC has been previously studied in whisper-to-normal conversion [3][4][5] and in normal-to-Lombard conversion [6][7][8]. In addition, a parametric approach to normal-to-Lombard SSC was recently explored in [9], where a vocoder was used to extract frame-level features that were then transformed from normal to Lombard style using parallel data-driven mapping models, and then synthesized as speech in the target style using the same vocoder.…”
Section: Introduction (citation type: mentioning, confidence: 99%)
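The analysis-mapping-synthesis pipeline this excerpt describes can be summarised schematically as below; the vocoder and mapping-model callables are hypothetical placeholders, since the excerpt does not name a specific toolkit.

```python
def convert_style(waveform, sample_rate, mapping_model,
                  vocoder_analyze, vocoder_synthesize):
    """Schematic parametric SSC: analyze, map frame-level features, resynthesize.

    All three callables are hypothetical placeholders for illustration.
    """
    # 1. Analysis: frame-level vocoder parameters (e.g. spectrum, f0, aperiodicity).
    features = vocoder_analyze(waveform, sample_rate)
    # 2. Mapping: model trained on parallel data predicts the same
    #    parameters in the target speaking style (e.g. normal -> Lombard).
    mapped = mapping_model(features)
    # 3. Synthesis: the same vocoder reconstructs speech in the target style.
    return vocoder_synthesize(mapped, sample_rate)
```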