Interspeech 2018
DOI: 10.21437/interspeech.2018-1487
Whispered Speech to Neutral Speech Conversion Using Bidirectional LSTMs

Abstract: We propose a bidirectional long short-term memory (BLSTM) based whispered speech to neutral speech conversion system that employs the STRAIGHT speech synthesizer. We use a BLSTM to map the spectral features of whispered speech to those of neutral speech. Three other BLSTMs are employed to predict the pitch, periodicity levels and the voiced/unvoiced phoneme decisions from the spectral features of whispered speech. We use objective measures to quantify the quality of the predicted spectral features and excitati…
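The abstract outlines a four-network arrangement: one BLSTM maps whispered spectral features to neutral ones, and three more predict pitch, periodicity levels and voiced/unvoiced (V/UV) decisions from the same whispered spectral input. The sketch below (PyTorch) is only an illustration of that arrangement, not the authors' code; the hidden sizes, layer counts and feature dimensions are assumptions, since the excerpt does not give them.

```python
# A minimal sketch (assumed configuration, not the paper's) of the
# four-BLSTM setup described in the abstract.
import torch
import torch.nn as nn

class BLSTMRegressor(nn.Module):
    """Frame-wise sequence regressor built on a bidirectional LSTM."""
    def __init__(self, in_dim, out_dim, hidden=256, layers=2):
        super().__init__()
        self.blstm = nn.LSTM(in_dim, hidden, num_layers=layers,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, out_dim)  # 2x: both directions

    def forward(self, x):            # x: (batch, frames, in_dim)
        h, _ = self.blstm(x)
        return self.proj(h)          # (batch, frames, out_dim)

SPEC_DIM = 40  # assumed spectral feature dimension

spectral_mapper = BLSTMRegressor(SPEC_DIM, SPEC_DIM)  # whisper -> neutral spectra
f0_predictor    = BLSTMRegressor(SPEC_DIM, 1)         # pitch contour
ap_predictor    = BLSTMRegressor(SPEC_DIM, 5)         # periodicity levels (assumed 5 bands)
vuv_predictor   = BLSTMRegressor(SPEC_DIM, 1)         # V/UV decision logits

whisper_feats = torch.randn(8, 200, SPEC_DIM)         # dummy batch: 8 utterances, 200 frames
neutral_spec  = spectral_mapper(whisper_feats)
f0_hat        = f0_predictor(whisper_feats)
vuv_prob      = torch.sigmoid(vuv_predictor(whisper_feats))
```

In the paper's pipeline, the four predicted streams would then drive the STRAIGHT synthesizer to reconstruct the neutral-speech waveform, as the abstract describes.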

Cited by 15 publications (14 citation statements). References 17 publications.
“…Speech and whisper spectral envelopes can be mapped via Restricted Boltzmann Machines (RBM) [13], or converted to Mel Frequency Cepstrum Coefficients (MFCC) for regression with Gaussian Mixture Models (GMM) [11], [12]. Deep Neural Networks (DNN) [14], [15] and Bidirectional Long Short-Term Memory Networks (Bi-LSTM) [16] have also been used. The f0 and V/UV decisions are sometimes combined (where f0 = 0 means 'unvoiced') [11], [15], although performance improves when they are predicted separately using DNN [12], support vector machine (SVM), support vector regression (SVR) [13], or Bi-LSTM [16].…”
Section: A. Whisper-to-Speech Systems (citation type: mentioning, confidence: 99%)
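As an aside on the two target encodings this excerpt contrasts, the toy snippet below converts between a combined f0 track (f0 = 0 marking unvoiced frames) and separate continuous-f0 plus V/UV targets. The interpolation used to fill unvoiced frames is a common choice assumed here for illustration, not something prescribed by the cited papers.

```python
import numpy as np

def split_targets(f0_combined):
    """Combined f0 track -> (continuous f0 contour, binary V/UV mask)."""
    vuv = (f0_combined > 0).astype(np.float32)
    voiced = np.flatnonzero(vuv)
    if voiced.size == 0:
        return np.zeros_like(f0_combined, dtype=float), vuv
    # Interpolate over voiced frames so the regression target is defined
    # everywhere (an assumed, commonly used convention).
    f0_cont = np.interp(np.arange(len(f0_combined)), voiced, f0_combined[voiced])
    return f0_cont, vuv

def merge_targets(f0_continuous, vuv, threshold=0.5):
    """(continuous f0, V/UV probability) -> combined f0 track with f0 = 0 as unvoiced."""
    return np.where(vuv > threshold, f0_continuous, 0.0)

f0_track = np.array([0, 0, 120.0, 123.0, 0, 118.0, 0, 0])
f0_cont, vuv_mask = split_targets(f0_track)
```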
“…Deep Neural Networks (DNN) [14], [15] and Bidirectional Long Short-Term Memory Networks (Bi-LSTM) [16] have also been used. The f0 and V/UV decisions are sometimes combined (where f0 = 0 means 'unvoiced') [11], [15], although performance improves when they are predicted separately using DNN [12], support vector machine (SVM), support vector regression (SVR) [13], or Bi-LSTM [16]. Finally, the STRAIGHT vocoder [28] has been used to generate mixed-excitation when aperiodicity components are available [11], [12], [16], but pulse trains are used when no aperiodicity components exist.…”
Section: A. Whisper-to-Speech Systems (citation type: mentioning, confidence: 99%)
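The last sentence of this excerpt mentions the pulse-train fallback used when no aperiodicity components are available. The toy function below generates such an excitation from per-frame f0 and V/UV values; it is a simplified stand-in for illustration, not STRAIGHT's actual mixed-excitation model, and the frame period and noise gain are arbitrary assumptions.

```python
import numpy as np

def pulse_train_excitation(f0, vuv, frame_period_ms=5.0, fs=16000):
    """Pulse train at the predicted f0 for voiced frames, weak noise otherwise."""
    hop = int(fs * frame_period_ms / 1000)        # samples per frame
    excitation = np.zeros(len(f0) * hop)
    phase = 0.0
    for i, (pitch, voiced) in enumerate(zip(f0, vuv)):
        if voiced and pitch > 0:
            # place a pulse each time the accumulated phase passes one period
            for n in range(hop):
                phase += pitch / fs
                if phase >= 1.0:
                    excitation[i * hop + n] = 1.0
                    phase -= 1.0
        else:
            excitation[i * hop:(i + 1) * hop] = 0.01 * np.random.randn(hop)
            phase = 0.0
    return excitation

exc = pulse_train_excitation(f0=np.array([0, 120.0, 122.0, 0]),
                             vuv=np.array([0, 1, 1, 0]))
```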
“…SSC can also be used for generation of context-dependent speech samples from a limited set of original recordings for recreational applications such as gaming and virtual reality. While there has already been work in whispered-to-normal speech conversion (e.g., [4][5][6][7][8]), SSC for other aspects of vocal effort has only been studied in a small number of previous works [9][10][11][12][13] that only focus on direct signal manipulation or parallel data training.…”
Section: Introduction (citation type: mentioning, confidence: 99%)
“…SSC has been previously studied in whisper-to-normal conversion [3][4][5] and in normal-to-Lombard conversion [6][7][8]. In addition, a parametric approach to normal-to-Lombard SSC was recently explored in [9], where a vocoder was used to extract frame-level features that were then transformed from normal to Lombard style using parallel data-driven mapping models, and then synthesized as speech in the target style using the same vocoder.…”
Section: Introduction (citation type: mentioning, confidence: 99%)
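The analysis-mapping-synthesis pipeline this excerpt describes can be summarised schematically as below; the vocoder and mapping-model callables are hypothetical placeholders, since the excerpt does not name a specific toolkit.

```python
def convert_style(waveform, sample_rate, mapping_model,
                  vocoder_analyze, vocoder_synthesize):
    """Schematic parametric SSC: analyze, map frame-level features, resynthesize.

    All three callables are hypothetical placeholders for illustration.
    """
    # 1. Analysis: frame-level vocoder parameters (e.g. spectrum, f0, aperiodicity).
    features = vocoder_analyze(waveform, sample_rate)
    # 2. Mapping: model trained on parallel data predicts the same
    #    parameters in the target speaking style (e.g. normal -> Lombard).
    mapped = mapping_model(features)
    # 3. Synthesis: the same vocoder reconstructs speech in the target style.
    return vocoder_synthesize(mapped, sample_rate)
```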