2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2015.7178816
Unidirectional long short-term memory recurrent neural network with recurrent output layer for low-latency speech synthesis

Abstract: Long short-term memory recurrent neural networks (LSTM-RNNs) have been applied to various speech applications including acoustic modeling for statistical parametric speech synthesis. One of the concerns in applying them to text-to-speech applications is their effect on latency. To address this concern, this paper proposes a low-latency, streaming speech synthesis architecture using unidirectional LSTM-RNNs with a recurrent output layer. The use of a unidirectional RNN architecture allows frame-synchronous streaming…
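As a rough illustration of the architecture the abstract describes, the sketch below stacks a unidirectional LSTM under a recurrent output layer and runs it frame by frame so that synthesis can stream without look-ahead. This is an assumption-laden sketch, not the authors' implementation: the layer sizes, the use of PyTorch, and the tanh-based `nn.RNN` standing in for the paper's recurrent (linear) output layer are all hypothetical.

```python
# Minimal sketch (NOT the paper's implementation): a unidirectional LSTM
# acoustic model followed by a recurrent output layer, with hypothetical sizes.
import torch
import torch.nn as nn

class StreamingLSTMSynthesizer(nn.Module):
    def __init__(self, ling_dim=586, hidden_dim=256, acoustic_dim=127):
        super().__init__()
        # Unidirectional LSTM: only past context is used, which enables
        # frame-synchronous streaming inference with no look-ahead latency.
        self.lstm = nn.LSTM(ling_dim, hidden_dim, num_layers=2, batch_first=True)
        # Recurrent output layer: feedback over the outputs stands in for
        # explicitly computed dynamic (delta) features.  nn.RNN (tanh) is an
        # approximation of the paper's recurrent linear output layer.
        self.out_rnn = nn.RNN(hidden_dim, acoustic_dim, batch_first=True)

    def forward(self, ling_feats, state=None):
        # ling_feats: (batch, frames, ling_dim)
        lstm_state, out_state = state if state is not None else (None, None)
        h, lstm_state = self.lstm(ling_feats, lstm_state)
        y, out_state = self.out_rnn(h, out_state)
        return y, (lstm_state, out_state)

# Streaming usage: feed one frame at a time, carrying the state forward.
model = StreamingLSTMSynthesizer()
state = None
for frame in torch.randn(100, 1, 1, 586):      # 100 dummy input frames
    acoustic, state = model(frame, state)      # acoustic: (1, 1, acoustic_dim)
```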

Cited by 244 publications (210 citation statements) | References 28 publications
“…The specifics of the training method will be discussed in section 4. Note that output layers are also recurrent, so that dynamic features are not computed because feedback connections within the layer keep track of the dynamic evolution of outputs [7]. The intuition behind this architecture is that, whilst every output branch is trained, it shares the first linguistic mappings with other branches.…”
Section: Proposed Architecture
confidence: 99%
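For context on what the recurrent output layer replaces, the snippet below sketches the conventional delta-feature computation used in statistical parametric synthesis, i.e. the dynamic features the excerpt says need not be computed. The regression window and dimensions are illustrative assumptions, not values from the cited papers.

```python
# Sketch of conventional dynamic (delta) features; a recurrent output layer
# tracks this temporal evolution implicitly instead.
import numpy as np

def add_dynamic_features(static, window=(-0.5, 0.0, 0.5)):
    """Append delta features computed by applying a regression window
    over time to each static trajectory (frames x dims)."""
    padded = np.pad(static, ((1, 1), (0, 0)), mode="edge")
    delta = sum(w * padded[i:i + len(static)] for i, w in enumerate(window))
    return np.hstack([static, delta])

c = np.random.randn(200, 40)          # e.g. 200 frames of 40-dim spectra
features = add_dynamic_features(c)    # -> (200, 80): statics + deltas
```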
“…In the case of speech synthesis, many works have used DNNs and DBNs to perform acoustic mappings and prosody prediction [2,3,4]. Also, Recurrent Neural Networks (RNNs) and their variants, such as the Long Short-Term Memory (LSTM) architecture [5], have fully addressed the sequence processing and prediction problem, which leads to interesting results in the speech synthesis field, where an acoustic signal of variable length has to be generated from a set of textual entities. Example works using these structures can be seen in [6,7,8,9]. Prior to deep learning, existing text-to-speech technologies included unit selection speech synthesis [10] and statistical parametric speech synthesis (SPSS) [11].…”
Section: Introduction
confidence: 99%
“…(24) corresponds to the sum of squares of the inverse system output. The definition of the linguistic feature vector used in this paper can be found in [6] and [19]. Log likelihoods of trained LSTM-RNNs were evaluated over both training and development subsets (60,000 samples).…”
Section: By Assuming
confidence: 99%
“…The training and development data sets consisted of 34,632 and 100 utterances, respectively. A speaker-dependent unidirectional LSTM-RNN [19] was trained. From the speech data, its associated transcriptions, and automatically derived phonetic alignments, sample-level linguistic features included 535 linguistic contexts, 50 numerical features for the coarse-coded position of the current sample in the current phoneme, and one numerical feature for the duration of the current phoneme.…”
Section: Experimental Conditions
confidence: 99%
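The excerpt above fixes the input dimensionality (535 + 50 + 1 = 586 values per sample). The sketch below shows one plausible way such a vector could be assembled; the Gaussian coarse-coding, function names, and example values are assumptions for illustration only.

```python
# Hypothetical assembly of a 586-dim sample-level linguistic feature vector:
# 535 binary contexts + 50 coarse-coded position features + 1 duration feature.
import numpy as np

def coarse_code(position, dim=50, width=0.05):
    """Encode a position in [0, 1] with Gaussian basis functions (assumed scheme)."""
    centers = np.linspace(0.0, 1.0, dim)
    return np.exp(-0.5 * ((position - centers) / width) ** 2)

def linguistic_vector(binary_contexts, pos_in_phoneme, phoneme_duration):
    assert binary_contexts.shape == (535,)
    return np.concatenate([
        binary_contexts,                 # 535 linguistic contexts
        coarse_code(pos_in_phoneme),     # 50 coarse-coded position features
        [phoneme_duration],              # 1 phoneme-duration feature
    ])

x = linguistic_vector(np.random.randint(0, 2, 535).astype(float), 0.3, 0.12)
print(x.shape)  # (586,)
```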
“…The disadvantage of using such networks is that they cannot directly model the dependence of each set of parameters in a sequence on the preceding ones, which is desirable to mimic the production of human speech. To solve this problem, it has been suggested to use RNNs [21,22], in which some of the neurons feed back to earlier layers or to themselves, forming a kind of memory that retains previous states.…”
Section: Long Short-Term Memory Recurrent Neural Network
confidence: 99%
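A minimal numpy sketch of the feedback idea described above: the hidden state computed at one time step re-enters the cell at the next step, acting as a memory of previous states. Dimensions, weights, and the plain (Elman-style) cell are illustrative assumptions, not the LSTM used in the cited work.

```python
# Minimal recurrent cell showing feedback of the hidden state across time.
import numpy as np

rng = np.random.default_rng(0)
in_dim, hid_dim = 8, 16
W_in = rng.standard_normal((hid_dim, in_dim)) * 0.1
W_rec = rng.standard_normal((hid_dim, hid_dim)) * 0.1   # recurrent (feedback) weights
b = np.zeros(hid_dim)

h = np.zeros(hid_dim)                          # memory carried across time steps
for x_t in rng.standard_normal((20, in_dim)):  # 20 input frames
    h = np.tanh(W_in @ x_t + W_rec @ h + b)    # previous state fed back in
```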