Interspeech 2016
DOI: 10.21437/interspeech.2016-522

Fast, Compact, and High Quality LSTM-RNN Based Statistical Parametric Speech Synthesizers for Mobile Devices

Abstract: Acoustic models based on long short-term memory recurrent neural networks (LSTM-RNNs) were applied to statistical parametric speech synthesis (SPSS) and showed significant improvements in naturalness and latency over those based on hidden Markov models (HMMs). This paper describes further optimizations of LSTM-RNN-based SPSS for deployment on mobile devices: weight quantization, multi-frame inference, and robust inference using an ε-contaminated Gaussian loss function. Experimental results in subjective listening…
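The ε-contaminated Gaussian loss mentioned in the abstract is a robust training objective: the target is modelled as a mixture of a Gaussian and a small fraction ε of a broad "outlier" component, so occasional bad frames do not dominate the gradient. A minimal sketch follows, assuming a fixed contamination fraction `epsilon` and a constant outlier density `k` (both illustrative placeholders, not the paper's settings).

```python
import numpy as np

def eps_contaminated_gaussian_nll(y, mu, sigma, epsilon=0.05, k=1e-3):
    """Negative log-likelihood under an epsilon-contaminated Gaussian.

    Density: (1 - epsilon) * N(y; mu, sigma^2) + epsilon * k.
    Compared with a plain squared-error (Gaussian) loss, large residuals
    are penalized far less, which makes training robust to mis-aligned
    or mis-labelled frames. epsilon and k are illustrative values only.
    """
    gauss = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)
    density = (1.0 - epsilon) * gauss + epsilon * k
    return -np.sum(np.log(density))
```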

Cited by 81 publications (50 citation statements)
References 34 publications
“…The hyperparameters σ_g, λ_ga, and λ_cp were 0.4, 10,000, and 10, respectively. The batch size, number of epochs, and reduction factor [49] were 32, 1,000, and 5. We used the Adam optimizer [50] and varied the learning rate over the course of training [51].…”
Section: Methods (mentioning)
confidence: 99%
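Reference [51] is not resolvable from this page, but "varied the learning rate over the course of training" typically means a warmup-then-decay schedule. Below is a hedged sketch of one common choice, a Transformer-style inverse-square-root decay with linear warmup; the cited work may use a different schedule, and `d_model` and `warmup_steps` are illustrative values.

```python
def warmup_inverse_sqrt_lr(step, d_model=512, warmup_steps=4000, base_lr=1.0):
    """Learning rate that ramps up linearly, then decays as 1/sqrt(step).

    Illustrative schedule only:
    lr = base_lr * d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)
    """
    step = max(step, 1)
    return base_lr * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Example: feed the returned value to the optimizer before each update.
print(round(warmup_inverse_sqrt_lr(100), 6), round(warmup_inverse_sqrt_lr(10000), 6))
```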
“…In this section, we look at how to tailor deep learning to mobile networking applications from three perspectives, namely, mobile devices and systems, distributed data centers, and changing mobile network environments.
[513]: filter size shrinking, reducing input channels and late downsampling (CNN)
Howard et al. [514]: depth-wise separable convolution (CNN)
Zhang et al. [515]: point-wise group convolution and channel shuffle (CNN)
Zhang et al. [516]: Tucker decomposition (AE)
Cao et al. [517]: data parallelization by RenderScript (RNN)
Chen et al. [518]: space exploration for data reusability and kernel redundancy removal (CNN)
Rallapalli et al. [519]: memory optimizations (CNN)
Lane et al. [520]: runtime layer compression and deep architecture decomposition (MLP, CNN)
Huynh et al. [521]: caching, Tucker decomposition and computation offloading (CNN)
Wu et al. [522]: parameters quantization (CNN)
Bhattacharya and Lane [523]: sparsification of fully-connected layers and separation of convolutional kernels (MLP, CNN)
Georgiev et al. [97]: representation sharing (MLP)
Cho and Brand [524]: convolution operation optimization (CNN)
Guo and Potkonjak [525]: filters and classes pruning (CNN)
Li et al. [526]: cloud assistance and incremental learning (CNN)
Zen et al. [527]: weight quantization (LSTM)
Falcao et al. [528]: parallelization and memory sharing (stacked AE)
Fang et al. [529]: model pruning and recovery scheme (CNN)
Xu et al. [530]: reusable region lookup and reusable region propagation scheme (CNN)…”
Section: Tailoring Deep Learning to Mobile Networks (mentioning)
confidence: 99%
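The "weight quantization (LSTM)" entry attributed to Zen et al. [527] refers to the paper indexed on this page, which stores network weights at reduced precision to shrink the model for mobile deployment. Below is a generic sketch of affine 8-bit weight quantization; the paper's exact scheme (bit width, per-matrix vs. per-row scaling) may differ.

```python
import numpy as np

def quantize_weights(w, num_bits=8):
    """Affine quantization of a float weight matrix to signed integers.

    Stores integer codes plus one float scale and zero-point, cutting
    memory roughly 4x versus float32 at 8 bits. Generic sketch, not the
    exact scheme used in the paper.
    """
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scale = max((w.max() - w.min()) / (qmax - qmin), 1e-12)
    zero_point = qmin - w.min() / scale
    codes = np.clip(np.round(w / scale + zero_point), qmin, qmax).astype(np.int8)
    return codes, scale, zero_point

def dequantize_weights(codes, scale, zero_point):
    """Recover approximate float weights for inference."""
    return (codes.astype(np.float32) - zero_point) * scale
```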
“…Beyond these works, researchers also successfully adapt deep learning architectures through other designs and sophisticated optimizations, such as parameters quantization [522], [527], sparsification and separation [523], representation and memory sharing [97], [528], convolution operation optimization [524], pruning [525], cloud assistance [526] and compiler optimization [532]. These techniques will be of great significance when embedding deep neural networks into mobile systems.…”
Section: A. Tailoring Deep Learning to Mobile Devices and Systems (mentioning)
confidence: 99%
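As one concrete instance of the "sparsification" and "pruning" techniques enumerated above, the sketch below zeroes out the smallest-magnitude entries of a weight matrix; it is a generic illustration rather than the specific method of any cited work.

```python
import numpy as np

def magnitude_prune(w, sparsity=0.9):
    """Set the smallest-magnitude weights to zero so that roughly
    `sparsity` fraction of entries become zero.

    The surviving weights can then be kept in a sparse format, reducing
    both model size and multiply-accumulate work at inference time.
    """
    threshold = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) > threshold
    return w * mask, mask
```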
“…Another direction focuses on efficient storage and representation of weights. Various techniques, such as weight sharing within Toeplitz matrices [19], weight tying through effective hashing [20], and appropriate weight quantization [21][22][23], can greatly reduce model size, in some cases at the expense of a slight performance degradation.…”
Section: Related Work (mentioning)
confidence: 99%
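To make "weight tying through effective hashing" concrete, the sketch below implements a HashedNets-style virtual weight matrix whose entries are looked up from a small shared parameter vector via a hash of their (row, column) index; the hash function and sizes are illustrative assumptions, not the cited method's exact design.

```python
import numpy as np

def hashed_matvec(x, shared_params, out_dim, seed=0):
    """Matrix-vector product with a hash-tied virtual weight matrix.

    Instead of storing out_dim * in_dim weights, entry W[i, j] is
    shared_params[h(i, j) % len(shared_params)], so memory is bounded by
    the size of the shared vector. Toy multiplicative hash for illustration.
    """
    in_dim = x.shape[0]
    rows = np.arange(out_dim)[:, None]
    cols = np.arange(in_dim)[None, :]
    idx = (rows * 2654435761 + cols * 40503 + seed) % len(shared_params)
    w = shared_params[idx]  # virtual (out_dim, in_dim) weight matrix
    return w @ x
```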