2016
DOI: 10.48550/arxiv.1606.01305
Preprint

Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations

Abstract: We propose zoneout, a novel method for regularizing RNNs. At each timestep, zoneout stochastically forces some hidden units to maintain their previous values. Like dropout, zoneout uses random noise to train a pseudo-ensemble, improving generalization. But by preserving instead of dropping hidden units, gradient information and state information are more readily propagated through time, as in feedforward stochastic depth networks. We perform an empirical investigation of various RNN regularizers, and find that…
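For concreteness, here is a minimal sketch of the zoneout update described in the abstract: at training time a per-unit Bernoulli mask decides whether a hidden unit keeps its previous value or takes the newly computed one; at inference the expected (mixed) update is used. The function name and the PyTorch framing are assumptions, not the authors' reference code.

```python
# Minimal sketch of the zoneout update (assumed PyTorch-style helper, not the
# authors' reference implementation).
import torch

def zoneout(h_prev, h_new, z_prob, training=True):
    """Stochastically preserve hidden units from the previous timestep.

    With probability z_prob a unit keeps its previous value ("zones out");
    otherwise it takes the newly computed value. At inference the expected
    update (a convex mix of old and new) is used instead of sampling.
    """
    if training:
        # Per-unit Bernoulli mask: 1 -> preserve previous value, 0 -> update.
        mask = torch.bernoulli(torch.full_like(h_new, z_prob))
        return mask * h_prev + (1.0 - mask) * h_new
    return z_prob * h_prev + (1.0 - z_prob) * h_new
```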

Cited by 64 publications (87 citation statements)
References 12 publications

“…The attention modules used have a mixture of 5 logistic distributions and 256-dimensional feed-forward layers. Dropout regularization [33] of rate 0.5 is applied on all Pre-Net and Post-Net layers and Zoneout [34] of rate 0.1 is applied on LSTM layers. We use the Adam optimizer [35] for training the network parameters with batch size 32.…”
Section: Methods
confidence: 99%
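As a rough illustration of the configuration quoted above, the sketch below applies dropout of rate 0.5 to pre-net-style layers and sets up an Adam optimizer; zoneout of rate 0.1 would wrap the LSTM cell's state updates as in the earlier sketch. Only the rates and the batch size come from the excerpt; dimensions, layer names, and the learning rate are placeholders.

```python
# Illustrative setup only: dropout rate 0.5, Adam, batch size 32 come from the
# excerpt; dimensions, layer names, and learning rate are placeholders.
import torch
import torch.nn as nn

pre_net = nn.Sequential(
    nn.Linear(80, 256), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(256, 256), nn.ReLU(), nn.Dropout(p=0.5),
)
decoder_lstm = nn.LSTMCell(256, 1024)  # zoneout (rate 0.1) would wrap this
                                       # cell's state updates at each step
optimizer = torch.optim.Adam(
    list(pre_net.parameters()) + list(decoder_lstm.parameters()), lr=1e-3
)
batch_size = 32
```
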
“…Where β_i(t) is the self-attention coefficients of temporal patches, W_q, W_k, W_i are learnable parameters, and d is the feature dimension of z_i. A layer normalization operation [35] is added after the transaction among all patches.…”
Section: Sparse Temporal Transformer
confidence: 99%
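The excerpt above describes self-attention coefficients over temporal patches with learnable projections and a layer normalization applied afterwards. The following is a generic scaled dot-product sketch under those assumptions; W_q and W_k follow the excerpt, while the value projection (W_v here) and the exact scaling and normalization placement may differ from the cited paper.

```python
# Generic scaled dot-product self-attention over temporal patches, followed by
# layer normalization. W_q and W_k follow the excerpt; W_v and the exact
# scaling/normalization placement are assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchSelfAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.W_q = nn.Linear(d, d, bias=False)  # query projection
        self.W_k = nn.Linear(d, d, bias=False)  # key projection
        self.W_v = nn.Linear(d, d, bias=False)  # value projection (assumed)
        self.norm = nn.LayerNorm(d)             # applied after patch interaction
        self.d = d

    def forward(self, z):
        # z: (batch, num_patches, d) patch features
        q, k, v = self.W_q(z), self.W_k(z), self.W_v(z)
        # beta: attention coefficients among temporal patches
        beta = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d), dim=-1)
        return self.norm(beta @ v)
```
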
“…Moreover, we introduce other architectural changes in the multispeaker Tacotron 2, thereby enhancing the quality of the alignment process: (a) the speaker embedding vector is passed through an additional linear layer to stimulate the extraction of more meaningful speaker characteristics; (b) a skip connection represented by the concatenation of the first decoder LSTM output with the attention context vector is added, as shown in Figure 2; (c) the previous time step context vector, c_{i−1}, is used to predict the next mel-spectrogram frame in (9). In addition to the regularizations proposed for the original single-speaker Tacotron 2 [2], we apply dropout [29] with probability 0.1 to the input of the dynamic convolution filters (13) and increase the zoneout [30] probability for the second decoder LSTM layer to 0.15. In practice, it was found that all of these changes result in improved alignment consistency.…”
Section: Zero-shot Long-form Voice Cloning
confidence: 99%
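A hedged sketch of changes (a) and (b) from the excerpt above: the speaker embedding passes through an extra linear layer, and the first decoder LSTM's output is concatenated with the attention context vector before feeding the second LSTM. Module names and dimensions are illustrative placeholders; zoneout with probability 0.15 would be applied to the second cell's state during training, as noted in the text.

```python
# Illustrative decoder step showing (a) the extra linear layer on the speaker
# embedding and (b) the skip connection that concatenates the first LSTM's
# output with the attention context. Dimensions and names are placeholders.
import torch
import torch.nn as nn

class DecoderStepSketch(nn.Module):
    def __init__(self, spk_dim=256, ctx_dim=512, lstm_dim=1024):
        super().__init__()
        self.spk_proj = nn.Linear(spk_dim, spk_dim)              # change (a)
        self.lstm1 = nn.LSTMCell(ctx_dim + spk_dim, lstm_dim)
        self.lstm2 = nn.LSTMCell(lstm_dim + ctx_dim, lstm_dim)   # change (b)

    def forward(self, context, spk_emb, state1, state2):
        spk = self.spk_proj(spk_emb)
        h1, c1 = self.lstm1(torch.cat([context, spk], dim=-1), state1)
        # (b) concatenate first LSTM output with the attention context vector
        h2, c2 = self.lstm2(torch.cat([h1, context], dim=-1), state2)
        # zoneout (p = 0.15) would be applied to (h2, c2) during training
        return (h1, c1), (h2, c2)
```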