2016
DOI: 10.48550/arxiv.1606.01305
Preprint

Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations

Abstract: We propose zoneout, a novel method for regularizing RNNs. At each timestep, zoneout stochastically forces some hidden units to maintain their previous values. Like dropout, zoneout uses random noise to train a pseudo-ensemble, improving generalization. But by preserving instead of dropping hidden units, gradient information and state information are more readily propagated through time, as in feedforward stochastic depth networks. We perform an empirical investigation of various RNN regularizers, and find that…
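For concreteness, here is a minimal sketch of the zoneout update described in the abstract: at training time a per-unit Bernoulli mask decides whether a hidden unit keeps its previous value or takes the newly computed one; at inference the expected (mixed) update is used. The function name and the PyTorch framing are assumptions, not the authors' reference code.

```python
# Minimal sketch of the zoneout update (assumed PyTorch-style helper, not the
# authors' reference implementation).
import torch

def zoneout(h_prev, h_new, z_prob, training=True):
    """Stochastically preserve hidden units from the previous timestep.

    With probability z_prob a unit keeps its previous value ("zones out");
    otherwise it takes the newly computed value. At inference the expected
    update (a convex mix of old and new) is used instead of sampling.
    """
    if training:
        # Per-unit Bernoulli mask: 1 -> preserve previous value, 0 -> update.
        mask = torch.bernoulli(torch.full_like(h_new, z_prob))
        return mask * h_prev + (1.0 - mask) * h_new
    return z_prob * h_prev + (1.0 - z_prob) * h_new
```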

Cited by 64 publications (87 citation statements)
References 12 publications

“…The attention modules used have a mixture of 5 logistic distributions and 256-dimensional feed-forward layers. Dropout regularization [33] of rate 0.5 is applied on all Pre-Net and Post-Net layers and Zoneout [34] of rate 0.1 is applied on LSTM layers. We use the Adam optimizer [35] for training the network parameters with batch size 32.…”
Section: Methods
confidence: 99%
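As a rough illustration of the configuration quoted above, the sketch below applies dropout of rate 0.5 to pre-net-style layers and sets up an Adam optimizer; zoneout of rate 0.1 would wrap the LSTM cell's state updates as in the earlier sketch. Only the rates and the batch size come from the excerpt; dimensions, layer names, and the learning rate are placeholders.

```python
# Illustrative setup only: dropout rate 0.5, Adam, batch size 32 come from the
# excerpt; dimensions, layer names, and learning rate are placeholders.
import torch
import torch.nn as nn

pre_net = nn.Sequential(
    nn.Linear(80, 256), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(256, 256), nn.ReLU(), nn.Dropout(p=0.5),
)
decoder_lstm = nn.LSTMCell(256, 1024)  # zoneout (rate 0.1) would wrap this
                                       # cell's state updates at each step
optimizer = torch.optim.Adam(
    list(pre_net.parameters()) + list(decoder_lstm.parameters()), lr=1e-3
)
batch_size = 32
```
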
“…Where β_i(t) is the self-attention coefficients of temporal patches, W_q, W_k, W_i are learnable parameters, and d is the feature dimension of z_i. A layer normalization operation [35] is added after the transaction among all patches.…”
Section: Sparse Temporal Transformer
confidence: 99%
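The excerpt above describes self-attention coefficients over temporal patches with learnable projections and a layer normalization applied afterwards. The following is a generic scaled dot-product sketch under those assumptions; W_q and W_k follow the excerpt, while the value projection (W_v here) and the exact scaling and normalization placement may differ from the cited paper.

```python
# Generic scaled dot-product self-attention over temporal patches, followed by
# layer normalization. W_q and W_k follow the excerpt; W_v and the exact
# scaling/normalization placement are assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchSelfAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.W_q = nn.Linear(d, d, bias=False)  # query projection
        self.W_k = nn.Linear(d, d, bias=False)  # key projection
        self.W_v = nn.Linear(d, d, bias=False)  # value projection (assumed)
        self.norm = nn.LayerNorm(d)             # applied after patch interaction
        self.d = d

    def forward(self, z):
        # z: (batch, num_patches, d) patch features
        q, k, v = self.W_q(z), self.W_k(z), self.W_v(z)
        # beta: attention coefficients among temporal patches
        beta = F.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d), dim=-1)
        return self.norm(beta @ v)
```
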
“…Moreover, we introduce other architectural changes in the multispeaker Tacotron 2, thereby enhancing the quality of the alignment process: (a) the speaker embedding vector is passed through an additional linear layer to stimulate the extraction of more meaningful speaker characteristics; (b) a skip connection represented by the concatenation of the first decoder LSTM output with the attention context vector is added, as shown in Figure 2; (c) the previous time step context vector, c_{i−1}, is used to predict the next mel-spectrogram frame in (9). In addition to the regularizations proposed for the original single-speaker Tacotron 2 [2], we apply dropout [29] with probability 0.1 to the input of the dynamic convolution filters (13) and increase the zoneout [30] probability for the second decoder LSTM layer to 0.15. In practice, it was found that all of these changes result in improved alignment consistency.…”
Section: Zero-shot Long-form Voice Cloning
confidence: 99%
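A hedged sketch of changes (a) and (b) from the excerpt above: the speaker embedding passes through an extra linear layer, and the first decoder LSTM's output is concatenated with the attention context vector before feeding the second LSTM. Module names and dimensions are illustrative placeholders; zoneout with probability 0.15 would be applied to the second cell's state during training, as noted in the text.

```python
# Illustrative decoder step showing (a) the extra linear layer on the speaker
# embedding and (b) the skip connection that concatenates the first LSTM's
# output with the attention context. Dimensions and names are placeholders.
import torch
import torch.nn as nn

class DecoderStepSketch(nn.Module):
    def __init__(self, spk_dim=256, ctx_dim=512, lstm_dim=1024):
        super().__init__()
        self.spk_proj = nn.Linear(spk_dim, spk_dim)              # change (a)
        self.lstm1 = nn.LSTMCell(ctx_dim + spk_dim, lstm_dim)
        self.lstm2 = nn.LSTMCell(lstm_dim + ctx_dim, lstm_dim)   # change (b)

    def forward(self, context, spk_emb, state1, state2):
        spk = self.spk_proj(spk_emb)
        h1, c1 = self.lstm1(torch.cat([context, spk], dim=-1), state1)
        # (b) concatenate first LSTM output with the attention context vector
        h2, c2 = self.lstm2(torch.cat([h1, context], dim=-1), state2)
        # zoneout (p = 0.15) would be applied to (h2, c2) during training
        return (h1, c1), (h2, c2)
```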