2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2016.7472159
Batch normalized recurrent neural networks

Abstract: Recurrent Neural Networks (RNNs) are powerful models for sequential data that have the potential to learn long-term dependencies. However, they are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks [1]. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training…
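For context, here is a minimal sketch (not the authors' code; names and shapes are illustrative assumptions) of the batch normalization transform the abstract refers to: each feature is standardized with mini-batch statistics and then rescaled by learnable parameters.

import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (batch, features) mini-batch; gamma, beta: (features,) learnable scale/shift."""
    mean = x.mean(dim=0)                        # per-feature mini-batch mean
    var = x.var(dim=0, unbiased=False)          # per-feature mini-batch variance
    x_hat = (x - mean) / torch.sqrt(var + eps)  # standardized features
    return gamma * x_hat + beta                 # learnable rescale and shift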

Cited by 170 publications (116 citation statements)
References 14 publications
“…That system only contains the deep clustering part, which corresponds to α = 1 in the hybrid system. In the MIREX system, dropout layers with probability 0.2 were added between each feed-forward connection, and sequence-wise batch normalization [20] was applied in the input-to-hidden transformation in each BLSTM layer. Similarly to [13], we also applied a curriculum learning strategy [21], where we first train the network on segments of 100 frames, then train on segments of 500 frames.…”
Section: Evaluation and Discussion
confidence: 99%
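As a reading aid, the following is a minimal sketch, assuming PyTorch and a (time, batch, features) tensor layout, of the sequence-wise batch normalization the quote refers to: statistics are pooled over both the time and batch dimensions of the input-to-hidden pre-activations, so every time step shares the same normalization. The function name and shapes are assumptions for illustration, not the cited system's code.

import torch

def sequence_wise_batchnorm(pre_act, gamma, beta, eps=1e-5):
    """pre_act: (T, B, F) input-to-hidden pre-activations W_x x_t for a batch of
    sequences; gamma, beta: (F,) learnable scale and shift."""
    # Pool mean and variance over time (dim 0) and batch (dim 1), per feature.
    mean = pre_act.mean(dim=(0, 1), keepdim=True)
    var = pre_act.var(dim=(0, 1), unbiased=False, keepdim=True)
    normalized = (pre_act - mean) / torch.sqrt(var + eps)
    return gamma * normalized + beta

# Schematic use inside each (B)LSTM step: only the input-to-hidden term is
# normalized, i.e. gates_t = BN(W_x x_t) + W_h h_{t-1} + b.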
“…In [33], the authors suggest applying it to feed-forward connections only, while in [34] the normalization step is extended to recurrent connections, using separate statistics for each time step. In this work, we tried both approaches and observed comparable performance between them.…”
Section: Batch Normalization
confidence: 99%
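To make the contrast in the quote concrete, here is a schematic sketch (assumed shapes and names, not code from either reference) of keeping separate batch-normalization statistics per time step for the recurrent term, versus normalizing only the feed-forward term.

import torch
import torch.nn as nn

class PerTimestepBN(nn.Module):
    """Keeps an independent BatchNorm1d per time step (up to max_steps), so the
    recurrent pre-activations at step t are normalized with step-t statistics."""
    def __init__(self, num_features, max_steps):
        super().__init__()
        self.bns = nn.ModuleList(nn.BatchNorm1d(num_features) for _ in range(max_steps))

    def forward(self, x_t, t):
        # Reuse the last time step's statistics for sequences longer than max_steps.
        return self.bns[min(t, len(self.bns) - 1)](x_t)

# Feed-forward-only placement:  gates_t = BN(W_x x_t) + W_h h_{t-1} + b
# Recurrent placement:          gates_t = BN(W_x x_t) + BN_t(W_h h_{t-1}) + b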
“…There have been many research efforts, from the algorithm point of view, on binarizing or quantizing feedforward neural networks such as CNNs and MLPs [23], [24]. Binarizing an LSTM is more challenging than binarizing a CNN or MLP because it is difficult to adopt back-end techniques like batch normalization in a recurrent neural network [25]. Instead, quantized LSTMs have been studied, and it has been shown that low quantization bit-widths can be achieved by quantizing the weights and hidden state during forward propagation and using a straight-through estimator (STE) to propagate the gradient for weight updates [26], [27].…”
Section: Vector-Matrix Multiplication Accelerated by NVM Array
confidence: 99%
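For illustration only, the sketch below shows the straight-through estimator (STE) idea the quote mentions: weights are quantized to a low bit-width in the forward pass while gradients flow to the full-precision weights unchanged. The uniform quantizer and bit-width are assumptions, not the cited papers' exact scheme.

import torch

class QuantizeSTE(torch.autograd.Function):
    """Quantize in the forward pass, pass gradients straight through in backward."""
    @staticmethod
    def forward(ctx, w, bits):
        # Uniform symmetric quantization of weights clipped to [-1, 1].
        levels = 2 ** (bits - 1) - 1
        w_clipped = w.clamp(-1.0, 1.0)
        return torch.round(w_clipped * levels) / levels

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: treat quantization as the identity,
        # so the gradient reaches the underlying full-precision weights.
        return grad_output, None

# Usage (schematic): inside the LSTM's matrix multiplications, replace the
# full-precision weight w with QuantizeSTE.apply(w, 2); the optimizer still
# updates the stored full-precision w.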