Interspeech 2017
DOI: 10.21437/interspeech.2017-775

Improving Speech Recognition by Revising Gated Recurrent Units

Abstract: Speech recognition is largely taking advantage of deep learning, showing that substantial benefits can be obtained by modern Recurrent Neural Networks (RNNs). The most popular RNNs are Long Short-Term Memory (LSTMs), which typically reach state-of-the-art performance in many tasks thanks to their ability to learn long-term dependencies and robustness to vanishing gradients. Nevertheless, LSTMs have a rather complex design with three multiplicative gates that might impair their efficient implementation. An att…
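For context on the units being revised (the abstract is truncated above), here is a minimal sketch of the standard GRU update that the paper starts from, written as a PyTorch cell; the class name, layer sizes, and gate conventions are illustrative assumptions rather than the paper's own code.

    import torch
    import torch.nn as nn

    class GRUCellSketch(nn.Module):
        # Standard GRU update (reset gate r, update gate z, candidate state);
        # an illustrative sketch of the architecture the paper revises.
        def __init__(self, input_size, hidden_size):
            super().__init__()
            self.wz = nn.Linear(input_size, hidden_size)                # update gate, input part
            self.uz = nn.Linear(hidden_size, hidden_size, bias=False)   # update gate, recurrent part
            self.wr = nn.Linear(input_size, hidden_size)                # reset gate, input part
            self.ur = nn.Linear(hidden_size, hidden_size, bias=False)   # reset gate, recurrent part
            self.wh = nn.Linear(input_size, hidden_size)                # candidate, input part
            self.uh = nn.Linear(hidden_size, hidden_size, bias=False)   # candidate, recurrent part

        def forward(self, x, h):
            z = torch.sigmoid(self.wz(x) + self.uz(h))         # update gate
            r = torch.sigmoid(self.wr(x) + self.ur(h))         # reset gate
            h_cand = torch.tanh(self.wh(x) + self.uh(r * h))   # candidate state
            return z * h + (1.0 - z) * h_cand                  # new hidden state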

Cited by 44 publications (46 citation statements)
References 26 publications

“…This work uses hybrid HMM-DNN speech recognizers. TIMIT and DIRHA experiments are performed with the PyTorch-Kaldi toolkit [35] using a six-layer multi-layer perceptron and a light GRU [36], respectively. The performance reported on TIMIT is the average of the phone error rates (PER%) obtained by running each experiment three times with different seeds.…”
Section: Corpora and ASR Setup
confidence: 99%
“…The third row shows the improvement achieved when adding recurrent dropout. Similarly to [40,41], we applied the same dropout mask for all the time steps to avoid gradient vanishing problems. The fourth line, instead, shows the benefits derived from batch normalization [18].…”
Section: Baselines
confidence: 99%
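As a concrete illustration of the excerpt above, here is a minimal sketch of recurrent dropout with a single mask shared across all time steps; the class and variable names are illustrative assumptions, not the cited papers' code.

    import torch
    import torch.nn as nn

    class SharedMaskDropout(nn.Module):
        # Inverted dropout whose mask is sampled once per sequence and then
        # reused at every time step, as described in the excerpt above.
        def __init__(self, p=0.2):
            super().__init__()
            self.p = p

        def new_mask(self, h):
            keep = 1.0 - self.p
            return torch.bernoulli(torch.full_like(h, keep)) / keep

        def forward(self, h, mask):
            return h * mask if self.training else h

    # Usage inside a recurrent loop (shapes and names are illustrative):
    # drop = SharedMaskDropout(p=0.2)
    # mask = drop.new_mask(h)               # sampled once for the whole sequence
    # for x_t in inputs:
    #     h = cell(x_t, drop(h, mask))      # the same mask at every time step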
“…Zhou et al. [43] accomplished this very 'minimal' gated unit by using a single gate for both resetting and updating the cell's internal state. Ravanelli et al. [31] extended that work further by highlighting a redundancy between the two gates. They deduced that in applications such as speech recognition, where signals change slowly, reset gates are unnecessary and can be omitted altogether.…”
Section: Single Gate Mechanism
confidence: 91%
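To make the reset-gate removal concrete, here is a minimal sketch of a single-gate recurrent cell along the lines described above; names and sizes are illustrative assumptions. Following my understanding of Ravanelli et al.'s light GRU, the candidate state uses a ReLU activation, while the batch normalization that work applies to the feed-forward connections is omitted for brevity.

    import torch
    import torch.nn as nn

    class SingleGateCell(nn.Module):
        # Reset-gate-free recurrent update: only the update gate z is kept and
        # the candidate state reads the previous hidden state directly
        # (illustrative sketch, not the authors' implementation).
        def __init__(self, input_size, hidden_size):
            super().__init__()
            self.wz = nn.Linear(input_size, hidden_size)                # update gate, input part
            self.uz = nn.Linear(hidden_size, hidden_size, bias=False)   # update gate, recurrent part
            self.wh = nn.Linear(input_size, hidden_size)                # candidate, input part
            self.uh = nn.Linear(hidden_size, hidden_size, bias=False)   # candidate, recurrent part

        def forward(self, x, h):
            z = torch.sigmoid(self.wz(x) + self.uz(h))    # single (update) gate
            h_cand = torch.relu(self.wh(x) + self.uh(h))  # no reset gate applied to h
            return z * h + (1.0 - z) * h_cand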
“…However, in applications where events of interest are abrupt and isolated (e.g., detecting cough sounds), the assumption by Ravanelli et al. [31] that state resets are irrelevant does not hold. In fact, we found that without state resets, recurrent units in our application are unable to recover from large impulse signals.…”
Section: Single Gate Mechanism
confidence: 99%