2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2017.7953084

A network of deep neural networks for Distant Speech Recognition

Abstract: Despite the remarkable progress recently made in distant speech recognition, state-of-the-art technology still suffers from a lack of robustness, especially when adverse acoustic conditions characterized by non-stationary noises and reverberation are met. A prominent limitation of current systems lies in the lack of matching and communication between the various technologies involved in the distant speech recognition process. The speech enhancement and speech recognition modules are, for instance, often trained…
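The abstract's point about modules that do not communicate can be illustrated with a joint-training sketch. This is not the paper's actual implementation; the module sizes, label inventory, and optimizer below are assumptions chosen only to show gradients flowing from the recognizer back into the enhancement front-end instead of training each module in isolation.

```python
import torch
import torch.nn as nn

# Hedged sketch (names and sizes assumed, not taken from the paper):
# an enhancement DNN feeding a recognition DNN, optimized jointly so
# that the two modules "communicate" through a shared gradient pass.
enhance = nn.Sequential(nn.Linear(40, 512), nn.ReLU(), nn.Linear(512, 40))
recognize = nn.Sequential(nn.Linear(40, 512), nn.ReLU(), nn.Linear(512, 48))

opt = torch.optim.SGD(
    list(enhance.parameters()) + list(recognize.parameters()), lr=0.01
)
ce = nn.CrossEntropyLoss()

noisy = torch.randn(8, 40)           # a minibatch of noisy feature frames
labels = torch.randint(0, 48, (8,))  # frame-level targets (48 classes, assumed)

logits = recognize(enhance(noisy))   # the enhancement output feeds the recognizer
loss = ce(logits, labels)
loss.backward()                      # one gradient pass through both DNNs
opt.step()
```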

Cited by 32 publications (29 citation statements)
References 30 publications
“…Batch normalization [3] has recently been proposed in the machine learning community and addresses the so-called internal covariate shift problem by normalizing the mean and the variance of each layer's pre-activations for each training minibatch. Several works have already shown that this technique is effective both for improving system performance and for speeding up the training procedure [20], [22], [37], [49], [50]. Batch normalization can be applied to RNNs in different ways.…”
Section: Batch Normalization
confidence: 99%
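As a concrete illustration of the normalization described in the quote above, here is a minimal PyTorch sketch; the layer sizes and the ReLU nonlinearity are assumptions for the example, not details taken from the cited works.

```python
import torch
import torch.nn as nn

class BatchNormLayer(nn.Module):
    """One feed-forward layer with batch normalization on its pre-activations."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)
        # BatchNorm1d normalizes each feature's mean/variance over the minibatch
        self.bn = nn.BatchNorm1d(out_dim)

    def forward(self, x):
        # Normalize the pre-activations, then apply the nonlinearity.
        return torch.relu(self.bn(self.linear(x)))

layer = BatchNormLayer(40, 512)
x = torch.randn(8, 40)  # a minibatch of 8 feature vectors
y = layer(x)            # pre-activations normalized per minibatch
```

During training, `BatchNorm1d` normalizes each pre-activation feature over the minibatch and learns a per-feature scale and shift; at inference it falls back on running estimates of the minibatch statistics.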
“…Despite the progress of the last decade, state-of-the-art speech recognizers are still far from reaching satisfactory robustness and flexibility. This lack of robustness typically emerges when facing challenging acoustic conditions [13], characterized by considerable levels of non-stationary noise and acoustic reverberation [14]–[22]. The development of robust ASR has recently been fostered by the great success of international challenges such as CHiME [23], REVERB [24] and ASpIRE [25], which have also been extremely useful for establishing common evaluation frameworks among researchers.…”
Section: Introduction
confidence: 99%
“…Deep learning has shown remarkable success in numerous speech tasks [1], including speech [2,3] and speaker recognition [4]. This paradigm exploits the principle of compositionality to efficiently describe the world around us and employs a hierarchy of representations that are progressively learned by combining lower-level abstractions.…”
Section: Introduction
confidence: 99%
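The hierarchy-of-representations idea in the quote above can be sketched in a few lines; the dimensions and the output class count are hypothetical, used only to show layers composing lower-level abstractions into higher-level ones.

```python
import torch
import torch.nn as nn

# Illustrative only: each layer combines the abstractions produced by the
# layer below it, forming a progressively higher-level representation.
model = nn.Sequential(
    nn.Linear(40, 512), nn.ReLU(),   # low-level patterns in the input features
    nn.Linear(512, 512), nn.ReLU(),  # intermediate abstractions
    nn.Linear(512, 48),              # high-level outputs (48 classes, assumed)
)

frames = torch.randn(8, 40)  # a minibatch of 40-dimensional feature frames
scores = model(frames)       # shape: (8, 48)
```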
“…In the past few years, deep neural networks (DNNs) [1] have made tremendous advances, in some cases surpassing human-level performance, tackling challenging problems such as speech recognition [2], [3], natural language processing [4], [5], image classification [6], [7], [8], and machine translation [9]. Training large DNNs, however, is a time-consuming and computationally intensive task that demands datacenter-scale computational resources composed of state-of-the-art GPUs [6], [10].…”
Section: Introduction
confidence: 99%