The PyTorch-Kaldi Speech Recognition Toolkit

Ravanelli, Mirco; Parcollet, Titouan; Bengio, Yoshua

doi:10.1109/icassp.2019.8683713

Cited by 217 publications

(158 citation statements)

References 22 publications

Supporting

Mentioning

135

Contrasting

“…This work uses hybrid HMM-DNN speech recognizers. TIMIT and DIRHA experiments are performed with the PyTorch-Kaldi toolkit [35] using a six-layer multi-layer perceptron and a light GRU [36], respectively. The performance reported on TIMIT is the average of the phone error rates (PER%) obtained by running each experiment three times with different seeds.…”

Section: Corpora and ASR Setup (mentioning)

confidence: 99%

Multi-Task Self-Supervised Learning for Robust Speech Recognition

Ravanelli, Zhong, Pascual, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

225

140

Despite the growing interest in unsupervised learning, extracting meaningful knowledge from unlabelled audio remains an open challenge. To take a step in this direction, we recently proposed a problem-agnostic speech encoder (PASE), that combines a convolutional encoder followed by multiple neural networks, called workers, tasked to solve self-supervised problems (i.e., ones that do not require manual annotations as ground truth). PASE was shown to capture relevant speech information, including speaker voice-print and phonemes. This paper proposes PASE+, an improved version of PASE for robust speech recognition in noisy and reverberant environments. To this end, we employ an online speech distortion module, that contaminates the input signals with a variety of random disturbances. We then propose a revised encoder that better learns short- and long-term speech dynamics with an efficient combination of recurrent and convolutional networks. Finally, we refine the set of workers used in self-supervision to encourage better cooperation. Results on TIMIT, DIRHA and CHiME-5 show that PASE+ significantly outperforms both the previous version of PASE as well as common acoustic features. Interestingly, PASE+ learns transferable representations suitable for highly mismatched acoustic conditions.

“…The active development of open-source software toolkits plays a significant role in the rapid progress of ASR research, instances include the Kaldi (Povey et al, 2011) and ESPnet (Watanabe et al, 2018). In this work, we demonstrate that state-of-the-art SNN acoustic models can be easily developed in PyTorch and integrated into the PyTorch-Kaldi Speech Recognition Toolkit (Ravanelli et al, 2019). This software toolkit integrates the efficiency of Kaldi and the flexibility of PyTorch, therefore, it can support the rapid development of SNN-based ASR systems.…”

Section: Development of SNN-based ASR Systems (mentioning)

confidence: 73%

“…All ASR experiments are performed using the PyTorch-Kaldi ASR toolkit (Ravanelli et al, 2019). This recently introduced toolkit inherits the flexibility of PyTorch toolkit (Paszke et al, 2017) for ANN-based acoustic model development and the efficiency of Kaldi ASR toolkit (Povey et al, 2011).…”

Section: Implementation Details (mentioning)

confidence: 99%

Deep Spiking Neural Networks for Large Vocabulary Automatic Speech Recognition

Yılmaz, Zhang, et al. 2020

Front. Neurosci.

Artificial neural networks (ANN) have become the mainstream acoustic modeling technique for large vocabulary automatic speech recognition (ASR). A conventional ANN features a multi-layer architecture that requires massive amounts of computation. The brain-inspired spiking neural networks (SNN) closely mimic the biological neural networks and can operate on low-power neuromorphic hardware with spike-based computation. Motivated by their unprecedented energy efficiency and rapid information processing capability, we explore the use of SNNs for speech recognition. In this work, we use SNNs for acoustic modeling and evaluate their performance on several large vocabulary recognition scenarios. The experimental results demonstrate competitive ASR accuracies to their ANN counterparts, while require significantly reduced computational cost and inference time. Integrating the algorithmic power of deep SNNs with energy-efficient neuromorphic hardware, therefore, offer an attractive solution for ASR applications running locally on mobile and embedded devices.

“…The learning rate is halved every-time the loss on the validation set is below a certain threshold fixed to 0.001 to avoid overfitting. Finally, models are implemented with the Pytorch-Kaldi toolkit [18]. While the effectiveness of QLSTM over LSTM has been demonstrated, an LSTM network trained in the same conditions and based on [5] is considered as a baseline.…”

Section: Model Architectures (mentioning)

confidence: 99%

Real to H-Space Encoder for Speech Recognition

Parcollet, Morchid, Linarès, et al. 2019

Interspeech 2019

Self Cite

Deep neural networks (DNNs) and more precisely recurrent neural networks (RNNs) are at the core of modern automatic speech recognition systems, due to their efficiency to process input sequences. Recently, it has been shown that different input representations, based on multidimensional algebras, such as complex and quaternion numbers, are able to bring to neural networks a more natural, compressive and powerful representation of the input signal by outperforming common real-valued NNs. Indeed, quaternion-valued neural networks (QNNs) better learn both internal dependencies, such as the relation between the Mel-filter-bank value of a specific time frame and its time derivatives, and global dependencies, describing the relations that exist between time frames. Nonetheless, QNNs are limited to quaternion-valued input signals, and it is difficult to benefit from this powerful representation with real-valued input data. This paper proposes to tackle this weakness by introducing a real-to-quaternion encoder that allows QNNs to process any one dimensional input features, such as traditional Mel-filter-banks for automatic speech recognition.
