Interspeech 2020
DOI: 10.21437/interspeech.2020-2770

Improved Hybrid Streaming ASR with Transformer Language Models

Cited by 10 publications (12 citation statements)
References 15 publications
“…Afterwards, the mean is dynamically updated for every new frame. In previous works, we proved that two seconds of initial delay should be enough to achieve performance similar to FSN [27], [28]. Although two seconds of delay could be reasonable in a continuous streaming setup, it might not be suitable for short utterances such as voice commands.…”
Section: Acoustic Feature Normalization For Streaming
confidence: 89%
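The excerpt above describes accumulating a feature mean over an initial delay and then updating it dynamically with every new frame. A minimal sketch of that idea follows; `streaming_mean_normalize` and `init_frames` are hypothetical names (not from the cited systems), and the default of 200 frames assumes a 10 ms frame shift, so it corresponds to the two-second initial delay discussed in the quote:

```python
import numpy as np

def streaming_mean_normalize(frames, init_frames=200):
    """Mean-normalize acoustic feature frames with a running mean.

    The first `init_frames` frames are normalized with the mean
    accumulated over that initial window (the "initial delay");
    afterwards the mean is updated incrementally for every new frame.
    """
    frames = np.asarray(frames, dtype=np.float64)
    out = np.empty_like(frames)
    # Mean accumulated over the initial-delay window.
    mean = frames[:init_frames].mean(axis=0)
    count = init_frames
    for t, x in enumerate(frames):
        if t >= init_frames:
            # Dynamic update: fold each new frame into the running mean.
            count += 1
            mean += (x - mean) / count
        out[t] = x - mean
    return out
```

In a real streaming front end the update would run frame by frame as audio arrives; the loop above just makes the incremental mean update explicit.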
“…Not surprisingly, empirical assessment of this extended architecture under strict streaming conditions proved it really effective, indeed keeping pace with non-streaming (offline) systems. The most recent refinement along this research line has consisted in replacing streaming-adapted LSTM-RNN LMs with Transformer LMs [28]. In doing so, empirical results on the well-known LibriSpeech [29] and TED-LIUM [30] tasks have shown that this refinement leads to top, state-of-the-art recognition rates and latencies under streaming conditions.…”
Section: Introduction
confidence: 99%
“…The most successful end-to-end ASR systems are based on connectionist temporal classification (CTC) [18], the recurrent neural network (RNN) transducer (RNN-T) [17], and attention-based encoder-decoder architectures [19]. Recently, hybrid model systems have shown significant improvements in accuracy for streaming ASR [20,21]. The Transformer is a sequence-to-sequence architecture originally proposed for machine translation [22].…”
Section: Background: ASR
confidence: 99%
“…Directly modelling long-span word histories using conventional back-off n-gram models [1] generally leads to a severe data sparsity issue [2]. To this end, over the past few decades there have been significant efforts in the speech technology community to develop artificial neural network based language modelling techniques [3]–[14]. Neural network language models (NNLMs), which represent longer-span history contexts in a continuous, lower-dimensional vector space, are used to improve generalization performance.…”
Section: Introduction
confidence: 99%
“…With the rapid progress of deep neural network (DNN) based ASR technologies in recent decades, the underlying network architectures of NNLMs have evolved from feedforward structures [3]–[7] to more advanced variants represented by long short-term memory recurrent neural networks (LSTM-RNNs) [8]–[10], [15] and, more recently, neural Transformers [11]–[14], [16] designed for modelling longer-range contexts. In particular, Transformer based language models have in recent years defined state-of-the-art performance across a range of ASR task domains [11]–[14], [17]. These models [11]–[13], [17] are often constructed by deeply stacking multiple self-attention based neural building blocks [18]–[20], each of which also includes residual connections [21] and layer normalization modules [22].…”
Section: Introduction
confidence: 99%
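The excerpt above describes the standard Transformer LM building block: a self-attention sub-layer plus a feed-forward sub-layer, each with a residual connection and layer normalization. A minimal numpy sketch of one such block is shown below; it is an illustrative single-head, pre-norm variant, not the exact configuration of any cited system, and all function and parameter names (`transformer_lm_block`, `Wq`, `W1`, etc.) are hypothetical:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector (layer normalization [22]).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention over a (T, d) sequence.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Causal mask: for language modelling, each position may only
    # attend to itself and earlier positions.
    T = scores.shape[0]
    scores = np.where(np.tril(np.ones((T, T), bool)), scores, -1e9)
    return softmax(scores) @ v

def transformer_lm_block(x, params):
    # Self-attention sub-layer with residual connection (pre-norm).
    x = x + self_attention(layer_norm(x),
                           params["Wq"], params["Wk"], params["Wv"])
    # Position-wise feed-forward sub-layer with residual connection.
    h = np.maximum(0.0, layer_norm(x) @ params["W1"])
    return x + h @ params["W2"]
```

A full Transformer LM deep-stacks many such blocks, as the quote notes, and adds token embeddings, positional information, and an output softmax over the vocabulary.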