ICASSP 2020 - IEEE International Conference on Acoustics, Speech and Signal Processing
DOI: 10.1109/icassp40776.2020.9054098
Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR

Cited by 42 publications (70 citation statements)
References 24 publications
“…In this section, we present latency metrics for streaming ASR and techniques to reduce them. Different from the average token latency [21], we adopt metrics that are directly related to streaming speech applications such as Voice Search and Assistant.…”
Section: Latency Improvements (mentioning)
confidence: 99%
“…It has been shown to give significantly lower latency while retaining recognition accuracy on different RNN-T models. More importantly, FastEmit does not require any prior alignment information [20,21] and has no additional training or serving cost.…”
Section: FastEmit (mentioning)
confidence: 99%
“…Using wordpieces as labels requires input embeddings, and no pointers are provided for leveraging the already existing vast linguistic resources in non-wordpiece form. In [20], an attention-based sequence-to-sequence model with pre-training on frame-wise classification tasks is presented for achieving streaming capability. In [24], a latency-controlled bidirectional LSTM with 1.2-second chunks is used.…”
Section: Background and Related Work (mentioning)
confidence: 99%
“…In order to achieve simultaneous speech-to-speech translation (SSST), to the best of our knowledge, most recent approaches (Oda et al., 2014) dismantle the entire system into a three-step pipeline: streaming Automatic Speech Recognition (ASR) (Sainath et al., 2020; Inaguma et al., 2020; Li et al., 2020), simultaneous Text-to-Text translation (sT2T) (Gu et al., 2017; Ma et al., 2019; Arivazhagan et al., 2019), and Text-to-Speech (TTS) synthesis (Wang et al., 2017; Ping et al., 2017; Oord et al., 2017). Most recent efforts mainly focus on sT2T, which is considered the key component to further reduce the translation latency and improve the translation quality for the entire pipeline.…”
Section: Introduction (mentioning)
confidence: 99%