LSTM-Based One-Pass Decoder for Low-Latency Streaming

Jorge, Javier; Giménez, Adrià; Iranzo-Sánchez, Javier; Silvestre-Cerdà, Joan Albert; Civera, Jorge; Sanchís, Alberto; Juan, Alfons

doi:10.1109/icassp40776.2020.9054267

Cited by 13 publications

(12 citation statements)

References 16 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…S(K) RW(K) V(K) Opensubtitles [22] 212635 1146861 1576 UFAL [23] 92873 910728 2179 Wikipedia [24] 32686 586068 3373 UN [25] 11196 For streaming ASR, as the full sequence (context) is not available during decoding, BLSTM AMs are queried with a sliding, overlapping context window of limited size over the input sequence, averaging outputs of all windows for each frame to obtain the corresponding acoustic score [30]. The size of the context window (in frames or seconds) is set in decoding, and defines the theoretical latency of the system.…”

Section: Corpusmentioning

confidence: 99%

MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge

Jorge¹,

Giménez²,

Baquero-Arnal³

et al. 2021

IberSPEECH 2021

Self Cite

View full text Add to dashboard Cite

This paper describes the automatic speech recognition (ASR) systems built by the MLLP-VRAIN research group of Universitat Politècnica de València for the Albayzin-RTVE 2020 Speech-to-Text Challenge.The primary system (p-streaming 1500ms nlt) was a hybrid BLSTM-HMM ASR system using streaming one-pass decoding with a context window of 1.5 seconds and a linear combination of an n-gram, a LSTM, and a Transformer language model (LM). The acoustic model was trained on nearly 4,000 hours of speech data from different sources, using the MLLP's transLectures-UPV toolkit (TLK) and TensorFlow; whilst LMs were trained using SRILM (n-gram), CUED-RNNLM (LSTM), and Fairseq (Transformer), with up to 102G tokens. This system achieved 11.6% and 16.0% WER on the test-2018 and test-2020 sets, respectively. As it is streaming-enabled, it could be put into production environments for automatic captioning of live media streams, with a theoretical delay of 1.5 seconds.Along with the primary system, we also submitted three contrastive systems. From these, we highlight the system c2streaming 600ms t that, following the same configuration of the primary one, but using a smaller context window of 0.6 seconds and a Transformer LM, scored 12.3% and 16.9% WER points respectively on the same test sets, with a measured empirical latency of 0.81±0.09 seconds (mean±stdev). This is, we obtained state-of-the-art latencies for high-quality automatic live captioning with a small WER degradation of 6% relative.

show abstract

Section: Corpusmentioning

confidence: 99%

MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge

Jorge¹,

Giménez²,

Baquero-Arnal³

et al. 2021

IberSPEECH 2021

Self Cite

View full text Add to dashboard Cite

show abstract

“…Afterwards, the mean is dynamically updated for every new frame. In previous works, we proved that two seconds of initial delay should be enough to achieve similar performance to FSN [27], [28]. Although, two seconds of delay could be reasonable in a continuous streaming setup, it could be not so suitable for short utterances such as voice commands.…”

Section: Acoustic Feature Normalization For Streamingmentioning

confidence: 90%

“…Moreover, LSTM-RNN LM probabilities were efficiently computed using Variance Regularization and lazy evaluation. Later on, in [27], this architecture for real-time one-pass decoding was extended to include BLSTM acoustic models within a time sliding window, also used as a window for timeconstrained, on-the-fly acoustic feature normalization. Not surprisingly, empirical assessment of this extended architecture under strict streaming conditions proved it was really effective, indeed keeping the pace with non-streaming (offline) systems.…”

Section: Introductionmentioning

confidence: 99%

Live Streaming Speech Recognition Using Deep Bidirectional LSTM Acoustic Models and Interpolated Language Models

Jorge

Giménez

Silvestre-Cerdà

et al. 2022

IEEE/ACM Trans. Audio Speech Lang. Process.

Self Cite

View full text Add to dashboard Cite

Although Long-Short Term Memory (LSTM) networks and deep Transformers are now extensively used in offline ASR, it is unclear how best offline systems can be adapted to work with them under the streaming setup. After gaining considerable experience on this regard in recent years, in this paper we show how an optimized, low-latency streaming decoder can be built in which bidirectional LSTM acoustic models, together with general interpolated language models, can be nicely integrated with minimal perfomance degradation. In brief, our streaming decoder consists of a one-pass, real-time search engine relying on a limited-duration window sliding over time and a number of ad hoc acoustic and language model pruning techniques. Extensive empirical assessment is provided on truly streaming tasks derived from the well-known LibriSpeech and TED talks datasets, as well as from TV shows on a main Spanish broadcasting station.

show abstract

“…Neural language models (LMs), including long short-term memory (LSTM) and Transformer based ones, have significantly improved performance over n-gram LMs in automatic speech recognition (ASR) [1,2,3,4,5,6,7]. Since it is challenging for one-pass decoding with a neural LM to obtain competitive performance with lower latency than a two-pass approach [8,9,10,11,12], a widely adopted way is still to use neural LMs to rescore N -best hypotheses (alternative word-sequences) or lattices that are decoded with an n-gram LM [13,14,15,10,16,17,18,19]. A lattice is a compact representation of the hypothesis space for an utterance.…”

Section: Introductionmentioning

confidence: 99%

A Parallelizable Lattice Rescoring Strategy with Neural Language Models

Povey

Khudanpur

2021

Preprint

View full text Add to dashboard Cite

This paper proposes a parallel computation strategy and a posteriorbased lattice expansion algorithm for efficient lattice rescoring with neural language models (LMs) for automatic speech recognition. First, lattices from first-pass decoding are expanded by the proposed posterior-based lattice expansion algorithm. Second, each expanded lattice is converted into a minimal list of hypotheses that covers every arc. Each hypothesis is constrained to be the best path for at least one arc it includes. For each lattice, the neural LM scores of the minimal list are computed in parallel and are then integrated back to the lattice in the rescoring stage. Experiments on the Switchboard dataset show that the proposed rescoring strategy obtains comparable recognition performance and generates more compact lattices than a competitive baseline method. Furthermore, the parallel rescoring method offers more flexibility by simplifying the integration of PyTorch-trained neural LMs for lattice rescoring with Kaldi.

show abstract

LSTM-Based One-Pass Decoder for Low-Latency Streaming

Cited by 13 publications

References 16 publications

MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge

MLLP-VRAIN Spanish ASR Systems for the Albayzin-RTVE 2020 Speech-To-Text Challenge

Live Streaming Speech Recognition Using Deep Bidirectional LSTM Acoustic Models and Interpolated Language Models

A Parallelizable Lattice Rescoring Strategy with Neural Language Models

Contact Info

Product

Resources

About