ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054267

LSTM-Based One-Pass Decoder for Low-Latency Streaming

Cited by 13 publications (12 citation statements)
References 16 publications
“…Corpus statistics:

    Corpus              S(K)    RW(K)    V(K)
    Opensubtitles [22]  212635  1146861  1576
    UFAL [23]           92873   910728   2179
    Wikipedia [24]      32686   586068   3373
    UN [25]             11196

For streaming ASR, as the full sequence (context) is not available during decoding, BLSTM AMs are queried with a sliding, overlapping context window of limited size over the input sequence, averaging the outputs of all windows for each frame to obtain the corresponding acoustic score [30]. The size of the context window (in frames or seconds) is set at decoding time and defines the theoretical latency of the system.…”
Section: Corpus
confidence: 99%
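The sliding, overlapping window scheme described in the excerpt above can be sketched as follows. This is a minimal sketch, not the paper's implementation: `score_fn`, `win`, and `hop` are illustrative names standing in for the BLSTM acoustic model and its window/step sizes.

```python
import numpy as np

def windowed_scores(frames, score_fn, win, hop):
    """Query an acoustic scorer with a sliding, overlapping window over the
    input frames and average the per-frame outputs of all windows.
    frames: (T, D) array; score_fn maps an (n, D) chunk to (n, K) scores.
    win and hop are in frames; win bounds the theoretical latency."""
    T = frames.shape[0]
    acc, counts = None, np.zeros(T)
    for start in range(0, T, hop):
        end = min(start + win, T)
        out = score_fn(frames[start:end])   # scores for this window
        if acc is None:
            acc = np.zeros((T, out.shape[1]))
        acc[start:end] += out               # accumulate overlapping outputs
        counts[start:end] += 1
        if end == T:                        # final window reached
            break
    return acc / counts[:, None]            # per-frame average over windows
```

Frames covered by several windows are averaged; frames near the edges are covered by fewer windows, which the per-frame count handles.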
“…Afterwards, the mean is dynamically updated for every new frame. In previous works, we showed that two seconds of initial delay is enough to achieve performance similar to FSN [27], [28]. Although two seconds of delay may be reasonable in a continuous streaming setup, it may not be suitable for short utterances such as voice commands.…”
Section: Acoustic Feature Normalization for Streaming
confidence: 90%
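The dynamic mean update described above can be sketched as follows. This is a sketch under assumptions: `init_delay` and the incremental-mean form are illustrative, not the cited papers' exact scheme.

```python
import numpy as np

def stream_cmn(frames, init_delay):
    """On-the-fly cepstral mean normalization: seed the mean estimate over an
    initial delay window, then update it incrementally for every new frame.
    frames: (T, D) feature array; init_delay: number of frames used to seed."""
    mean = frames[:init_delay].mean(axis=0)     # estimate over initial delay
    n = init_delay
    out = np.empty_like(frames, dtype=float)
    for t in range(frames.shape[0]):
        if t >= init_delay:                     # dynamic update per new frame
            n += 1
            mean = mean + (frames[t] - mean) / n
        out[t] = frames[t] - mean               # normalize with current mean
    return out
```

A longer delay gives a better initial estimate (closer to full-sequence normalization) at the cost of latency, which is the trade-off the excerpt points out for short utterances.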
“…Moreover, LSTM-RNN LM probabilities were efficiently computed using Variance Regularization and lazy evaluation. Later on, in [27], this architecture for real-time one-pass decoding was extended to include BLSTM acoustic models within a sliding time window, also used as a window for time-constrained, on-the-fly acoustic feature normalization. Empirical assessment of this extended architecture under strict streaming conditions proved it highly effective, keeping pace with non-streaming (offline) systems.…”
Section: Introduction
confidence: 99%
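The Variance Regularization trick mentioned in the excerpt can be sketched as follows (a sketch; names are illustrative). When training keeps the variance of the softmax log-normalizer small, i.e. roughly constant, a word's LM log-probability can be approximated from its output-layer logit alone, skipping the sum over the vocabulary at decode time:

```python
import numpy as np

def lm_score_vr(hidden, emb_row, bias, log_z):
    """Score one word from its output-layer row only; under variance
    regularization log_z is (approximately) a shared constant, so the
    full softmax over the vocabulary is never computed during decoding."""
    return float(hidden @ emb_row + bias - log_z)

def exact_log_softmax(hidden, E, b):
    """Reference: exact log-softmax over the whole vocabulary."""
    logits = E @ hidden + b
    return logits - np.log(np.sum(np.exp(logits)))
```

With the true log-normalizer plugged in as `log_z`, the shortcut matches the exact log-softmax; variance regularization makes a fixed `log_z` a good approximation.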
“…Neural language models (LMs), including long short-term memory (LSTM) and Transformer based ones, have significantly improved performance over n-gram LMs in automatic speech recognition (ASR) [1,2,3,4,5,6,7]. Since it is challenging for one-pass decoding with a neural LM to obtain competitive performance at lower latency than a two-pass approach [8,9,10,11,12], common practice is still to use neural LMs to rescore N-best hypotheses (alternative word sequences) or lattices decoded with an n-gram LM [13,14,15,10,16,17,18,19]. A lattice is a compact representation of the hypothesis space for an utterance.…”
Section: Introduction
confidence: 99%
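The two-pass N-best rescoring setup described above can be sketched as follows. A minimal sketch, assuming log-linear interpolation of the first-pass (n-gram) score with a neural LM score; the weight `lam` and the callable `neural_lm_logp` are illustrative, not from the cited work.

```python
def rescore_nbest(nbest, neural_lm_logp, lam=0.5):
    """Rerank N-best hypotheses from a first pass (n-gram LM) by
    interpolating with a neural LM log-score.
    nbest: list of (hypothesis, first_pass_logscore) pairs;
    neural_lm_logp: callable returning a log-probability per hypothesis."""
    best_hyp, best_score = None, float("-inf")
    for hyp, first_pass in nbest:
        score = (1.0 - lam) * first_pass + lam * neural_lm_logp(hyp)
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp
```

With `lam=0.0` this reduces to the first-pass ranking; increasing `lam` lets the neural LM overturn first-pass decisions, which is the point of the second pass.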