Interspeech 2021
DOI: 10.21437/interspeech.2021-1953

Multi-Mode Transformer Transducer with Stochastic Future Context

Abstract: Automatic speech recognition (ASR) models make fewer errors when more surrounding speech is available as context. Unfortunately, acquiring a larger future context leads to higher latency, so there is an inevitable trade-off between speed and accuracy. Naïvely, meeting different latency requirements means storing multiple models and picking the best one under the given constraints. A more desirable approach is a single model that can dynamically adjust its latency based on different …
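
The single-model, multi-latency idea in the abstract is to train under a randomly sampled amount of right (future) context, so the same weights can later run at any of those latency settings. A minimal PyTorch sketch follows; it is an illustration under assumed details, not the paper's implementation. The context set, the mask construction, and the placeholder names (encoder, asr_loss, loader) are all assumptions.

```python
import torch

# Candidate right-context sizes (in frames); None means full, non-streaming
# context. The actual set used in the paper may differ -- an assumption here.
FUTURE_CONTEXTS = [0, 4, 16, None]

def sample_future_context():
    """Uniformly sample one right-context size for the current training step."""
    idx = torch.randint(len(FUTURE_CONTEXTS), (1,)).item()
    return FUTURE_CONTEXTS[idx]

def attention_mask(seq_len, right_context):
    """Boolean self-attention mask (True = may attend): every frame sees the
    full past plus at most right_context future frames."""
    if right_context is None:
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    pos = torch.arange(seq_len)
    # Query frame i may attend to key frame j whenever j <= i + right_context.
    return pos.unsqueeze(1) + right_context >= pos.unsqueeze(0)

# Training-loop sketch: one latency mode per step, all modes share weights.
# encoder, asr_loss, and loader are placeholders, not the paper's API.
# for batch in loader:
#     rc = sample_future_context()
#     mask = attention_mask(batch.num_frames, rc)
#     logits = encoder(batch.features, attn_mask=mask)
#     asr_loss(logits, batch.targets).backward()
```

At inference time, right_context is simply fixed to whatever the deployment's latency budget allows, so a single checkpoint serves every mode.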

Cited by 5 publications (2 citation statements)
References 23 publications (37 reference statements)

“…In this manner, different AED models, varying from causal attention to full attention, are jointly trained with shared weights. The Multi-Mode Transformer Transducer [18] proposes to consider stochastic future contexts during training, so that the trained model is capable of serving various latency-budget scenarios without significant accuracy deterioration.…”
Section: Unifying Streaming and Non-streaming ASR (citation type: mentioning, confidence: 99%)
“…The simulation module is jointly trained with the ASR model using a self-supervised loss; the ASR model is optimized with the usual ASR loss, e.g., CTC-CRF [15] as used in our experiments. The CUSIDE framework can be further enhanced with several recently developed techniques for streaming ASR, such as weight sharing and joint training of a streaming model with a full-context model [16], chunk size jitter [9,17] and stochastic future context [18] in training. Experiments demonstrate that, compared to using real future frames as right context, using simulated future context can drastically reduce latency while maintaining recognition accuracy.…”
Section: Introduction (citation type: mentioning, confidence: 99%)
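
The simulated-future-context idea quoted above, where a small module predicts future frames instead of waiting for them and is trained jointly with the ASR model via a self-supervised loss, might be sketched as follows. Every name here (FutureSimulator, asr_loss, real_future) is an illustrative assumption, not CUSIDE's actual code.

```python
import torch
import torch.nn as nn

class FutureSimulator(nn.Module):
    """Illustrative module predicting n_future feature frames from a chunk.
    A sketch of the simulated-future-context idea, not CUSIDE's implementation."""
    def __init__(self, feat_dim, n_future, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_future * feat_dim)
        self.n_future = n_future
        self.feat_dim = feat_dim

    def forward(self, chunk):            # chunk: (batch, time, feat_dim)
        _, h = self.rnn(chunk)           # final hidden state: (1, batch, hidden)
        pred = self.proj(h[-1])          # (batch, n_future * feat_dim)
        return pred.view(-1, self.n_future, self.feat_dim)

# Joint-training sketch: the simulator gets a self-supervised regression loss
# against the real future frames (available during training), while the ASR
# model keeps its usual loss. asr_loss and real_future are placeholders.
# sim = FutureSimulator(feat_dim=80, n_future=4)
# fake_future = sim(chunk)              # used as right context at decode time
# ssl_loss = nn.functional.l1_loss(fake_future, real_future)
# total_loss = asr_loss + ssl_loss
```

At decoding time the real future frames never need to arrive, which is how this design trades a learned approximation of the right context for lower latency.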