ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413535
Developing Real-Time Streaming Transformer Transducer for Speech Recognition on Large-Scale Dataset

Abstract: Recently, Transformer-based end-to-end models have achieved great success in many areas, including speech recognition. However, compared to LSTM models, the heavy computational cost of the Transformer during inference is a key issue preventing their application. In this work, we explored the potential of Transformer Transducer (T-T) models for first-pass decoding with low latency and fast speed on a large-scale dataset. We combine the idea of Transformer-XL and chunk-wise streaming processing to design a st…
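The chunk-wise streaming processing mentioned in the abstract can be illustrated with a small sketch. This is not the authors' code; the chunk size, history length, and the boolean-mask formulation are illustrative assumptions in the spirit of Transformer-XL-style cached history.

```python
# Minimal sketch (not the authors' code) of a chunk-wise streaming attention
# mask: each query frame may attend to every frame in its own chunk (the
# look-ahead) plus a bounded window of previous chunks (Transformer-XL-style
# cached history). chunk_size and history_chunks are illustrative choices.
import torch

def chunkwise_mask(num_frames: int, chunk_size: int, history_chunks: int) -> torch.Tensor:
    """mask[q, k] is True where query frame q may attend to key frame k."""
    idx = torch.arange(num_frames)
    q_chunk = idx.unsqueeze(1) // chunk_size  # chunk index of each query frame
    k_chunk = idx.unsqueeze(0) // chunk_size  # chunk index of each key frame
    return (k_chunk <= q_chunk) & (k_chunk >= q_chunk - history_chunks)

mask = chunkwise_mask(num_frames=8, chunk_size=2, history_chunks=1)
```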

Cited by 102 publications (41 citation statements) | References 35 publications
“…For the streaming ASR model, we used a TT with a chunkwise look-ahead proposed in [39]. The encoder consists of 2 convolution layers, each of which halves the time resolution, followed by an 18-layer or 36-layer transformer with relative positional encoding.…”
Section: Experimental Settings
confidence: 99%
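As a rough sketch of the encoder shape this statement describes, the following assumes PyTorch and a generic model dimension, and omits relative positional encoding (which nn.TransformerEncoderLayer does not provide); it is illustrative, not the cited papers' implementation.

```python
# Rough sketch, under assumptions, of the encoder shape described above: two
# strided convolutions that each halve the time resolution (4x in total),
# followed by a deep transformer stack. nn.TransformerEncoderLayer does not
# implement relative positional encoding, so that detail is omitted here.
import torch.nn as nn

class ConvSubsampledEncoder(nn.Module):
    def __init__(self, feat_dim: int = 80, d_model: int = 512, num_layers: int = 18):
        super().__init__()
        # Each stride-2 Conv1d halves the number of time frames.
        self.subsample = nn.Sequential(
            nn.Conv1d(feat_dim, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, dim_feedforward=2048, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, feats, attn_mask=None):
        # feats: (batch, time, feat_dim) log-mel filterbank features
        x = self.subsample(feats.transpose(1, 2)).transpose(1, 2)  # (batch, time/4, d_model)
        return self.transformer(x, mask=attn_mask)
```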
“…The audio input feature is an 80-dim log mel-filterbank extracted every 10 msec. As proposed in [39], we can control the algorithmic latency of the TT. We used an AdamW optimizer with a linear decay learning rate schedule with a peak learning rate of 1.5e-3 after 25K warm-up iterations.…”
Section: Experimental Settings
confidence: 99%
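The optimizer setup quoted above maps naturally onto a LambdaLR schedule. In the sketch below, the peak learning rate (1.5e-3) and warm-up length (25K iterations) come from the quote; the total step count and the placeholder model are assumptions.

```python
# Sketch of the quoted optimizer setup: AdamW with a linear warm-up to a peak
# learning rate of 1.5e-3 over 25K iterations, followed by linear decay.
# The total step count and the placeholder model are assumptions.
import torch

model = torch.nn.Linear(80, 512)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-3)

warmup_steps, total_steps = 25_000, 300_000  # total_steps is assumed

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / warmup_steps  # linear warm-up to the peak rate
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))  # linear decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```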
“…The streaming mask function m(t) generates limited time frames [t_s : t_e] ∈ [1 : T] to compute the attention at time frame t by using an attention mask. With an appropriate attention mask, TT can be executed with limited algorithmic latency as proposed in [33]. Finally, FF*_l(·) represents a position-wise feed-forward network at the l-th layer.…”
Section: Transformer Transducer
confidence: 99%
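A hedged sketch of the mask function m(t) described in this statement: for each frame t it returns an inclusive window [t_s : t_e] consisting of the current chunk (providing the chunk-wise look-ahead) and a bounded number of previous chunks. The parameterization by chunk_size and left_chunks is an assumption.

```python
# Hedged sketch of the streaming mask function m(t): returns the inclusive
# window [t_s : t_e] of key frames that query frame t (1-indexed) may attend
# to. The window spans the current chunk (chunk-wise look-ahead) plus
# `left_chunks` previous chunks; this parameterization is an assumption.
def m(t: int, chunk_size: int, left_chunks: int, num_frames: int) -> tuple:
    chunk = (t - 1) // chunk_size
    t_s = max(1, (chunk - left_chunks) * chunk_size + 1)  # bounded history
    t_e = min(num_frames, (chunk + 1) * chunk_size)       # look-ahead to chunk end
    return t_s, t_e
```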
“…For the ASR block, we used an 18-layer or 36-layer TT (TT-18 or TT-36 for short) with the chunk-wise look-ahead proposed in [33], using exactly the same configuration as in [29]. Each transformer block consisted of a 512-dim MHA with 8 heads and a 2048-dim point-wise feed-forward layer.…”
Section: Experimental Settings
confidence: 99%
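The per-block dimensions quoted here (512-dim MHA, 8 heads, 2048-dim feed-forward) can be written out as a single block; the pre-norm ordering and ReLU activation below are assumptions, not details from the cited paper.

```python
# Sketch of one transformer block with the quoted dimensions: 512-dim
# multi-head attention with 8 heads and a 2048-dim position-wise feed-forward
# layer. Pre-norm ordering and ReLU are assumptions, not quoted details.
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model: int = 512, nhead: int = 8, d_ff: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x, attn_mask=None):
        h = self.norm1(x)
        h, _ = self.attn(h, h, h, attn_mask=attn_mask)  # self-attention
        x = x + h                                       # residual connection
        return x + self.ff(self.norm2(x))               # position-wise FF + residual
```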
“…We employ the Transformer Transducer model [20,21,22] as the backbone model structure. In the pre-training stage, we conduct the multi-task learning proposed in our previous work UniSpeech [16], where the transducer loss and contrastive loss are combined.…”
Section: Introduction
confidence: 99%
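A sketch of the multi-task objective this statement describes: a transducer loss combined with a contrastive loss. The use of torchaudio.functional.rnnt_loss, the zero stand-in contrastive term, and the mixing weight alpha are all illustrative assumptions; UniSpeech [16] defines the actual combination.

```python
# Hedged sketch of a transducer + contrastive multi-task loss. The tensor
# shapes, torchaudio's rnnt_loss, the zero stand-in contrastive term, and the
# mixing weight alpha are illustrative assumptions, not UniSpeech's exact recipe.
import torch
import torchaudio

B, T, U, V = 2, 50, 10, 100           # batch, frames, target length, vocab (assumed)
logits = torch.randn(B, T, U + 1, V)  # joiner outputs over the (frame, label) lattice
targets = torch.randint(1, V, (B, U), dtype=torch.int32)
logit_lengths = torch.full((B,), T, dtype=torch.int32)
target_lengths = torch.full((B,), U, dtype=torch.int32)

transducer_loss = torchaudio.functional.rnnt_loss(
    logits, targets, logit_lengths, target_lengths, blank=0)

contrastive_loss = torch.tensor(0.0)  # stand-in for a wav2vec-style contrastive term
alpha = 0.5                           # mixing weight: an assumption
total_loss = transducer_loss + alpha * contrastive_loss
```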