2019
DOI: 10.48550/arxiv.1910.10352
Preprint

A Transformer with Interleaved Self-attention and Convolution for Hybrid Acoustic Models

Abstract: The Transformer with self-attention has achieved great success in natural language processing. Recently, there have been a few studies on Transformers for end-to-end speech recognition, while their application to hybrid acoustic models remains very limited. In this paper, we revisit the Transformer-based hybrid acoustic model and propose a model structure with interleaved self-attention and 1D convolution, which is shown to converge faster and achieve higher recognition accuracy. We also study several a…
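
As a rough illustration of the interleaved structure described in the abstract, here is a minimal PyTorch sketch; the layer sizes, kernel width, and residual/normalization placement are assumptions for illustration, not the authors' exact configuration.

import torch
import torch.nn as nn

class InterleavedBlock(nn.Module):
    """One block pairing a self-attention sublayer with a 1D convolution
    sublayer, each wrapped in a residual connection and layer norm."""
    def __init__(self, d_model=512, n_heads=8, kernel_size=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                  # x: (batch, time, d_model)
        a, _ = self.attn(x, x, x)          # self-attention sublayer
        x = self.norm1(x + a)
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)  # convolve over time
        return self.norm2(x + c)

x = torch.randn(4, 100, 512)               # 4 utterances, 100 frames each
y = InterleavedBlock()(x)                   # shape preserved: (4, 100, 512)

A hybrid acoustic model would stack several such blocks over frame-level features and emit per-frame senone posteriors.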

Cited by: 3 publications (3 citation statements)
References: 22 publications
“…With the interleaved convolution, the accuracy loss is much smaller. This may be because the interleaved convolution layers can compensate for the reordering effect of the self-attention operations and maintain the monotonicity of the input sequence [25]. When the…”
Section: Results of Streaming Transformers
Citation type: mentioning
Confidence: 99%
“…The self-attention operation cannot maintain the monotonicity of the input sequence, which is particularly harmful for a time-synchronous acoustic model such as the hybrid model studied in this paper. The positional encoding approach in [2] has been shown to be less effective for the speech recognition problem [12, 25], whereas convolutional layers are more powerful at encoding positional information. In Table 1, we compare the two schemes of using convolution layers in Transformers in the offline condition, namely the interleaved 1D convolution with self-attention from our previous study [25], and using the VGG net [26] as the input encoding layer.…”
Section: Convolution Layers and Attention Heads
Citation type: mentioning
Confidence: 99%
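
For contrast with the interleaved scheme sketched earlier, here is a minimal sketch of the second scheme mentioned above: a VGG-style 2D-convolutional front end used as the input encoding layer before a plain self-attention stack. The channel counts, pooling factor, and output projection are illustrative assumptions, not the exact configuration from [26] or the paper.

import torch
import torch.nn as nn

class VGGFrontEnd(nn.Module):
    """Small VGG-style 2D-conv encoder over (time, frequency) that injects
    local positional structure before the Transformer layers."""
    def __init__(self, d_model=512, feat_dim=80):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                # halve the time and frequency axes
        )
        self.proj = nn.Linear(32 * (feat_dim // 2), d_model)

    def forward(self, x):                   # x: (batch, time, feat_dim)
        x = self.conv(x.unsqueeze(1))       # (batch, 32, time/2, feat_dim/2)
        b, c, t, f = x.shape
        return self.proj(x.permute(0, 2, 1, 3).reshape(b, t, c * f))

feats = torch.randn(4, 100, 80)             # 80-dim filterbank frames
enc = VGGFrontEnd()(feats)                   # (4, 50, 512), time downsampled 2x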