2019
DOI: 10.48550/arxiv.1910.10352
Preprint

A Transformer with Interleaved Self-attention and Convolution for Hybrid Acoustic Models

Abstract: The Transformer with self-attention has achieved great success in natural language processing. Recently, there have been a few studies on Transformers for end-to-end speech recognition, while their application to hybrid acoustic models remains very limited. In this paper, we revisit the Transformer-based hybrid acoustic model and propose a model structure with interleaved self-attention and 1D convolution, which is shown to converge faster and achieve higher recognition accuracy. We also study several a…
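
As a rough illustration of the interleaved structure described in the abstract, here is a minimal PyTorch sketch; the layer sizes, kernel width, and residual/normalization placement are assumptions for illustration, not the authors' exact configuration.

import torch
import torch.nn as nn

class InterleavedBlock(nn.Module):
    """One block pairing a self-attention sublayer with a 1D convolution
    sublayer, each wrapped in a residual connection and layer norm."""
    def __init__(self, d_model=512, n_heads=8, kernel_size=3):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                  # x: (batch, time, d_model)
        a, _ = self.attn(x, x, x)          # self-attention sublayer
        x = self.norm1(x + a)
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)  # convolve over time
        return self.norm2(x + c)

x = torch.randn(4, 100, 512)               # 4 utterances, 100 frames each
y = InterleavedBlock()(x)                   # shape preserved: (4, 100, 512)

A hybrid acoustic model would stack several such blocks over frame-level features and emit per-frame senone posteriors.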

Cited by: 3 publications (3 citation statements)
References: 22 publications
“…With the interleaved convolution, the accuracy loss is much smaller. This may be because the interleaved convolution layers can compensate for the reordering effect of the self-attention operations and maintain the monotonicity of the input sequence [25]. When the…”
Section: Results of Streaming Transformers
Citation type: mentioning
Confidence: 99%
“…The self-attention operation cannot maintain the monotonicity of the input sequence, which is particularly harmful for a time-synchronous acoustic model such as the hybrid model studied in this paper. The positional encoding approach in [2] has been shown to be less effective for the speech recognition problem [12, 25], whereas convolutional layers are more powerful at encoding positional information. In Table 1, we compare the two schemes of using convolution layers in Transformers in the offline condition, namely the interleaved 1D convolution with self-attention from our previous study [25], and using the VGG net [26] as the input encoding layer.…”
Section: Convolution Layers and Attention Heads
Citation type: mentioning
Confidence: 99%
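
For contrast with the interleaved scheme sketched earlier, here is a minimal sketch of the second scheme mentioned above: a VGG-style 2D-convolutional front end used as the input encoding layer before a plain self-attention stack. The channel counts, pooling factor, and output projection are illustrative assumptions, not the exact configuration from [26] or the paper.

import torch
import torch.nn as nn

class VGGFrontEnd(nn.Module):
    """Small VGG-style 2D-conv encoder over (time, frequency) that injects
    local positional structure before the Transformer layers."""
    def __init__(self, d_model=512, feat_dim=80):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                # halve the time and frequency axes
        )
        self.proj = nn.Linear(32 * (feat_dim // 2), d_model)

    def forward(self, x):                   # x: (batch, time, feat_dim)
        x = self.conv(x.unsqueeze(1))       # (batch, 32, time/2, feat_dim/2)
        b, c, t, f = x.shape
        return self.proj(x.permute(0, 2, 1, 3).reshape(b, t, c * f))

feats = torch.randn(4, 100, 80)             # 80-dim filterbank frames
enc = VGGFrontEnd()(feats)                   # (4, 50, 512), time downsampled 2x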