2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8462497
A Time-Restricted Self-Attention Layer for ASR

Cited by 148 publications (120 citation statements) | References 5 publications
“…e,2 ∈ R^{d_model} are trainable weight matrices and bias vectors. In order to control the latency of the encoder architecture, the future context of the input sequence X_0 is limited to a fixed size, which is referred to as restricted or time-restricted self-attention [16] and was first applied to hybrid HMM-based ASR systems [19]. We can define a time-restricted self-attention encoder ENC_SA^tr, with n = 1, …”
Section: Encoder: Time-restricted Self-attention
confidence: 99%
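The restriction described in the excerpt above amounts to masking the attention scores outside a fixed window around each frame. Below is a minimal sketch (not the cited papers' implementation) of single-head dot-product self-attention with a capped left and right context; the window sizes `left_context` and `right_context`, the weight matrices, and the input dimensions are all illustrative.

```python
import numpy as np

def time_restricted_self_attention(x, wq, wk, wv, left_context=15, right_context=6):
    """x: (T, d) input frames; wq/wk/wv: (d, d) projection matrices."""
    T, d = x.shape
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(d)                   # (T, T) attention logits
    # Mask out keys outside the window [t - left_context, t + right_context].
    t = np.arange(T)
    offset = t[None, :] - t[:, None]                # key index minus query index
    mask = (offset < -left_context) | (offset > right_context)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the local window only
    return weights @ v                              # (T, d) context vectors

# Example: 50 frames of 64-dim features; the limited future context bounds latency.
rng = np.random.default_rng(0)
x = rng.standard_normal((50, 64))
wq, wk, wv = (rng.standard_normal((64, 64)) * 0.1 for _ in range(3))
print(time_restricted_self_attention(x, wq, wk, wv).shape)  # (50, 64)
```

Capping `right_context` is what keeps the encoder latency bounded: frame t never attends beyond frame t + right_context.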
“…We propose the DFSMN-SAN model, in which the multi-head self-attention layer (red block in Fig. 1) is combined with the DFSMN model. Similar to the combination of TDNN and SAN in [2], we argue that the combination of DFSMN and SAN can achieve a better trade-off between modeling efficiency and capturing long-term relative dependencies. Two types of the combination are empirically evaluated.…”
Section: DFSMN-SAN
confidence: 81%
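As a rough illustration of the interleaving described in the excerpt, the sketch below stacks feedforward layers with an FSMN-style memory block over a fixed window of past and future frames and inserts one plain self-attention layer in the middle of the stack. The layer count, window sizes, uniform memory coefficients, and insertion point are assumptions for illustration, not the DFSMN-SAN configuration of the cited paper.

```python
import numpy as np

def memory_block(h, lookback=10, lookahead=2):
    """FSMN-style memory: add a summary of neighbouring frames to each frame.
    Real DFSMN layers use learnable filter coefficients; a uniform average is
    used here purely for illustration."""
    T, _ = h.shape
    padded = np.pad(h, ((lookback, lookahead), (0, 0)))
    ctx = np.stack([padded[i:i + T] for i in range(lookback + lookahead + 1)])
    return h + ctx.mean(axis=0)                     # skip connection around the memory

def self_attention(h):
    """Plain (unrestricted) dot-product self-attention used as the SAN layer."""
    scores = h @ h.T / np.sqrt(h.shape[1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ h

def dfsmn_san(x, num_layers=6, san_position=3):
    h = x
    for layer in range(num_layers):
        h = memory_block(np.maximum(h, 0.0))        # ReLU feedforward + memory block
        if layer == san_position:
            h = self_attention(h)                   # interleave one SAN layer
    return h

print(dfsmn_san(np.random.default_rng(1).standard_normal((40, 32))).shape)  # (40, 32)
```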
“…The two key ingredients are the sinusoidal positional encoding and the self-attention mechanism, which makes the model context-aware over the input word embeddings. Recently, transformer models and their variants have also been actively investigated for speech recognition [2,3,4,5]. To work well for ASR modeling, the transformer architecture needs some revisions.…”
Section: Introduction
confidence: 99%
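For reference, the sinusoidal positional encoding mentioned in the excerpt follows the standard Transformer formulation (sine on even dimensions, cosine on odd dimensions); the sketch below computes it with illustrative `max_len` and `d_model` values.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                  # (max_len, 1) positions
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2) dimension indices
    angles = pos / np.power(10000.0, 2 * i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

# Added to frame-level features (or word embeddings) before the first layer.
print(sinusoidal_positional_encoding(max_len=100, d_model=64).shape)  # (100, 64)
```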
“…In this section, we describe one of the key components in the Transformer architecture, the multi-head self-attention [15], and the time-restricted modification [22] for its application in the masking network of the frontend. Transformers employ dot-product self-attention to map a variable-length input sequence to another sequence of the same length, which distinguishes them from RNNs.…”
Section: Transformer With Time-restricted Self-attention
confidence: 99%
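Below is a minimal sketch of the multi-head dot-product self-attention described in the excerpt, mapping a length-T input sequence to another length-T sequence. The head count and the random projection matrices are illustrative, and layer normalization, dropout, and the final output projection are omitted for brevity.

```python
import numpy as np

def multi_head_self_attention(x, num_heads=4, seed=2):
    """x: (T, d) sequence -> (T, d) sequence of the same length."""
    rng = np.random.default_rng(seed)
    T, d = x.shape
    d_head = d // num_heads
    heads = []
    for _ in range(num_heads):
        wq, wk, wv = (rng.standard_normal((d, d_head)) * 0.1 for _ in range(3))
        q, k, v = x @ wq, x @ wk, x @ wv
        scores = q @ k.T / np.sqrt(d_head)              # (T, T) scaled dot products
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        heads.append((w / w.sum(axis=-1, keepdims=True)) @ v)
    return np.concatenate(heads, axis=-1)               # concatenate heads back to (T, d)

print(multi_head_self_attention(np.random.default_rng(3).standard_normal((20, 32))).shape)  # (20, 32)
```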
“…For tasks like speech separation and enhancement, the technique of subsampling is not as practical as it is in speech recognition. Inspired by [21,22], we adjust the self-attention of the Transformers in the masking network so that it is performed on a local segment of the speech, because those frames have a higher correlation. This time-restricted self-attention for the query at time step t is formalized as:…”
Section: Transformer With Time-restricted Self-attention
confidence: 99%
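The formalization referred to at the end of the excerpt is not reproduced in the citation statement. A generic form of dot-product attention restricted to the local segment [t - L, t + R] around the query at time step t, with L and R as hypothetical left and right context sizes, would be:

```latex
% Hedged sketch, not the cited paper's exact equation.
\[
  \mathrm{Attention}(q_t, K, V)
  = \sum_{\tau = t-L}^{t+R}
    \frac{\exp\!\left(q_t^{\top} k_\tau / \sqrt{d_k}\right)}
         {\sum_{\tau' = t-L}^{t+R} \exp\!\left(q_t^{\top} k_{\tau'} / \sqrt{d_k}\right)}
    \, v_\tau
\]
```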