2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2018.8462497
A Time-Restricted Self-Attention Layer for ASR

Cited by 148 publications (120 citation statements) | References 5 publications
“…e,2 ∈ R^{d_model} are trainable weight matrices and bias vectors. In order to control the latency of the encoder architecture, the future context of the input sequence X_0 is limited to a fixed size, which is referred to as restricted or time-restricted self-attention [16] and was first applied to hybrid HMM-based ASR systems [19]. We can define a time-restricted self-attention encoder ENC_SA^tr, with n = 1, …”
Section: Encoder: Time-restricted Self-attention
confidence: 99%
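The restriction described in the excerpt above amounts to masking the attention scores outside a fixed window around each frame. Below is a minimal sketch (not the cited papers' implementation) of single-head dot-product self-attention with a capped left and right context; the window sizes `left_context` and `right_context`, the weight matrices, and the input dimensions are all illustrative.

```python
import numpy as np

def time_restricted_self_attention(x, wq, wk, wv, left_context=15, right_context=6):
    """x: (T, d) input frames; wq/wk/wv: (d, d) projection matrices."""
    T, d = x.shape
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(d)                   # (T, T) attention logits
    # Mask out keys outside the window [t - left_context, t + right_context].
    t = np.arange(T)
    offset = t[None, :] - t[:, None]                # key index minus query index
    mask = (offset < -left_context) | (offset > right_context)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the local window only
    return weights @ v                              # (T, d) context vectors

# Example: 50 frames of 64-dim features; the limited future context bounds latency.
rng = np.random.default_rng(0)
x = rng.standard_normal((50, 64))
wq, wk, wv = (rng.standard_normal((64, 64)) * 0.1 for _ in range(3))
print(time_restricted_self_attention(x, wq, wk, wv).shape)  # (50, 64)
```

Capping `right_context` is what keeps the encoder latency bounded: frame t never attends beyond frame t + right_context.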
“…We propose the DFSMN-SAN model, in which the multi-head self-attention layer (red block in Fig. 1) is combined with the DFSMN model. Similar to the combination of TDNN and SAN in [2], we argue that the combination of DFSMN and SAN can achieve a better trade-off between modeling efficiency and capturing long-term relative dependencies. Two types of the combination are empirically evaluated.…”
Section: DFSMN-SAN
confidence: 81%
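As a rough illustration of the interleaving described in the excerpt, the sketch below stacks feedforward layers with an FSMN-style memory block over a fixed window of past and future frames and inserts one plain self-attention layer in the middle of the stack. The layer count, window sizes, uniform memory coefficients, and insertion point are assumptions for illustration, not the DFSMN-SAN configuration of the cited paper.

```python
import numpy as np

def memory_block(h, lookback=10, lookahead=2):
    """FSMN-style memory: add a summary of neighbouring frames to each frame.
    Real DFSMN layers use learnable filter coefficients; a uniform average is
    used here purely for illustration."""
    T, _ = h.shape
    padded = np.pad(h, ((lookback, lookahead), (0, 0)))
    ctx = np.stack([padded[i:i + T] for i in range(lookback + lookahead + 1)])
    return h + ctx.mean(axis=0)                     # skip connection around the memory

def self_attention(h):
    """Plain (unrestricted) dot-product self-attention used as the SAN layer."""
    scores = h @ h.T / np.sqrt(h.shape[1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ h

def dfsmn_san(x, num_layers=6, san_position=3):
    h = x
    for layer in range(num_layers):
        h = memory_block(np.maximum(h, 0.0))        # ReLU feedforward + memory block
        if layer == san_position:
            h = self_attention(h)                   # interleave one SAN layer
    return h

print(dfsmn_san(np.random.default_rng(1).standard_normal((40, 32))).shape)  # (40, 32)
```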
“…The two key ingredients are the sinusoidal positional encoding and the self-attention mechanism, which makes the model context-aware over the input word embeddings. Recently, transformer models and their variants have also been actively investigated for speech recognition [2,3,4,5]. To work well for ASR modeling, the transformer architecture needs some revisions.…”
Section: Introduction
confidence: 99%
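For reference, the sinusoidal positional encoding mentioned in the excerpt follows the standard Transformer formulation (sine on even dimensions, cosine on odd dimensions); the sketch below computes it with illustrative `max_len` and `d_model` values.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                  # (max_len, 1) positions
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2) dimension indices
    angles = pos / np.power(10000.0, 2 * i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

# Added to frame-level features (or word embeddings) before the first layer.
print(sinusoidal_positional_encoding(max_len=100, d_model=64).shape)  # (100, 64)
```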
“…In this section, we describe one of the key components in the Transformer architecture, the multi-head self-attention [15], and the time-restricted modification [22] for its application in the masking network of the frontend. Transformers employ dot-product self-attention to map a variable-length input sequence to another sequence of the same length, which distinguishes them from RNNs.…”
Section: Transformer With Time-restricted Self-attention
confidence: 99%
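Below is a minimal sketch of the multi-head dot-product self-attention described in the excerpt, mapping a length-T input sequence to another length-T sequence. The head count and the random projection matrices are illustrative, and layer normalization, dropout, and the final output projection are omitted for brevity.

```python
import numpy as np

def multi_head_self_attention(x, num_heads=4, seed=2):
    """x: (T, d) sequence -> (T, d) sequence of the same length."""
    rng = np.random.default_rng(seed)
    T, d = x.shape
    d_head = d // num_heads
    heads = []
    for _ in range(num_heads):
        wq, wk, wv = (rng.standard_normal((d, d_head)) * 0.1 for _ in range(3))
        q, k, v = x @ wq, x @ wk, x @ wv
        scores = q @ k.T / np.sqrt(d_head)              # (T, T) scaled dot products
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        heads.append((w / w.sum(axis=-1, keepdims=True)) @ v)
    return np.concatenate(heads, axis=-1)               # concatenate heads back to (T, d)

print(multi_head_self_attention(np.random.default_rng(3).standard_normal((20, 32))).shape)  # (20, 32)
```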
“…For tasks like speech separation and enhancement, the technique of subsampling is not as practical as it is in speech recognition. Inspired by [21,22], we adjust the self-attention of the Transformers in the masking network so that it is performed on a local segment of the speech, because those frames have a higher correlation. This time-restricted self-attention for the query at time step t is formalized as:…”
Section: Transformer With Time-restricted Self-attention
confidence: 99%
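The formalization referred to at the end of the excerpt is not reproduced in the citation statement. A generic form of dot-product attention restricted to the local segment [t - L, t + R] around the query at time step t, with L and R as hypothetical left and right context sizes, would be:

```latex
% Hedged sketch, not the cited paper's exact equation.
\[
  \mathrm{Attention}(q_t, K, V)
  = \sum_{\tau = t-L}^{t+R}
    \frac{\exp\!\left(q_t^{\top} k_\tau / \sqrt{d_k}\right)}
         {\sum_{\tau' = t-L}^{t+R} \exp\!\left(q_t^{\top} k_{\tau'} / \sqrt{d_k}\right)}
    \, v_\tau
\]
```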