Interspeech 2021
DOI: 10.21437/interspeech.2021-1286

Keyword Transformer: A Self-Attention Model for Keyword Spotting

Cited by 71 publications (23 citation statements)
References 0 publications
“…The key component in Transformers is the MHSA, which contains several attention mechanisms (heads) that can attend to different parts of the input in parallel. We base our explanation on the KWT, proposed in [2]. This model takes as input an MFCC spectrogram of $T$ non-overlapping patches, $X_{\mathrm{MFCC}} \in \mathbb{R}^{T \times F}$, with $t = 1, \ldots, T$ and $f = 1, \ldots, F$ corresponding to time windows and frequencies, respectively.…”
Section: The Keyword Transformer (mentioning)
confidence: 99%
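The MHSA this excerpt describes can be sketched compactly. Below is a minimal, hedged PyTorch illustration of multi-head self-attention over a sequence of $T$ embedded MFCC time patches; the module name, the example sizes (T = 98 patches, embedding dim 192, 3 heads), and the assumption that patches are already linearly embedded are illustrative choices, not the KWT authors' exact code.

```python
import torch
import torch.nn as nn

class MHSA(nn.Module):
    """Minimal multi-head self-attention over T MFCC time patches
    (a sketch of the mechanism described in the excerpt, not KWT's code)."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)   # joint Q, K, V projection
        self.proj = nn.Linear(dim, dim)      # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        # Split into Q, K, V and reshape to (B, heads, T, head_dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        # Each head attends to all T time patches in parallel
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.proj(out)

# Example: one utterance, T=98 time windows embedded to dim=192, 3 heads
x = torch.randn(1, 98, 192)
print(MHSA(dim=192, num_heads=3)(x).shape)  # torch.Size([1, 98, 192])
```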
“…Moreover, no special and expensive hardware has to be developed, as only comparisons are used in the algorithm. The evaluation is done on a pretrained Keyword Transformer (KWT) model [2] using the Google Speech Commands Dataset (GSCD) [35], with a focus on the accuracy-complexity trade-off. The results show that the number of computations can be reduced by 4.2× without losing any accuracy, and by 7.5× while sacrificing 1% of the baseline accuracy.…”
Section: Introduction (mentioning)
confidence: 99%
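The excerpt does not spell out the comparison-based algorithm, so the following is only a hedged sketch of the general idea it gestures at: restricting each query to its top-k highest-scoring keys, selected purely by comparisons, so that fewer softmax and weighted-sum computations are spent per query. The function name and the kept-key budget are assumptions for illustration, not the cited work's method.

```python
import torch

def topk_attention(q, k, v, keep: int):
    """Self-attention where each query attends only to its `keep`
    highest-scoring keys, selected with comparisons (torch.topk).
    Smaller `keep` -> fewer attention computations per query.
    Illustrative sketch; not the algorithm evaluated in the citing paper."""
    scores = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5  # (..., T, T)
    vals, idx = scores.topk(keep, dim=-1)                     # comparison-based selection
    sparse = torch.full_like(scores, float('-inf'))
    sparse.scatter_(-1, idx, vals)                            # mask all but the top-k scores
    return sparse.softmax(dim=-1) @ v                         # masked keys get zero weight

# Example: T=98 patches, head_dim=64, keeping 24 of 98 keys per query
q, k, v = (torch.randn(1, 98, 64) for _ in range(3))
print(topk_attention(q, k, v, keep=24).shape)  # torch.Size([1, 98, 64])
```

Sweeping `keep` over a range of budgets and measuring accuracy at each point is one simple way to trace an accuracy-complexity trade-off of the kind the excerpt reports.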
“…Transformers are also gaining predominance in the audio field. There are methods such as TERA (Liu et al., 2021a) and Conformer (Gulati et al., 2020) (convolution-augmented Transformers, used in speech recognition), and then ViT-like approaches such as the Keyword Transformer (KWT) (Berg et al., 2021) and the Audio Spectrogram Transformer (AST) (Gong et al., 2021a). In recent self-supervised audio representation learning methods, Transformer-based encoders have seen much use alongside convolutional or convolutional-recurrent encoders (Liu et al., 2022).…”
Section: Transformers (mentioning)
confidence: 99%
“…"yes", "up", "stop") and the task is to classify these in a 12 or 35 classes setting. The datasets comes pre-partitioned into 35 classes and in order to obtain the 12-classes version, the standard approach [9,20,71] is to keep 10 classes of interest (i.e. "yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"), place the remaining 25 under the "unknown" class and, introduce a new class "silence" where no spoken word appear is the audio clip.…”
Section: Detailed Experimental Setupmentioning
confidence: 99%
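The 12-class construction described above is mechanical enough to state as code. A minimal Python sketch follows; the function name and the class ordering (the 10 keywords first, then "unknown", then "silence") are assumptions, since the excerpt fixes only the class contents, not their indices.

```python
from typing import Optional

# The 10 target keywords kept as their own classes in the 12-class setting
KEYWORDS = ["yes", "no", "up", "down", "left", "right",
            "on", "off", "stop", "go"]
LABELS = KEYWORDS + ["unknown", "silence"]  # 12 classes total

def to_12_class(word: Optional[str]) -> int:
    """Map a raw 35-class word (None = clip with no speech) to an index in 0..11."""
    if word is None:
        return LABELS.index("silence")   # the added "silence" class
    if word in KEYWORDS:
        return LABELS.index(word)        # one of the 10 classes of interest
    return LABELS.index("unknown")       # any of the remaining 25 words

print(to_12_class("yes"), to_12_class("tree"), to_12_class(None))  # 0 10 11
```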