Interspeech 2021
DOI: 10.21437/interspeech.2021-1286

Keyword Transformer: A Self-Attention Model for Keyword Spotting

Cited by 71 publications (23 citation statements)
References 0 publications
“…The key component in Transformers is the MHSA, which contains several attention mechanisms (heads) that can attend to different parts of the input in parallel. We base our explanation on the KWT, proposed in [2]. This model takes as input an MFCC spectrogram of $T$ non-overlapping patches, $X_{\mathrm{MFCC}} \in \mathbb{R}^{T \times F}$, with $t = 1, \ldots, T$ and $f = 1, \ldots, F$ corresponding to time windows and frequencies, respectively.…”
Section: The Keyword Transformer (mentioning)
confidence: 99%
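The MHSA this excerpt describes can be sketched compactly. Below is a minimal, hedged PyTorch illustration of multi-head self-attention over a sequence of $T$ embedded MFCC time patches; the module name, the example sizes (T = 98 patches, embedding dim 192, 3 heads), and the assumption that patches are already linearly embedded are illustrative choices, not the KWT authors' exact code.

```python
import torch
import torch.nn as nn

class MHSA(nn.Module):
    """Minimal multi-head self-attention over T MFCC time patches
    (a sketch of the mechanism described in the excerpt, not KWT's code)."""
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, 3 * dim)   # joint Q, K, V projection
        self.proj = nn.Linear(dim, dim)      # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        # Split into Q, K, V and reshape to (B, heads, T, head_dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        # Each head attends to all T time patches in parallel
        attn = ((q @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.proj(out)

# Example: one utterance, T=98 time windows embedded to dim=192, 3 heads
x = torch.randn(1, 98, 192)
print(MHSA(dim=192, num_heads=3)(x).shape)  # torch.Size([1, 98, 192])
```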
“…Moreover, no special and expensive hardware has to be developed, as only comparisons are used in the algorithm. The evaluation is done on a pretrained Keyword Transformer (KWT) model [2] using the Google Speech Commands Dataset (GSCD) [35], with a focus on the accuracy-complexity trade-off. The results show that the number of computations can be reduced by 4.2× without losing any accuracy, and by 7.5× while sacrificing 1% of the baseline accuracy.…”
Section: Introduction (mentioning)
confidence: 99%
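The excerpt does not spell out the comparison-based algorithm, so the following is only a hedged sketch of the general idea it gestures at: restricting each query to its top-k highest-scoring keys, selected purely by comparisons, so that fewer softmax and weighted-sum computations are spent per query. The function name and the kept-key budget are assumptions for illustration, not the cited work's method.

```python
import torch

def topk_attention(q, k, v, keep: int):
    """Self-attention where each query attends only to its `keep`
    highest-scoring keys, selected with comparisons (torch.topk).
    Smaller `keep` -> fewer attention computations per query.
    Illustrative sketch; not the algorithm evaluated in the citing paper."""
    scores = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5  # (..., T, T)
    vals, idx = scores.topk(keep, dim=-1)                     # comparison-based selection
    sparse = torch.full_like(scores, float('-inf'))
    sparse.scatter_(-1, idx, vals)                            # mask all but the top-k scores
    return sparse.softmax(dim=-1) @ v                         # masked keys get zero weight

# Example: T=98 patches, head_dim=64, keeping 24 of 98 keys per query
q, k, v = (torch.randn(1, 98, 64) for _ in range(3))
print(topk_attention(q, k, v, keep=24).shape)  # torch.Size([1, 98, 64])
```

Sweeping `keep` over a range of budgets and measuring accuracy at each point is one simple way to trace an accuracy-complexity trade-off of the kind the excerpt reports.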
“…Transformers are also gaining predominance in the audio field. There are methods such as TERA (Liu et al., 2021a) and Conformer (Gulati et al., 2020) (convolution-augmented Transformers, used in speech recognition), and then ViT-like approaches such as the Keyword Transformer (KWT) (Berg et al., 2021) and the Audio Spectrogram Transformer (AST) (Gong et al., 2021a). In recent self-supervised audio representation learning methods, Transformer-based encoders have seen much use alongside convolutional or convolutional-recurrent encoders (Liu et al., 2022).…”
Section: Transformers (mentioning)
confidence: 99%
“…"yes", "up", "stop") and the task is to classify these in a 12 or 35 classes setting. The datasets comes pre-partitioned into 35 classes and in order to obtain the 12-classes version, the standard approach [9,20,71] is to keep 10 classes of interest (i.e. "yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"), place the remaining 25 under the "unknown" class and, introduce a new class "silence" where no spoken word appear is the audio clip.…”
Section: Detailed Experimental Setupmentioning
confidence: 99%
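The 12-class construction described above is mechanical enough to state as code. A minimal Python sketch follows; the function name and the class ordering (the 10 keywords first, then "unknown", then "silence") are assumptions, since the excerpt fixes only the class contents, not their indices.

```python
from typing import Optional

# The 10 target keywords kept as their own classes in the 12-class setting
KEYWORDS = ["yes", "no", "up", "down", "left", "right",
            "on", "off", "stop", "go"]
LABELS = KEYWORDS + ["unknown", "silence"]  # 12 classes total

def to_12_class(word: Optional[str]) -> int:
    """Map a raw 35-class word (None = clip with no speech) to an index in 0..11."""
    if word is None:
        return LABELS.index("silence")   # the added "silence" class
    if word in KEYWORDS:
        return LABELS.index(word)        # one of the 10 classes of interest
    return LABELS.index("unknown")       # any of the remaining 25 words

print(to_12_class("yes"), to_12_class("tree"), to_12_class(None))  # 0 10 11
```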