Learning Efficient Representations for Keyword Spotting with Triplet Loss

Vygon, Roman; Mikhaylovskiy, Nikolay

doi:10.1007/978-3-030-87802-3_69

Cited by 29 publications

(12 citation statements)

References 53 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…(%) Model SCV2 Acc. (%) PANN [11] 90.5 RES-15 [21] 97.0 AST [14] 95.6 ± 0.4 AST [14] 98.1 ± 0.05 ERANN [22] 96. which is consistent to AST. Finally, we set 4 network groups with 2, 2, 6, 2 swin-transformer blocks respectively.…”

Section: Modelmentioning

confidence: 99%

HTS-AT: A hierarchical token-semantic audio transformer for sound classification and detection

Chen¹,

Du²,

Zhu³

et al. 2022

Preprint

View full text Add to dashboard Cite

Audio classification is an important task of mapping audio samples into their corresponding labels. Recently, the transformer model with self-attention mechanisms has been adopted in this field. However, existing audio transformers require large GPU memories and long training time, meanwhile relying on pretrained vision models to achieve high performance, which limits the model's scalability in audio tasks. To combat these problems, we introduce HTS-AT: an audio transformer with a hierarchical structure to reduce the model size and training time. It is further combined with a token-semantic module to map final outputs into class featuremaps, thus enabling the model for the audio event detection (i.e. localization in time). We evaluate HTS-AT on three datasets of audio classification where it achieves new state-of-the-art (SOTA) results on AudioSet and ESC-50, and equals the SOTA on Speech Command V2. It also achieves better performance in event localization than the previous CNN-based models. Moreover, HTS-AT requires only 35\% model parameters and 15\% training time of the previous audio transformer. These results demonstrate the high performance and high efficiency of HTS-AT.

show abstract

Section: Modelmentioning

confidence: 99%

HTS-AT: A hierarchical token-semantic audio transformer for sound classification and detection

Chen¹,

Du²,

Zhu³

et al. 2022

Preprint

View full text Add to dashboard Cite

show abstract

“…Inspired by [37] and [38], EdgeCRNN [39] was proposed, an edgecomputing oriented model of acoustic feature enhancement for keyword spotting. Recently, [40] combined a triplet lossbased embedding and a variant of K-Nearest Neighbor (KNN) for classification. We also evaluated our speech augmentation based unsupervised learning method on this dataset, and compared with other unsupervised approaches, including CPC [23], APC [24] and MPC [25].…”

Section: Related Workmentioning

confidence: 99%

Speech Augmentation Based Unsupervised Learning for Keyword Spotting

Wang¹,

Cheng²,

Tang³

et al. 2022

Preprint

View full text Add to dashboard Cite

In this paper, we investigated a speech augmentation based unsupervised learning approach for keyword spotting (KWS) task. KWS is a useful speech application, yet also heavily depends on the labeled data. We designed a CNN-Attention architecture to conduct the KWS task. CNN layers focus on the local acoustic features, and attention layers model the longtime dependency. To improve the robustness of KWS model, we also proposed an unsupervised learning method. The unsupervised loss is based on the similarity between the original and augmented speech features, as well as the audio reconstructing information. Two speech augmentation methods are explored in the unsupervised learning: speed and intensity. The experiments on Google Speech Commands V2 Dataset demonstrated that our CNN-Attention model has competitive results. Moreover, the augmentation based unsupervised learning could further improve the classification accuracy of KWS task. In our experiments, with augmentation based unsupervised learning, our KWS model achieves better performance than other unsupervised methods, such as CPC, APC, and MPC.

show abstract

“…"yes", "up", "stop") and the task is to classify these in a 12 or 35 classes setting. The datasets comes pre-partitioned into 35 classes and in order to obtain the 12-classes version, the standard approach [9,20,71] is to keep 10 classes of interest (i.e. "yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"), place the remaining 25 under the "unknown" class and, introduce a new class "silence" where no spoken word appear is the audio clip.…”

Section: Detailed Experimental Setupmentioning

confidence: 99%

FedorAS: Federated Architecture Search under system heterogeneity

Dudziak¹,

Laskaridis²,

Fernández-Marqués³

2022

Preprint

View full text Add to dashboard Cite

Federated learning (FL) has recently gained considerable attention due to its ability to use decentralised data while preserving privacy. However, it also poses additional challenges related to the heterogeneity of the participating devices, both in terms of their computational capabilities and contributed data. Meanwhile, Neural Architecture Search (NAS) has been successfully used with centralised datasets, producing state-of-the-art results in constrained (hardware-aware) and unconstrained settings. However, even the most recent work laying at the intersection of NAS and FL assumes homogeneous compute environment with datacenter-grade hardware and does not address the issues of working with constrained, heterogeneous devices. As a result, practical usage of NAS in a federated setting remains an open problem that we address in our work. We design our system, FedorAS, to discover and train promising architectures when dealing with devices of varying capabilities holding non-IID distributed data, and present empirical evidence of its effectiveness across different settings. Specifically, we evaluate FedorAS across datasets spanning three different modalities (vision, speech, text) and show its better performance compared to state-of-the-art federated solutions, while maintaining resource efficiency. * Indicates equal contribution.Preprint. Under review.

show abstract

Learning Efficient Representations for Keyword Spotting with Triplet Loss

Cited by 29 publications

References 53 publications

HTS-AT: A hierarchical token-semantic audio transformer for sound classification and detection

HTS-AT: A hierarchical token-semantic audio transformer for sound classification and detection

Speech Augmentation Based Unsupervised Learning for Keyword Spotting

FedorAS: Federated Architecture Search under system heterogeneity

Contact Info

Product

Resources

About