Interspeech 2020
DOI: 10.21437/interspeech.2020-1058
MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition

Cited by 55 publications (25 citation statements)
References: 0 publications
“…Sainath et al [1] introduced CNNs into KWS and showed that CNNs performed well on small footprint keyword spotting. Since then, multiple off-the-shelf CNN backbones have been widely applied to KWS tasks, such as deep residual network (ResNet) [2], separable CNN [3,4,5,6], temporal CNN [7] and SincNet [8]. There are also other efforts to boost performance of CNN models for KWS by combining other deep learning models, such as recurrent neural network (RNN) [9], bidirectional long short-term memory (BiLSTM) [10] and streaming layers [11].…”
Section: Introduction
confidence: 99%
“…With the advent of edge computing, research on KWS based on deep learning has been devoted to increasing performance by achieving faster inference or decreasing the number of parameters [19], [20]. Temporal convolution [19] can reduce the number of parameters of existing models, and a 1D time-channel separable convolutional neural network [20] has further lightened the model. Likewise, the performance degradation after reducing the size of a ResNet model has been prevented by using data augmentation [34].…”
Section: B. KWS
confidence: 99%
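The excerpt above attributes the model "lightening" to the 1D time-channel separable convolution. A minimal sketch of that idea in PyTorch (illustrative, not the published MatchboxNet code; the layer sizes are arbitrary assumptions): a standard 1D convolution is factored into a depthwise convolution along time followed by a pointwise (kernel-size-1) convolution across channels.

```python
# Sketch of a 1D time-channel separable convolution: depthwise temporal
# filtering per channel, then pointwise channel mixing. Illustrative
# only; not the authors' published implementation.
import torch
import torch.nn as nn

class TimeChannelSeparableConv1d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int):
        super().__init__()
        # Depthwise: one temporal filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # Pointwise: mixes information across channels with a size-1 kernel.
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, time).
        return self.pointwise(self.depthwise(x))

def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

standard = nn.Conv1d(64, 128, kernel_size=9, padding=4)
separable = TimeChannelSeparableConv1d(64, 128, kernel_size=9)
print(n_params(standard), n_params(separable))  # 73856 vs. 8960
```

For this assumed 64-to-128-channel layer with kernel size 9, the standard convolution has roughly 74k parameters versus roughly 9k for the separable version, which is the parameter reduction the excerpt refers to.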
“…Commands to control applications and services include "play the music," "turn off," and "how is the weather tomorrow?" While the applicability of neural networks to KWS has been demonstrated, recent studies have pursued performance improvement and reduction in the number of parameters [18]-[20], and other studies have focused on improving the real-time KWS performance [12], [21].…”
Section: Introduction
confidence: 99%
“…They proposed a Convolutional Neural Network-Time Distributed (CNN-TD) model (with 740K parameters) that outperformed existing models including Bi-LSTM (with 300K parameters), CLDNN (with 1M parameters) and ResNet 960 (with 30M parameters) [9] on the benchmark evaluation dataset AVA-speech. Furthermore, acoustic models using 1D CNNs have shown great potential in automatic speech recognition [14,15,16] and speech command detection [17] tasks.…”
Section: Introduction
confidence: 99%
“…Built on top of previous successful applications of 1D CNNs to speech processing tasks, we introduce MarbleNet, a compact end-to-end neural network for VAD inspired by the QuartzNet architecture [14] and the MatchboxNet model [17]. MarbleNet is constructed with a stack of blocks with residual connections [18].…”
Section: Introduction
confidence: 99%
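The excerpt describes MarbleNet as a stack of blocks with residual connections in the QuartzNet/MatchboxNet style. A minimal sketch, assuming each block repeats (time-channel separable conv → batch norm → ReLU) with a skip connection around the block; the class name, channel width, kernel size, and repeat count below are illustrative assumptions, not the published MarbleNet configuration.

```python
# Sketch of a residual block built from 1D time-channel separable
# convolutions, following the QuartzNet/MatchboxNet pattern quoted
# above. Hyperparameters are assumed for illustration.
import torch
import torch.nn as nn

class ResidualSeparableBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int, repeat: int = 2):
        super().__init__()
        layers = []
        for i in range(repeat):
            layers += [
                nn.Conv1d(channels, channels, kernel_size,
                          padding=kernel_size // 2, groups=channels),  # depthwise (time)
                nn.Conv1d(channels, channels, kernel_size=1),          # pointwise (channels)
                nn.BatchNorm1d(channels),
            ]
            # Omit the last ReLU: the final activation is applied after the skip-add.
            if i < repeat - 1:
                layers.append(nn.ReLU())
        self.body = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.body(x) + x)  # residual (skip) connection

# A small stack of such blocks over 64-channel spectral features.
net = nn.Sequential(*[ResidualSeparableBlock(64, kernel_size=13) for _ in range(3)])
out = net(torch.randn(8, 64, 100))  # (batch, channels, time) -> same shape
```

Keeping the channel count constant inside the block lets the identity skip connection add directly to the block output, which is what allows such networks to stay compact while remaining trainable at depth.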