Interspeech 2020
DOI: 10.21437/interspeech.2020-1058
MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture for Speech Commands Recognition

Cited by 55 publications (25 citation statements)
References: 0 publications
“…Sainath et al [1] introduced CNNs into KWS and showed that CNNs performed well on small footprint keyword spotting. Since then, multiple off-the-shelf CNN backbones have been widely applied to KWS tasks, such as deep residual network (ResNet) [2], separable CNN [3,4,5,6], temporal CNN [7] and SincNet [8]. There are also other efforts to boost performance of CNN models for KWS by combining other deep learning models, such as recurrent neural network (RNN) [9], bidirectional long short-term memory (BiLSTM) [10] and streaming layers [11].…”
Section: Introduction
confidence: 99%
“…With the advent of edge computing, research on KWS based on deep learning has been devoted to increasing performance by achieving faster inference or decreasing the number of parameters [19], [20]. Temporal convolution [19] can reduce the number of parameters of existing models, and a 1D time-channel separable convolutional neural network [20] has further lightened the model. Likewise, the performance degradation after reducing the size of a ResNet model has been prevented by using data augmentation [34].…”
Section: B. KWS
confidence: 99%
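The excerpt above attributes the model "lightening" to the 1D time-channel separable convolution. A minimal sketch of that idea in PyTorch (illustrative, not the published MatchboxNet code; the layer sizes are arbitrary assumptions): a standard 1D convolution is factored into a depthwise convolution along time followed by a pointwise (kernel-size-1) convolution across channels.

```python
# Sketch of a 1D time-channel separable convolution: depthwise temporal
# filtering per channel, then pointwise channel mixing. Illustrative
# only; not the authors' published implementation.
import torch
import torch.nn as nn

class TimeChannelSeparableConv1d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int):
        super().__init__()
        # Depthwise: one temporal filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # Pointwise: mixes information across channels with a size-1 kernel.
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, time).
        return self.pointwise(self.depthwise(x))

def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

standard = nn.Conv1d(64, 128, kernel_size=9, padding=4)
separable = TimeChannelSeparableConv1d(64, 128, kernel_size=9)
print(n_params(standard), n_params(separable))  # 73856 vs. 8960
```

For this assumed 64-to-128-channel layer with kernel size 9, the standard convolution has roughly 74k parameters versus roughly 9k for the separable version, which is the parameter reduction the excerpt refers to.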
“…Commands to control applications and services include "play the music," "turn off," and "how is the weather tomorrow?" While the applicability of neural networks to KWS has been demonstrated, recent studies have pursued performance improvement and reduction in the number of parameters [18]-[20], and other studies have focused on improving the real-time KWS performance [12], [21].…”
Section: Introduction
confidence: 99%
“…They proposed a Convolutional Neural Network-Time Distributed (CNN-TD) model (with 740K parameters) that outperformed existing models including Bi-LSTM (with 300K parameters), CLDNN (with 1M parameters) and ResNet 960 (with 30M parameters) [9] on the benchmark evaluation dataset AVA-speech. Furthermore, acoustic models using 1D CNNs have shown great potential in automatic speech recognition [14,15,16] and speech command detection [17] tasks.…”
Section: Introduction
confidence: 99%
“…Built on top of previous successful applications of 1D CNNs to speech processing tasks, we introduce MarbleNet, a compact end-to-end neural network for VAD inspired by the QuartzNet architecture [14] and the MatchboxNet model [17]. MarbleNet is constructed with a stack of blocks with residual connections [18].…”
Section: Introduction
confidence: 99%
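The excerpt describes MarbleNet as a stack of blocks with residual connections in the QuartzNet/MatchboxNet style. A minimal sketch, assuming each block repeats (time-channel separable conv → batch norm → ReLU) with a skip connection around the block; the class name, channel width, kernel size, and repeat count below are illustrative assumptions, not the published MarbleNet configuration.

```python
# Sketch of a residual block built from 1D time-channel separable
# convolutions, following the QuartzNet/MatchboxNet pattern quoted
# above. Hyperparameters are assumed for illustration.
import torch
import torch.nn as nn

class ResidualSeparableBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int, repeat: int = 2):
        super().__init__()
        layers = []
        for i in range(repeat):
            layers += [
                nn.Conv1d(channels, channels, kernel_size,
                          padding=kernel_size // 2, groups=channels),  # depthwise (time)
                nn.Conv1d(channels, channels, kernel_size=1),          # pointwise (channels)
                nn.BatchNorm1d(channels),
            ]
            # Omit the last ReLU: the final activation is applied after the skip-add.
            if i < repeat - 1:
                layers.append(nn.ReLU())
        self.body = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.body(x) + x)  # residual (skip) connection

# A small stack of such blocks over 64-channel spectral features.
net = nn.Sequential(*[ResidualSeparableBlock(64, kernel_size=13) for _ in range(3)])
out = net(torch.randn(8, 64, 100))  # (batch, channels, time) -> same shape
```

Keeping the channel count constant inside the block lets the identity skip connection add directly to the block output, which is what allows such networks to stay compact while remaining trainable at depth.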