2021
DOI: 10.48550/arxiv.2108.02598
Preprint

Knowledge Distillation from BERT Transformer to Speech Transformer for Intent Classification

Abstract: End-to-end intent classification using speech has numerous advantages over the conventional pipeline approach of automatic speech recognition (ASR) followed by natural language processing modules. It attempts to predict intent from speech without using an intermediate ASR module. However, such an end-to-end framework suffers from the unavailability of large speech resources with higher acoustic variation in spoken language understanding. In this work, we exploit the scope of the transformer distillatio…
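Since the abstract centres on distilling knowledge from a BERT teacher into a speech-transformer student, a minimal sketch of a standard temperature-scaled, soft-label distillation loss may help make the idea concrete. This is an illustrative PyTorch example, not the paper's actual objective; the temperature, weighting, and variable names are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a softened KL term (teacher -> student) with the usual
    hard-label cross-entropy on the intent labels."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between temperature-softened distributions,
    # scaled by T^2 as is conventional for distillation.
    kd = F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Illustrative usage: teacher_logits would come from a frozen BERT text
# encoder reading the transcript, student_logits from the speech transformer
# reading the audio; only the student's parameters are updated.
```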

Cited by 2 publications (3 citation statements)
References 29 publications
“…Its architecture is similar to BERT, but the token-type embeddings and pooler are removed and the number of layers is reduced by a factor of two. All other vital operations, such as the linear layer and layer normalisation, are highly optimised in modern linear algebra frameworks [55]. ELECTRA Small is another pre-trained transformer model introduced by Google.…”
Section: Sentence Embedding Techniques
Mentioning confidence: 99%
“…For the remaining training data, sentence A is paired with a random sentence B under the 'NotNext' label. This pre-training is very beneficial for Question Answering and Natural Language Inference [55].…”
Section: A. BERT - Bidirectional Encoder Representations from Transformers
Mentioning confidence: 99%
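The 'IsNext'/'NotNext' pairing described in the citation statement above can be illustrated with a short, hypothetical sketch of how such sentence pairs are built from a corpus. This is not the cited paper's code; a production implementation would also guard against sampling the true successor as a negative.

```python
import random

def make_nsp_pairs(sentences, neg_prob=0.5):
    """Build (sentence_A, sentence_B, label) triples: half the time B is the
    true successor ("IsNext"), otherwise a randomly drawn sentence ("NotNext")."""
    pairs = []
    for i in range(len(sentences) - 1):
        if random.random() < neg_prob:
            pairs.append((sentences[i], random.choice(sentences), "NotNext"))
        else:
            pairs.append((sentences[i], sentences[i + 1], "IsNext"))
    return pairs
```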
“…Recent model compression works fall under three general classes: pruning, which forces some weights or activations to zero [20-22, 26, 32, 34, 40, 47], combined with "zero-aware" memory encoding; knowledge distillation, which distills a larger "teacher" model into a smaller "student" model [1, 17, 24, 27, 29, 37, 43]; and quantization, where the parameters and/or activations are quantized to shorter bit-widths [6, 19, 39, 50, 51, 54].…”
Section: Introduction
Mentioning confidence: 99%
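Of the three compression classes named in the citation statement above, the distillation objective is sketched earlier on this page; the "force small weights to zero" idea behind pruning can likewise be illustrated with a tiny magnitude-pruning helper. This is a generic PyTorch sketch of unstructured magnitude pruning, not code from any of the cited works; the 30% sparsity target is an arbitrary placeholder.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.3) -> torch.Tensor:
    """Return a copy of `weight` with its smallest-magnitude entries zeroed,
    so that roughly `sparsity` of the entries become zero."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    # Threshold at the k-th smallest absolute value, then mask everything below it.
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    return weight * mask
```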