“…Recent model compression works fall into three general classes. Pruning forces some weights or activations to zero [20-22, 26, 32, 34, 40, 47] and is often combined with "zero-aware" memory encodings that store only the remaining nonzero values. Knowledge distillation trains a smaller "student" model to mimic a larger "teacher" model [1, 17, 24, 27, 29, 37, 43]. Lastly, quantization reduces parameters and/or activations to shorter bit-widths [6, 19, 39, 50, 51, 54].…”
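To make the three classes concrete, below is a minimal NumPy sketch of the core idea behind each: magnitude-based pruning, temperature-softened distillation targets, and uniform quantization. These are generic illustrations, not the specific methods of the cited works; the function names and the sparsity, temperature, and bit-width settings are arbitrary choices for the example.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Pruning: force the smallest-magnitude fraction `sparsity` of weights to zero."""
    threshold = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < threshold, 0.0, w)

def soft_targets(teacher_logits: np.ndarray, T: float = 2.0) -> np.ndarray:
    """Distillation: temperature-softened teacher outputs for a student to match."""
    z = teacher_logits / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=-1, keepdims=True)

def uniform_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Quantization: snap weights to 2**bits uniformly spaced levels, then dequantize."""
    levels = 2 ** bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / levels
    codes = np.round((w - w_min) / scale)  # integer codes in [0, levels]
    return codes * scale + w_min

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
print(magnitude_prune(w, 0.5))   # half the entries become zero; a "zero-aware"
                                 # encoding would then store only the nonzeros
print(uniform_quantize(w, 4))    # every entry snapped to one of 16 values
print(soft_targets(rng.normal(size=(2, 10))))  # each row is a valid distribution
```

In practice these ideas compose: a pruned, quantized student model can be trained against soft targets, which is why the three classes are often combined rather than used in isolation.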