Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.436

Tensorized Embedding Layers

Abstract: The embedding layers transforming input words into real vectors are the key components of deep neural networks used in natural language processing. However, when the vocabulary is large, the corresponding weight matrices can be enormous, which precludes their deployment in a limited resource setting. We introduce a novel way of parameterizing embedding layers based on the Tensor Train decomposition, which allows compressing the model significantly at the cost of a negligible drop or even a slight gain in performance.
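As a rough, back-of-the-envelope illustration of the compression described in the abstract, the sketch below reshapes a vocabulary-by-dimension embedding matrix into a higher-order tensor and compares the parameter count of a tensor-train factorization with that of the dense matrix. The factorizations of the vocabulary size and embedding dimension and the TT rank are illustrative assumptions, not the settings used in the paper.

```python
# Hedged sketch: parameter count of a TT-factorized embedding layer vs. a dense one.
# The vocabulary/dimension factorizations and the TT rank below are illustrative
# assumptions, not the configuration used in the paper.

import numpy as np

vocab_factors = [25, 40, 50]   # 25 * 40 * 50 = 50,000 vocabulary entries
dim_factors   = [8, 8, 12]     # 8 * 8 * 12  = 768-dimensional embeddings
tt_rank = 16                   # internal TT rank (boundary ranks are 1)

# Dense embedding matrix: |V| x d parameters.
dense_params = np.prod(vocab_factors) * np.prod(dim_factors)

# TT cores: the k-th core has shape (r_k, v_k * d_k, r_{k+1}),
# with r_0 = r_K = 1 and the internal ranks equal to tt_rank.
ranks = [1] + [tt_rank] * (len(vocab_factors) - 1) + [1]
tt_params = sum(
    ranks[k] * vocab_factors[k] * dim_factors[k] * ranks[k + 1]
    for k in range(len(vocab_factors))
)

print(f"dense parameters: {dense_params:,}")        # 38,400,000
print(f"TT parameters:    {tt_params:,}")           # 94,720
print(f"compression ratio: {dense_params / tt_params:.1f}x")
```

With these illustrative factors, the TT cores hold roughly 95K parameters in place of the 38.4M entries of the dense matrix, a compression of several hundred times.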


Cited by 20 publications (9 citation statements). References 24 publications.
“…Matrix decomposition (e.g., ALBERT (Lan et al., 2019) in the embedding layer, and (Noach & Goldberg, 2020)) can decrease the parameter scale by a linear factor that depends on the selected rank. More advanced tensor decomposition approaches can be implemented with tensor networks, which have recently been used to compress general neural networks (Gao et al., 2020; Novikov et al., 2015) and embedding layers (Khrulkov et al., 2019; Hrinchuk et al., 2020; Panahi et al., 2019).…”
Section: Methods (mentioning)
confidence: 99%
“…Among the promising directions, we should mention computations with reduced precision, approximate methods [Novikov et al., 2022], randomized computations, and structured NN layers [Hrinchuk et al., 2020], including those based on tensor factorizations. We should highlight the importance of these approximate methods being additive, in the sense that they can be combined and still provide sufficient performance with reasonable quality degradation.…”
Section: Conclusion and Further Research (mentioning)
confidence: 99%
“…(Chen et al., 2018) proposed a blockwise low-rank approximation method for word embeddings. (Hrinchuk et al., 2020) devised a way of interpreting an embedding matrix as a 3-dimensional tensor and proposed an embedding structure obtained by decomposing it with tensor-train decomposition. (Panahi et al., 2020) proposed a small-size word embedding structure inspired by quantum entanglement.…”
Section: Related Work (mentioning)
confidence: 99%
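As a minimal sketch of the structure described in the excerpt above, the code below reconstructs a single embedding row by contracting tensor-train cores indexed with the mixed-radix digits of a word id. The shapes, rank, and indexing scheme are illustrative assumptions rather than the paper's implementation.

```python
# Hedged sketch: reconstructing one embedding row from TT cores.
# Shapes, rank, and the mixed-radix indexing scheme are illustrative assumptions.

import numpy as np

vocab_factors = [25, 40, 50]          # |V| = 50,000
dim_factors   = [8, 8, 12]            # d   = 768
ranks = [1, 16, 16, 1]

rng = np.random.default_rng(0)
# Core k holds an (r_k, v_k, d_k, r_{k+1}) array of parameters.
cores = [
    rng.standard_normal((ranks[k], vocab_factors[k], dim_factors[k], ranks[k + 1]))
    for k in range(3)
]

def tt_embedding_row(word_id: int) -> np.ndarray:
    """Return the d-dimensional embedding of `word_id` by contracting the TT cores."""
    # Decompose the word id into mixed-radix digits (i_1, i_2, i_3).
    digits = []
    for v in reversed(vocab_factors):
        digits.append(word_id % v)
        word_id //= v
    digits.reverse()

    # Contract cores along the rank dimensions; the result has shape (1, d_1, d_2, d_3, 1).
    out = cores[0][:, digits[0], :, :]                 # (1, d_1, r_1)
    for k in range(1, 3):
        slice_k = cores[k][:, digits[k], :, :]         # (r_k, d_k, r_{k+1})
        out = np.einsum('...r,rds->...ds', out, slice_k)
    return out.reshape(-1)                             # length 8 * 8 * 12 = 768

print(tt_embedding_row(12345).shape)  # (768,)
```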
“…TensorTrain is the tensor-train-decomposition-based method of (Hrinchuk et al., 2020). In (Hrinchuk et al., 2020), TensorTrain is computed by training from scratch.…”
Section: Implementation Details (mentioning)
confidence: 99%