Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1011

Learning Compressed Sentence Representations for On-Device Text Processing

Abstract: Vector representations of sentences, trained on massive text corpora, are widely used as generic sentence embeddings across a variety of NLP problems. The learned representations are generally assumed to be continuous and real-valued, giving rise to a large memory footprint and slow retrieval speed, which hinders their applicability to low-resource (memory and computation) platforms, such as mobile devices. In this paper, we propose four different strategies to transform continuous and generic sentence embeddi…
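
The abstract is truncated above, but its central idea, converting continuous sentence embeddings into compact codes so they fit memory- and compute-constrained devices, can be illustrated with a minimal sketch. The Python example below uses one generic strategy, hard thresholding at the per-dimension median followed by bit-packing; it is an illustration of binarized sentence codes under assumed settings, not necessarily one of the four strategies proposed in the paper, and the embeddings are synthetic stand-ins.

import numpy as np

def binarize_embeddings(embeddings: np.ndarray) -> np.ndarray:
    """Binarize continuous embeddings by thresholding each dimension at its
    corpus-wide median, then pack the bits (1 bit per dimension)."""
    thresholds = np.median(embeddings, axis=0)          # per-dimension threshold
    bits = (embeddings > thresholds).astype(np.uint8)   # {0, 1} per dimension
    return np.packbits(bits, axis=1)                    # 8 dimensions per byte

def hamming_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two packed binary codes: fraction of matching bits."""
    differing = np.unpackbits(np.bitwise_xor(a, b))
    return 1.0 - float(differing.mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for real sentence embeddings (e.g. 4096-d vectors).
    dense = rng.normal(size=(1000, 4096)).astype(np.float32)
    codes = binarize_embeddings(dense)
    print(f"compression ratio: {dense.nbytes / codes.nbytes:.0f}x")   # ~32x
    print(f"similarity(0, 1):  {hamming_similarity(codes[0], codes[1]):.3f}")

Binary codes of this kind are attractive on-device because similarity search reduces to XOR and popcount over packed bytes rather than floating-point dot products.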

Cited by 9 publications (8 citation statements)
References 29 publications (43 reference statements)
“…Simultaneously, we find that our static embeddings substantially outperform Word2Vec and GloVe, which suggests our method serves the dual purpose of being a lightweight mechanism for generating static embeddings that track with advances in contextualized representations. Since static embeddings have significant advantages with respect to speed, computational resources, and ease of use, these results have important implications for resource-constrained settings (Shen et al., 2019), environmental concerns (Strubell et al., 2019), and the broader accessibility of NLP technologies. Alongside more developed methods for embedding analysis, the static embedding setting is also equipped with a richer body of work regarding social bias.…”
Section: Introduction
Mentioning (confidence: 99%)
“…Research works have been carried out to model the order of words when learning the distributed sentence representation (Le and Mikolov, 2014; Kiros et al., 2015; Conneau et al., 2017; Pagliardini et al., 2018; Gupta et al., 2019; Shen et al., 2019). Le and Mikolov propose Doc2vec (Le and Mikolov, 2014) to add a paragraph vector to represent the missing information from the current context.…”
Section: Sentence Embedding
Mentioning (confidence: 99%)
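
The excerpt above summarizes the paragraph-vector idea behind Doc2vec (Le and Mikolov, 2014). As a minimal illustration, the sketch below uses gensim's Doc2Vec; the tiny corpus and hyperparameters are placeholders rather than settings from any of the cited papers.

# Minimal paragraph-vector sketch using gensim (illustrative corpus and settings).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["compressed", "sentence", "representations"], tags=[0]),
    TaggedDocument(words=["on", "device", "text", "processing"], tags=[1]),
]
# Each document gets a paragraph vector trained jointly with the word vectors.
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# Infer a paragraph vector for an unseen token sequence.
vec = model.infer_vector(["compressed", "text", "processing"])
print(vec.shape)  # (50,)
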
“…Gupta et al. (2019) propose two modifications of Word2vec by considering higher-order word n-grams along with uni-grams during training. Shen et al. (2019) use InferSent (Conneau et al., 2017) for sentence embeddings based on word vectors learned by GloVe (Pennington et al., 2014) or FastText (Joulin et al., 2017). Gupta et al. (2019) claim that training word embeddings along with higher n-gram embeddings helps in the removal of the contextual information from the uni-gram, resulting in better stand-alone word embeddings.…”
Section: Sentence Embedding
Mentioning (confidence: 99%)
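
The excerpt above describes composing representations from uni-gram plus higher-order n-gram embeddings. The toy sketch below shows only the composition step, with randomly initialized lookup tables standing in for embeddings that would in practice be learned jointly on a large corpus.

# Toy uni-gram + bigram composition; vocabularies and vectors are illustrative.
import numpy as np

DIM = 8
rng = np.random.default_rng(0)
unigram_vecs = {w: rng.normal(size=DIM)
                for w in ["on", "device", "text", "processing"]}
bigram_vecs = {("on", "device"): rng.normal(size=DIM),
               ("text", "processing"): rng.normal(size=DIM)}

def embed(tokens):
    """Average the uni-gram vectors and any matching bigram vectors."""
    grams = [unigram_vecs[t] for t in tokens if t in unigram_vecs]
    grams += [bigram_vecs[b] for b in zip(tokens, tokens[1:]) if b in bigram_vecs]
    return np.mean(grams, axis=0)

print(embed(["on", "device", "text", "processing"]).shape)  # (8,)
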
“…Quantization and other compression techniques have been explored for word embeddings (Shu and Nakayama, 2017; Tissier et al., 2019) and sentence embeddings (Shen et al., 2019; Fan et al., 2020). Quantization is complementary to the approaches we consider and is explored more in Section 5.…”
Section: Related Work
Mentioning (confidence: 99%)
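
The excerpt above notes that quantization is complementary to the compression strategies considered for sentence embeddings. The sketch below shows generic uniform 8-bit scalar quantization of an embedding matrix; it is an illustrative scheme, not the specific method of any cited paper, and the embeddings are synthetic.

# Uniform 8-bit scalar quantization of a float embedding matrix (illustrative).
import numpy as np

def quantize_uint8(x: np.ndarray):
    """Map floats to uint8 codes; return codes plus (scale, offset) to dequantize."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes: np.ndarray, scale: float, lo: float) -> np.ndarray:
    return codes.astype(np.float32) * scale + lo

if __name__ == "__main__":
    emb = np.random.default_rng(1).normal(size=(100, 300)).astype(np.float32)
    codes, scale, lo = quantize_uint8(emb)
    approx = dequantize(codes, scale, lo)
    print(emb.nbytes // codes.nbytes)            # 4x smaller than float32
    print(np.abs(emb - approx).max())            # small reconstruction error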