Using the Output Embedding to Improve Language Models
2016 | Preprint
DOI: 10.48550/arxiv.1608.05859

Cited by 68 publications (74 citation statements)
References: 0 publications

“…We evaluate the performance of model slicing on state-of-the-art neural networks on two categories of public benchmark tasks, specifically evaluating model slicing for dense layers, i.e. fully-connected and recurrent layers on language modeling [37,54,39] in Section 5.2 and evaluating model slicing for convolutional layers on image classification [43,16,53] in Section 5.3. Experimental setups of model slicing are provided in Section 5.1; cascade ranking simulation of example applications and visualization on the model slicing training are given in Section 5.4 and Section 5.5 respectively.…”
Section: Methods
confidence: 99%
“…Neural Network Language Modeling (NNLM) comprises both fully-connected and recurrent layers; we thus adopt NNLM to evaluate the effectiveness of model slicing for dense layers. NNLM [37,54,39] specifies the distribution over next word w_{t+1} given its preceding word sequence w_{1:t} = [w_1, w_2, …”
Section: Language Modeling Task and Dataset
confidence: 99%
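To make the quoted description concrete, here is a minimal sketch of such an NNLM in PyTorch (an assumption; the citing paper does not name a framework): a word embedding feeds a recurrent layer, and a softmax over the vocabulary yields p(w_{t+1} | w_{1:t}). The class name, layer choice (LSTM), and sizes are illustrative, not taken from the cited works.

```python
# Illustrative sketch only: layer choice (LSTM), sizes, and names are assumptions,
# not taken from the cited works.
import torch
import torch.nn as nn

class NNLM(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=256, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)    # input (word) embedding
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)       # output projection -> softmax

    def forward(self, prefix):
        # prefix: LongTensor of token ids w_1..w_t, shape (batch, t)
        h, _ = self.lstm(self.embedding(prefix))
        logits = self.decoder(h[:, -1])                        # last hidden state predicts w_{t+1}
        return torch.log_softmax(logits, dim=-1)               # log p(w_{t+1} | w_{1:t})

model = NNLM()
log_probs = model(torch.randint(0, 10000, (2, 12)))            # dummy batch of 2 prefixes, t = 12
```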
“…For the textual input, we use a vocabulary size of 32,000 and a max sequence length of 256 in both the encoder and decoder. We also share parameters between the embedding and the decoder softmax output layer (Press & Wolf, 2016).…”
Section: Pretraining
confidence: 99%
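The parameter sharing cited here (Press & Wolf, 2016) amounts to reusing the input embedding matrix as the decoder's pre-softmax output projection. A minimal sketch in PyTorch, assuming the embedding and decoder hidden sizes match; only the 32,000-word vocabulary comes from the quoted statement, so d_model is a placeholder:

```python
# Sketch of embedding/output-layer weight tying; d_model is a placeholder,
# only the 32,000-word vocabulary comes from the quoted statement.
import torch.nn as nn

vocab_size, d_model = 32000, 512
embedding = nn.Embedding(vocab_size, d_model)             # input embedding of the decoder
output_proj = nn.Linear(d_model, vocab_size, bias=False)  # pre-softmax output projection
output_proj.weight = embedding.weight                     # tie: one (vocab_size x d_model) matrix
assert output_proj.weight.data_ptr() == embedding.weight.data_ptr()
```

With the two layers tied, a single vocabulary-sized matrix serves both roles, and gradients from the output softmax update the input embedding directly.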
“…Note that at the 10-th epoch we switch from the vanilla LSTM model to the hybrid architecture; at that point we also decay the learning rate by a factor of 0.5. We also tie the word embedding and SoftMax weights (Press & Wolf, 2016).…”
Section: Detailed Hyper-parameters Used In Our Experiments
confidence: 99%
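A minimal sketch of the learning-rate schedule described in this statement, assuming a PyTorch-style training loop: the rate is halved once the epoch count reaches 10, the same point at which the quoted work switches architectures. The optimizer, base rate, and stand-in model are placeholders, and the LSTM-to-hybrid switch itself is not shown.

```python
# Sketch of the schedule only: optimizer, base learning rate, and the stand-in
# model are placeholders; the LSTM-to-hybrid architecture switch is not shown.
import torch
import torch.nn as nn

model = nn.Linear(8, 8)                                    # stand-in for the tied-weight LM
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10], gamma=0.5)

for epoch in range(20):
    # ... one epoch of training batches would run here ...
    optimizer.step()                                       # no-op here (no gradients computed)
    scheduler.step()                                       # lr is halved once 10 epochs have passed
```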