Using the Output Embedding to Improve Language Models
2016 | Preprint
DOI: 10.48550/arxiv.1608.05859

Cited by 68 publications (74 citation statements)
References: 0 publications

“…We evaluate the performance of model slicing on state-of-the-art neural networks on two categories of public benchmark tasks, specifically evaluating model slicing for dense layers, i.e. fully-connected and recurrent layers on language modeling [37,54,39] in Section 5.2 and evaluating model slicing for convolutional layers on image classification [43,16,53] in Section 5.3. Experimental setups of model slicing are provided in Section 5.1; cascade ranking simulation of example applications and visualization on the model slicing training are given in Section 5.4 and Section 5.5 respectively.…”
Section: Methods
confidence: 99%
“…Neural Network Language Modeling (NNLM) comprises both fully-connected and recurrent layers; we thus adopt NNLM to evaluate the effectiveness of model slicing for dense layers. NNLM [37,54,39] specifies the distribution over next word w_{t+1} given its preceding word sequence w_{1:t} = [w_1, w_2, …”
Section: Language Modeling Task and Dataset
confidence: 99%
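To make the quoted description concrete, here is a minimal sketch of such an NNLM in PyTorch (an assumption; the citing paper does not name a framework): a word embedding feeds a recurrent layer, and a softmax over the vocabulary yields p(w_{t+1} | w_{1:t}). The class name, layer choice (LSTM), and sizes are illustrative, not taken from the cited works.

```python
# Illustrative sketch only: layer choice (LSTM), sizes, and names are assumptions,
# not taken from the cited works.
import torch
import torch.nn as nn

class NNLM(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=256, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)    # input (word) embedding
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)       # output projection -> softmax

    def forward(self, prefix):
        # prefix: LongTensor of token ids w_1..w_t, shape (batch, t)
        h, _ = self.lstm(self.embedding(prefix))
        logits = self.decoder(h[:, -1])                        # last hidden state predicts w_{t+1}
        return torch.log_softmax(logits, dim=-1)               # log p(w_{t+1} | w_{1:t})

model = NNLM()
log_probs = model(torch.randint(0, 10000, (2, 12)))            # dummy batch of 2 prefixes, t = 12
```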
“…For the textual input, we use a vocabulary size of 32,000 and a max sequence length of 256 in both the encoder and decoder. We also share parameters between the embedding and the decoder softmax output layer (Press & Wolf, 2016).…”
Section: Pretraining
confidence: 99%
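The parameter sharing cited here (Press & Wolf, 2016) amounts to reusing the input embedding matrix as the decoder's pre-softmax output projection. A minimal sketch in PyTorch, assuming the embedding and decoder hidden sizes match; only the 32,000-word vocabulary comes from the quoted statement, so d_model is a placeholder:

```python
# Sketch of embedding/output-layer weight tying; d_model is a placeholder,
# only the 32,000-word vocabulary comes from the quoted statement.
import torch.nn as nn

vocab_size, d_model = 32000, 512
embedding = nn.Embedding(vocab_size, d_model)             # input embedding of the decoder
output_proj = nn.Linear(d_model, vocab_size, bias=False)  # pre-softmax output projection
output_proj.weight = embedding.weight                     # tie: one (vocab_size x d_model) matrix
assert output_proj.weight.data_ptr() == embedding.weight.data_ptr()
```

With the two layers tied, a single vocabulary-sized matrix serves both roles, and gradients from the output softmax update the input embedding directly.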
“…Note that at the 10-th epoch we switch from the vanilla LSTM model to the hybrid architecture; at that point we also decay the learning rate by a factor of 0.5. We also tie the word embedding and SoftMax weights (Press & Wolf, 2016).…”
Section: Detailed Hyper-parameters Used In Our Experiments
confidence: 99%
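A minimal sketch of the learning-rate schedule described in this statement, assuming a PyTorch-style training loop: the rate is halved once the epoch count reaches 10, the same point at which the quoted work switches architectures. The optimizer, base rate, and stand-in model are placeholders, and the LSTM-to-hybrid switch itself is not shown.

```python
# Sketch of the schedule only: optimizer, base learning rate, and the stand-in
# model are placeholders; the LSTM-to-hybrid architecture switch is not shown.
import torch
import torch.nn as nn

model = nn.Linear(8, 8)                                    # stand-in for the tied-weight LM
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10], gamma=0.5)

for epoch in range(20):
    # ... one epoch of training batches would run here ...
    optimizer.step()                                       # no-op here (no gradients computed)
    scheduler.step()                                       # lr is halved once 10 epochs have passed
```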