2008
DOI: 10.1109/tnn.2007.912312

Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model

Abstract: Previous work on statistical language modeling has shown that it is possible to train a feed-forward neural network to approximate probabilities over sequences of words, resulting in significant error reduction when compared to standard baseline models. However, in order to train the model on the maximum likelihood criterion, one has to make, for each example, as many network passes as there are words in the vocabulary. We introduce adaptive importance sampling as a way to accelerate training of the …
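The idea summarized in the abstract lends itself to a short illustration. The NumPy sketch below is not the authors' exact algorithm: all names are illustrative, and a fixed unigram proposal stands in for the paper's adaptive n-gram proposal. It shows how the gradient of the negative log-likelihood can be estimated from the target word plus a handful of sampled words instead of scoring the full vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)

V, d, k = 10_000, 64, 25     # vocabulary size, hidden size, sampled words

# Toy output layer: score(w | h) = W[w] @ h + b[w]
W = rng.normal(scale=0.1, size=(V, d))
b = np.zeros(V)

# Proposal distribution Q -- a fixed unigram here; the paper instead
# adapts an n-gram proposal during training (not reproduced in this sketch).
Q = rng.random(V)
Q /= Q.sum()

def sampled_grad(h, target):
    """Importance-sampling estimate of d(-log P(target | h)) / d(score).

    The exact gradient needs the softmax over all V words; here only the
    target plus k words drawn from Q are scored, and the weights
    exp(score) / Q approximately correct for the proposal bias.
    """
    samples = rng.choice(V, size=k, p=Q)
    words = np.concatenate(([target], samples))
    scores = W[words] @ h + b[words]

    log_w = scores - np.log(Q[words])   # log importance weights
    log_w -= log_w.max()                # numerical stability
    r = np.exp(log_w)
    p_hat = r / r.sum()                 # self-normalized softmax estimate

    g = p_hat.copy()
    g[0] -= 1.0                         # subtract 1 at the target position
    return words, g                     # gradient rows to apply to W[words]

words, g = sampled_grad(rng.normal(size=d), target=42)
print(len(words), round(g.sum(), 6))    # k + 1 words, gradient sums to ~0
```

Per training example this touches only k + 1 rows of the output layer instead of all V, which is where the reported acceleration comes from.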

Cited by 323 publications (347 citation statements)
References 16 publications
“…(2).) To address this problem, we use the approach presented in (Jean et al, 2015), which is based on importance sampling (Bengio and Sénécal, 2008). During training, we choose a smaller vocabulary size τ and divide the training set into partitions, each of which contains approximately τ unique target words.…”
Section: Very Large Target Vocabulary Extension
confidence: 99%
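A rough sketch of the partitioning step described in the quoted passage follows. The function name, the greedy grouping, and the sentence format are assumptions for illustration, not the exact procedure of Jean et al. (2015).

```python
def partition_by_vocab(sentences, tau):
    """Greedily group sentences so that each partition covers roughly
    at most tau unique target words (illustrative sketch only)."""
    partitions, current, vocab = [], [], set()
    for sent in sentences:
        new_words = set(sent) - vocab
        if current and len(vocab) + len(new_words) > tau:
            partitions.append((current, sorted(vocab)))
            current, vocab = [], set()
            new_words = set(sent)
        current.append(sent)
        vocab |= new_words
    if current:
        partitions.append((current, sorted(vocab)))
    return partitions

# Each partition's softmax is then restricted to its own small vocabulary.
parts = partition_by_vocab([[1, 2, 3], [2, 3, 4], [7, 8, 9, 10]], tau=5)
for sents, vocab in parts:
    print(len(sents), "sentences,", len(vocab), "unique target words")
```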
“…Model            Relation of w, c    Representation of c
Skip-gram [18]   c predicts w        one of c
CBOW [18]        c predicts w        average
Order            c predicts w        concatenation
LBL [22]         c predicts w        compositionality
NNLM [2]         c predicts w        compositionality
C&W [3]          scores w, c         compositionality

Table 1: A summary of the investigated models, including how they model the relationship between the target word w and its context c, and how the models use the embeddings of the context words to represent the context.

There are still few works that offer fair comparisons among existing word embedding algorithms.…”
Section: Model
confidence: 99%
“…In contrast, the Order model (Section 2.1.5) uses the concatenation of the context words' embeddings, which maintains the word order information. Furthermore, the LBL [22], NNLM [2] and C&W models add a hidden layer to the Order model. Thus, these models use the semantic compositionality [10] of the context words as the context representation.…”
Section: Model
confidence: 99%
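As an aside, the three context representations contrasted in these excerpts (average, concatenation, and a hidden-layer composition) can be sketched in a few lines. The weight matrix H and the tanh nonlinearity below are illustrative stand-ins; LBL in particular composes the context linearly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_ctx = 8, 4                        # embedding size, context length
E = rng.normal(size=(100, d))          # toy embedding table
ctx = [5, 17, 42, 3]                   # context word ids

# CBOW-style: order-insensitive average of the context embeddings.
avg = E[ctx].mean(axis=0)              # shape (d,)

# "Order"-style: concatenation keeps word-order information.
concat = E[ctx].reshape(-1)            # shape (n_ctx * d,)

# LBL / NNLM / C&W-style: a hidden layer composes the concatenation
# (tanh is an illustrative choice; LBL itself uses a linear combination).
H = rng.normal(scale=0.1, size=(d, n_ctx * d))
composed = np.tanh(H @ concat)         # shape (d,)

print(avg.shape, concat.shape, composed.shape)
```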