Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d15-1180
Molding CNNs for text: non-linear, non-consecutive convolutions

Abstract: The success of deep learning often derives from well-chosen operational building blocks. In this work, we revise the temporal convolution operation in CNNs to better adapt it to text processing. Instead of concatenating word representations, we appeal to tensor algebra and use low-rank n-gram tensors to directly exploit interactions between words already at the convolution stage. Moreover, we extend the n-gram convolution to non-consecutive words to recognize patterns with intervening words. Through a combinat…
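
The non-consecutive, low-rank n-gram convolution sketched in the abstract can be pictured with a short recurrence. Below is a minimal, illustrative PyTorch sketch of a decayed dynamic-programming accumulation over (possibly non-consecutive) trigrams; the factor names W1, W2, W3, the decay lam, and the output non-linearity are assumptions for illustration, not the paper's exact parameterization.

```python
import torch

def nonconsecutive_trigram_features(x, W1, W2, W3, lam=0.5):
    """x: (seq_len, emb_dim) word vectors; W1, W2, W3: (emb_dim, d) low-rank factors.
    Returns (seq_len, d) features that aggregate all trigrams (consecutive or not)
    seen up to each position, with longer gaps down-weighted by the decay lam."""
    d = W1.shape[1]
    f1 = torch.zeros(d)  # decayed sum of unigram prefixes
    f2 = torch.zeros(d)  # decayed sum of bigram prefixes
    f3 = torch.zeros(d)  # decayed sum of completed trigrams
    feats = []
    for t in range(x.shape[0]):
        u1, u2, u3 = x[t] @ W1, x[t] @ W2, x[t] @ W3
        # Update in reverse order so f3 and f2 read the previous step's values.
        f3 = lam * f3 + f2 * u3   # close a trigram at position t
        f2 = lam * f2 + f1 * u2   # extend a stored unigram into a bigram
        f1 = lam * f1 + u1        # open a new n-gram at position t
        feats.append(torch.tanh(f3))
    return torch.stack(feats)

# Example with random data:
# x = torch.randn(20, 300)
# feats = nonconsecutive_trigram_features(x, *(torch.randn(300, 50) * 0.01 for _ in range(3)))
```
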

Cited by 89 publications (72 citation statements)
References 24 publications
“…One area of research involves incorporating word-level convolutions (i.e. n-gram filters) into recurrent computation (Lei et al., 2015; Bradbury et al., 2017; Lei et al., 2017). For example, Quasi-RNN (Bradbury et al., 2017) proposes to alternate convolutions and a minimalist recurrent pooling function and achieves significant speed-up over LSTM.…”
Section: Related Work
confidence: 99%
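
As a companion to the QRNN description in the statement above, here is a minimal PyTorch sketch of the "minimalist recurrent pooling" step (f-pooling); the convolutions that would produce z and f are only indicated in comments, and the function name is illustrative rather than taken from any library.

```python
import torch

def qrnn_f_pooling(z, f):
    """z, f: (seq_len, hidden) candidate and forget-gate sequences, typically obtained
    from width-k (masked) convolutions over the word vectors, e.g.
    z = tanh(conv_z(x)) and f = sigmoid(conv_f(x)).
    The recurrence below contains no matrix multiplications, which is where the
    reported speed-up over LSTM comes from."""
    h = torch.zeros(z.shape[1])
    outputs = []
    for t in range(z.shape[0]):
        h = f[t] * h + (1.0 - f[t]) * z[t]  # gated running average of candidates
        outputs.append(h)
    return torch.stack(outputs)
```
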
“…Table 1 presents the test-set accuracies obtained by different strategies. Results in Table 1 indicate that the AGT method achieved very competitive accuracy (50.5%) when compared to the state-of-the-art results obtained by the tree-LSTM (51.0%) (Tai et al., 2015; Zhu et al., 2015) and high-order CNN approaches (51.2%) (Lei et al., 2015).…”
Section: Results
confidence: 89%
“…We learned 15 layers with 200 dimensions each, which requires us to project the 300-dimensional word vectors; we implemented this using a linear transformation, whose weight matrix and bias term are shared across all words, followed by a tanh activation. For optimization, we used Adadelta (Zeiler, 2012), with a learning rate of 0.0005, mini-batch size of 50, transform gate bias of 1, and dropout (Srivastava et al., 2014) (Lei et al., 2015). Table 1 presents the test-set accuracies obtained by different strategies.…”
Section: Results
confidence: 99%
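
The training setup in the quote above can be summarized as a small configuration sketch. The numbers below mirror the quote (15 layers, 200 dimensions, a shared 300-to-200 projection with tanh, Adadelta with learning rate 0.0005, mini-batch 50); the layer blocks and the dropout rate are placeholders, since the quote does not specify the layer internals or the dropout probability.

```python
import torch
import torch.nn as nn

emb_dim, hidden, n_layers, batch_size = 300, 200, 15, 50

# Shared projection of the 300-dim word vectors, followed by tanh (as in the quote).
shared_proj = nn.Sequential(nn.Linear(emb_dim, hidden), nn.Tanh())

# Placeholder for the 15 stacked layers: the quote only states their count and width.
layers = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(n_layers)])

dropout = nn.Dropout(p=0.5)  # dropout is mentioned but its rate is not; 0.5 is a guess

params = list(shared_proj.parameters()) + list(layers.parameters())
optimizer = torch.optim.Adadelta(params, lr=0.0005)

# "Transform gate bias of 1" presumably refers to highway-style transform gates inside
# each layer: a bias of 1 starts the gates mostly open (sigmoid(1) is about 0.73).
```
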
“…(2)_t is used as output for onward computation. Different strategies for computing λ_t were explored (Lei et al., 2015, 2016). When λ_t is a constant, or depends only on x_t, e.g., λ_t = σ(W_λ v_t + b_λ), the ith dimension of Equations 14…”
Section: More Than Two States
confidence: 99%
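
The input-dependent decay mentioned in this statement is simple enough to show directly. Below is a minimal PyTorch sketch of λ_t = σ(W_λ v_t + b_λ); the weight names and the commented-out recurrence it might plug into are illustrative, not taken from either cited paper.

```python
import torch

def input_dependent_decay(v_t, W_lam, b_lam):
    """lambda_t = sigmoid(W_lam @ v_t + b_lam): a per-dimension decay value that
    depends only on the current input v_t (one of the strategies in the quote).
    v_t: (emb_dim,); W_lam: (hidden, emb_dim); b_lam: (hidden,)."""
    return torch.sigmoid(W_lam @ v_t + b_lam)

# One illustrative way such a gate enters a decayed accumulation (cf. the sketch after
# the abstract, with the constant lam replaced by a learned, input-dependent lambda_t):
# lam_t = input_dependent_decay(v[t], W_lam, b_lam)
# f = lam_t * f + (1.0 - lam_t) * u_t   # u_t: the current input's contribution
```
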