Generating Long Sequences with Sparse Transformers

Child, Rewon; Gray, Scott; Radford, Alec; Sutskever, Ilya

doi:10.48550/arxiv.1904.10509

Cited by 369 publications

(622 citation statements)

References 0 publications

Supporting

Mentioning

618

Contrasting

Unclassified

Order By: Relevance

“…There has been researches revealing that, with certain techniques regularizing the head subspace, multi-head attention can learn desired diverse representations [12,16,18]. Considering that the spatial information becomes abstract after downsampling, we intend to strengthen the spatially representational power of multi-head attention.…”

Section: Large Window Attentionmentioning

confidence: 99%

Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention

Yan¹,

Zhang²,

Wu³

2022

Preprint

View full text Add to dashboard Cite

Multi-scale representations are crucial for semantic segmentation. The community has witnessed the flourish of semantic segmentation convolutional neural networks (CNN) exploiting multi-scale contextual information. Motivated by that the vision transformer (ViT) is powerful in image classification, some semantic segmentation ViTs are recently proposed, most of them attaining impressive results but at a cost of computational economy. In this paper, we succeed in introducing multi-scale representations into semantic segmentation ViT via window attention mechanism and further improves the performance and efficiency. To this end, we introduce large window attention which allows the local window to query a larger area of context window at only a little computation overhead. By regulating the ratio of the context area to the query area, we enable the large window attention to capture the contextual information at multiple scales. Moreover, the framework of spatial pyramid pooling is adopted to collaborate with the large window attention, which presents a novel decoder named large window attention spatial pyramid pooling (LawinASPP) for semantic segmentation ViT. Our resulting ViT, Lawin Transformer, is composed of an efficient hierachical vision transformer (HVT) as encoder and a LawinASPP as decoder. The empirical results demonstrate that Lawin Transformer offers an improved efficiency compared to the existing method. Lawin Transformer further sets new state-of-the-art performance on Cityscapes (84.4% mIoU), ADE20K (56.2% mIoU) and COCO-Stuff datasets. The code will be released at https://github.com/yan-hao-tian/lawin.

show abstract

Section: Large Window Attentionmentioning

confidence: 99%

Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention

Yan¹,

Zhang²,

Wu³

2022

Preprint

View full text Add to dashboard Cite

show abstract

“…Besides, efficient transformers are proposed, which may reduce the time complexity of self-attention from quadratic to linear (or log-linear). For exam-ple, Linformer and Performer (Choromanski et al, 2020) leverage low-rank selfattention; Sparse Transformers (Child et al, 2019) and Big Bird (Zaheer et al, 2020) utilize sparse self-attention; Reformer introduces learnable attention patterns, and Synthesizer (Tay et al, 2021) introduces randomized attention patterns.…”

Section: Related Workmentioning

confidence: 99%

GateFormer: Speeding Up News Feed Recommendation with Input Gated Transformers

Zhang¹,

Liu²

2022

Preprint

View full text Add to dashboard Cite

News feed recommendation is an important web service. In recent years, pre-trained language models (PLMs) have been intensively applied to improve the recommendation quality. However, the utilization of these deep models is limited in many aspects, such as lack of explainability and being incompatible with the existing inverted index systems. Above all, the PLMs based recommenders are inefficient, as the encoding of user-side information will take huge computation costs. Although the computation can be accelerated with efficient transformers or distilled PLMs, it is still not enough to make timely recommendations for the active users, who are associated with super long news browsing histories.In this work, we tackle the efficient news recommendation problem from a distinctive perspective. Instead of relying on the entire input (i.e., the collection of news articles a user ever browsed), we argue that the user's interest can be fully captured merely with those representative keywords. Motivated by this, we propose GateFormer, where the input data is gated before feeding into transformers. The gating module is made personalized, lightweight and end-to-end learnable, such that it may perform accurate and efficient filtering of informative user input. GateFormer achieves highly impressive performances in experiments, where it notably outperforms the existing acceleration approaches in both accuracy and efficiency. We also surprisingly find that even with over 10-fold compression of the original input, GateFormer is still able to maintain onpar performances with the SOTA methods.

show abstract

“…Approximated attention methods have been proposed to tackle this problem. Sparse Transformer [17], LogSparse Transformer [18], Longformer [19], and Big Bird [20] use sparse attention mechanism. Linformer [21] and Synthesizer [22] apply low-rank projection attention.…”

Section: Related Workmentioning

confidence: 99%

Classification of Long Sequential Data using Circular Dilated Convolutional Neural Networks

Liu¹,

Khalitov²,

Tao³

et al. 2022

Preprint

View full text Add to dashboard Cite

Classification of long sequential data is an important Machine Learning task and appears in many application scenarios. Recurrent Neural Networks, Transformers, and Convolutional Neural Networks are three major techniques for learning from sequential data. Among these methods, Temporal Convolutional Networks (TCNs) which are scalable to very long sequences have achieved remarkable progress in time series regression. However, the performance of TCNs for sequence classification is not satisfactory because they use a skewed connection protocol and output classes at the last position. Such asymmetry restricts their performance for classification which depends on the whole sequence. In this work, we propose a symmetric multi-scale architecture called Circular Dilated Convolutional Neural Network (CDIL-CNN), where every position has an equal chance to receive information from other positions at the previous layers. Our model gives classification logits in all positions, and we can apply a simple ensemble learning to achieve a better decision. We have tested CDIL-CNN on various long sequential datasets. The experimental results show that our method has superior performance over many state-of-the-art approaches. The model and experiments are available at https://github.com/LeiCheng-no/CDIL-CNN.

show abstract

Generating Long Sequences with Sparse Transformers

Cited by 369 publications

References 0 publications

Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention

Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention

GateFormer: Speeding Up News Feed Recommendation with Input Gated Transformers

Classification of Long Sequential Data using Circular Dilated Convolutional Neural Networks

Contact Info

Product

Resources

About