2021
DOI: 10.48550/arxiv.2103.02143
Preprint

Random Feature Attention

Abstract: Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their core is an attention function which models pairwise interactions between the inputs at every timestep. While attention is powerful, it does not scale efficiently to long sequences due to its quadratic time and space complexity in the sequence length. We propose RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, and explore its application in transformers. RFA can…
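To make the kernel-approximation idea concrete, below is a minimal NumPy sketch of the trick the abstract describes: random Fourier features approximate a Gaussian kernel, and (for unit-norm queries and keys) softmax attention becomes a ratio of two quantities that can be accumulated over the keys in linear time and space. This is an illustrative sketch, not the authors' implementation; the function names, the feature count, and the unit-norm assumption on queries and keys are choices made here for clarity.

```python
import numpy as np

def random_fourier_features(x, W):
    """Map x (..., d) to random Fourier features (..., 2D) such that
    phi(x)·phi(y) ≈ exp(-||x - y||^2 / 2).  W has shape (D, d) with
    rows drawn from N(0, I)."""
    proj = x @ W.T                                   # (..., D)
    D = W.shape[0]
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1) / np.sqrt(D)

def rfa_attention(Q, K, V, num_features=256, seed=0):
    """Linear-time approximation of softmax attention via random features.
    Q, K: (n, d), assumed L2-normalized so the exp(||.||^2 / 2) scaling
    factors cancel in the softmax ratio.  V: (n, d_v)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_features, Q.shape[-1]))
    phi_q = random_fourier_features(Q, W)            # (n, 2D)
    phi_k = random_fourier_features(K, W)            # (n, 2D)
    # Key/value summaries, computed once in O(n) and reused for every query.
    S = phi_k.T @ V                                  # (2D, d_v)
    z = phi_k.sum(axis=0)                            # (2D,)
    return (phi_q @ S) / (phi_q @ z)[:, None]        # (n, d_v)

# Usage: compare against exact softmax attention on unit-norm inputs.
n, d = 64, 32
rng = np.random.default_rng(1)
Q = rng.standard_normal((n, d)); Q /= np.linalg.norm(Q, axis=-1, keepdims=True)
K = rng.standard_normal((n, d)); K /= np.linalg.norm(K, axis=-1, keepdims=True)
V = rng.standard_normal((n, d))
exact = np.exp(Q @ K.T); exact /= exact.sum(-1, keepdims=True)
print(np.abs(rfa_attention(Q, K, V) - exact @ V).mean())
```

The linear cost comes from accumulating the key/value summaries S and z once and reusing them for every query, instead of materializing the n × n attention matrix; the approximation sharpens as the number of random features grows.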

Cited by 46 publications (30 citation statements)
References 40 publications

Citation statements (ordered by relevance):
“…Linformer [21] and Synthesizer [22] apply low-rank projection attention. Performer [23], Linear Transformer [24], and Random Feature Attention [25] rely on kernel approximation. Reformer [26], Routing Transformer [27], and Sinkhorn Transformer [28] follow the paradigm of re-arranging sequences.…”
Section: Related Work (mentioning; confidence: 99%)
“…The suite consists of different data types, such as images and texts. Many Transformers have been evaluated on the suite [25,50,51,52]. We compared our CDIL-CNN with other models on the following datasets:…”
Section: Long Range Arena Benchmark (mentioning; confidence: 99%)
“…This limits the capability of Transformers, e.g., learning fine-grained feature representation as required in many visual recognition tasks. Linear Transformers: Recently, there have been a number of linear/efficient variants [5,34,18,19,30,24,17] of Transformers in NLP. For example, [34] learns to shrink the length of Key and Value based on a low-rank assumption.…”
Section: Related Work (mentioning; confidence: 99%)
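To illustrate the low-rank idea attributed to [34] in this excerpt (shrinking the sequence length of Key and Value, as in Linformer-style methods), here is a minimal NumPy sketch; the function name and the projection matrices E and F are assumptions made for illustration, not code from any cited paper.

```python
import numpy as np

def low_rank_attention(Q, K, V, E, F):
    """Sketch of attention with a low-rank sequence projection:
    K and V are projected from sequence length n down to k << n,
    so the attention map is (n, k) rather than (n, n).
    Q, K, V: (n, d); E, F: (k, n) learned projection matrices."""
    K_proj = E @ K                                   # (k, d)
    V_proj = F @ V                                   # (k, d)
    scores = (Q @ K_proj.T) / np.sqrt(Q.shape[-1])   # (n, k)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V_proj                          # (n, d)
```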
“…The parameters of both models are learned jointly to maximize the likelihood of the target sentences given the corresponding source sentences from a parallel corpus. More robust and expressive neural MT systems have also been developed (Guo et al., 2020; Zhu et al., 2020; Kasai et al., 2021a,b; Lioutas and Guo, 2020; Peng et al., 2021; Tay et al., 2021; Nguyen and Salazar, 2019; Wang et al., 2019; Xiong et al., 2020b) based on the attention mechanism (Luong et al., 2015).…”
Section: Related Work (mentioning; confidence: 99%)
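For reference, the maximum-likelihood training criterion described in this excerpt is the standard sequence-to-sequence objective; the notation below (θ, x, y, D) is generic and not taken from any particular cited paper.

```latex
\theta^{\star}
  = \arg\max_{\theta} \sum_{(x,\, y) \in \mathcal{D}} \log p_{\theta}(y \mid x)
  = \arg\max_{\theta} \sum_{(x,\, y) \in \mathcal{D}} \sum_{t=1}^{|y|} \log p_{\theta}\bigl(y_t \mid y_{<t},\, x\bigr),
```

where \mathcal{D} is the parallel corpus and y_{<t} is the target prefix preceding position t.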