Transformer-based models achieve superior accuracy in natural language processing (NLP) and are being widely deployed in production. As a popular deployment device, the graphics processing unit (GPU) commonly relies on batch processing for transformer model inference to achieve high hardware utilization. However, because the input sequence lengths of NLP tasks are generally variable and follow a heavy-tailed distribution, batch processing introduces large amounts of redundant computation and hurts practical efficiency. In this paper, we propose a unified solution that eliminates most of this redundant computation and improves performance when transformer-based model inference on GPUs handles heavy-tailed input. The unified solution comprises three strategies, targeting the self-attention module, the multilayer perceptron (MLP) module, and the entire transformer-based model, respectively. For the self-attention module, we design a fine-grained strategy that orchestrates fine-grained parallelism by indexing only the valid block matrix multiplications. For the MLP module, we adopt the common word-accumulation strategy, which packs all sequences in a batch densely. For the entire model, we design a block-organized strategy that links the fine-grained strategy with the word-accumulation strategy by organizing the data layout of the self-attention module at block granularity. Applied to eight corpora of the GLUE benchmark, our solution achieves an average latency reduction of 63.9% in the self-attention module and 28.1% in the BERT-base model.
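The padding redundancy that motivates the fine-grained strategy can be illustrated with a small sketch. The function below (the function name, the example batch, and the block size of 32 are assumptions chosen for illustration, not the paper's implementation) compares the number of attention-score blocks a padded batch computes against the number of valid blocks a heavy-tailed batch actually needs:

```python
import math

def padded_vs_valid_blocks(seq_lens, block=32):
    """Count attention-score blocks for a batch of variable-length sequences.

    Each sequence of length L needs ceil(L/block)^2 valid blocks of its
    L x L attention-score matrix, while padding every sequence to the batch
    maximum computes ceil(Lmax/block)^2 blocks per sequence.
    """
    n_max = math.ceil(max(seq_lens) / block)
    padded_blocks = len(seq_lens) * n_max * n_max
    valid_blocks = sum(math.ceil(length / block) ** 2 for length in seq_lens)
    return padded_blocks, valid_blocks

# A heavy-tailed batch: mostly short sequences plus one long outlier.
padded, valid = padded_vs_valid_blocks([12, 40, 64, 500], block=32)
print(f"padded={padded}, valid={valid}, redundant={1 - valid / padded:.1%}")
```

In such a batch, the vast majority of the padded blocks are redundant; indexing only the valid blocks skips that work entirely.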
CCS CONCEPTS
• Computing methodologies → Massively parallel algorithms; Natural language processing.