Transformer-based models achieve superior accuracy in natural language processing (NLP) and are being widely deployed in production. As a popular deployment device, the graphics processing unit (GPU) commonly relies on batch processing for transformer model inference to achieve high hardware utilization. However, because the input sequence lengths of NLP tasks are generally variable and follow a heavy-tailed distribution, batch processing introduces large amounts of redundant computation and hurts practical efficiency. In this paper, we propose a unified solution that eliminates most of this redundant computation and improves performance when transformer-based model inference on GPUs handles heavy-tailed input. The unified solution comprises three strategies, targeting the self-attention module, the multilayer perceptron (MLP) module, and the entire transformer-based model, respectively. For the self-attention module, we design a fine-grained strategy that orchestrates fine-grained parallelism by indexing only the valid block matrix multiplications. For the MLP module, we adopt the common word-accumulation strategy, which packs all sequences in a batch densely. For the entire model, we design a block-organized strategy that links the fine-grained strategy with the word-accumulation strategy by organizing the data layout of the self-attention module at block granularity. Applied to eight corpora of the GLUE benchmark, our solution achieves an average latency reduction of 63.9% in the self-attention module and 28.1% in the BERT-base model.
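The padding redundancy that motivates the fine-grained strategy can be illustrated with a small sketch. The function below (the function name, the example batch, and the block size of 32 are assumptions chosen for illustration, not the paper's implementation) compares the number of attention-score blocks a padded batch computes against the number of valid blocks a heavy-tailed batch actually needs:

```python
import math

def padded_vs_valid_blocks(seq_lens, block=32):
    """Count attention-score blocks for a batch of variable-length sequences.

    Each sequence of length L needs ceil(L/block)^2 valid blocks of its
    L x L attention-score matrix, while padding every sequence to the batch
    maximum computes ceil(Lmax/block)^2 blocks per sequence.
    """
    n_max = math.ceil(max(seq_lens) / block)
    padded_blocks = len(seq_lens) * n_max * n_max
    valid_blocks = sum(math.ceil(length / block) ** 2 for length in seq_lens)
    return padded_blocks, valid_blocks

# A heavy-tailed batch: mostly short sequences plus one long outlier.
padded, valid = padded_vs_valid_blocks([12, 40, 64, 500], block=32)
print(f"padded={padded}, valid={valid}, redundant={1 - valid / padded:.1%}")
```

In such a batch, the vast majority of the padded blocks are redundant; indexing only the valid blocks skips that work entirely.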
CCS CONCEPTS
• Computing methodologies → Massively parallel algorithms; Natural language processing.