Findings of the Association for Computational Linguistics: EMNLP 2022
DOI: 10.18653/v1/2022.findings-emnlp.101
How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers

Cited by 4 publications (2 citation statements); references 0 publications.
“…Following recent research (Hassid et al. 2022), we first replace the dynamic self-attention matrix A_n with a constant attention matrix C_n ∈ R^{N_h × N_t × N_t}. We initialize C_n with the average of A_n over the train set, i.e., …”
Section: The Merge Module (MM)
Mentioning confidence: 99%
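To make the quoted construction concrete, here is a minimal PyTorch sketch of replacing dynamic self-attention with a constant attention matrix C_n initialized as the train-set average of the dynamic maps A_n. The names (ConstantAttention, average_attention) and the layer layout are illustrative assumptions, not code from Hassid et al. (2022) or the citing paper.

```python
# Hedged sketch (assumed, not from either paper): self-attention whose
# weights are a fixed matrix C instead of softmax(QK^T / sqrt(d)).
import torch
import torch.nn as nn


class ConstantAttention(nn.Module):
    """Attention block using a constant weight matrix C of shape
    (n_heads, seq_len, seq_len) in place of input-dependent attention."""

    def __init__(self, d_model: int, n_heads: int, seq_len: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # C is a buffer (not trained here); start uniform, then overwrite
        # it with the train-set average of the dynamic attention maps A_n.
        self.register_buffer(
            "C", torch.full((n_heads, seq_len, seq_len), 1.0 / seq_len)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        v = self.v_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Constant attention: the same mixing weights for every input.
        out = torch.einsum("hqk,bhkd->bhqd", self.C[:, :t, :t], v)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.out_proj(out)


@torch.no_grad()
def average_attention(attn_maps: list[torch.Tensor]) -> torch.Tensor:
    """Initialize C_n as the mean of dynamic attention maps A_n collected
    over the training set, each of shape (n_heads, T, T)."""
    return torch.stack(attn_maps, dim=0).mean(dim=0)
```

After collecting the per-layer maps A_n on the training data, one would copy `average_attention(...)` into the `C` buffer of the corresponding layer.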
“…Embedding resending helps to bypass the embedding-table query operation and decouples the computation between forward representation learning and next-token sampling. Besides, following recent research (Hassid et al. 2022) on the attention mechanism, we approximate self-attention with constant attention matrices and merge tensor computations in the Transformer module before inference. Nevertheless, these two strategies are challenging because: 1) PLMs are usually sensitive to input embeddings, while there are some unavoidable errors in the generated embeddings; 2) constant attention in our merge module might hurt the performance of PLMs.…”
Section: Introduction
Mentioning confidence: 99%
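The second strategy the quote mentions, merging tensor computations before inference, follows from the attention weights being constant: the whole attention block becomes a fixed linear map, so the value and output projections can be pre-folded offline. The sketch below is an assumed illustration of that idea with weights applied as V = X @ w_v and Y = O @ w_o; the exact merge used in the cited work may differ.

```python
# Hedged sketch: with constant attention C_h, the block output is
#   Y = sum_h C_h @ X @ (W_v[:, h-block] @ W_o[h-block, :]),
# so value and output projections can be merged per head before inference.
import torch


@torch.no_grad()
def merge_value_output(w_v: torch.Tensor, w_o: torch.Tensor,
                       n_heads: int) -> torch.Tensor:
    """Fold the value and output projections into one (d_model, d_model)
    matrix per head. Assumes V = X @ w_v and Y = concat_heads(O_h) @ w_o."""
    d_model = w_v.shape[0]
    d_head = d_model // n_heads
    w_v_heads = w_v.view(d_model, n_heads, d_head)   # columns grouped by head
    w_o_heads = w_o.view(n_heads, d_head, d_model)   # rows grouped by head
    # W_h = w_v[:, h-block] @ w_o[h-block, :]
    return torch.einsum("dhe,hef->hdf", w_v_heads, w_o_heads)


def constant_attention_block(x: torch.Tensor, C: torch.Tensor,
                             w_merged: torch.Tensor) -> torch.Tensor:
    """x: (batch, T, d_model), C: (n_heads, T, T),
    w_merged: (n_heads, d_model, d_model). One fused matmul at inference."""
    return torch.einsum("hqk,bkd,hdf->bqf", C, x, w_merged)
```

The merged form removes the per-token query/key projections and the softmax entirely, which is what makes pre-inference fusion of the remaining tensor computations possible.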