Pre-trained language models (PLMs) have demonstrated their effectiveness for a broad range of information retrieval and natural language processing tasks. As a core component of PLMs, multi-head self-attention is appealing for its ability to jointly attend to information from different positions. However, researchers have found that PLMs often exhibit fixed attention patterns regardless of the input (e.g., attending excessively to '[CLS]' or '[SEP]'), which we argue may cause them to neglect important information at other positions. In this work, we propose a simple yet effective attention guiding mechanism that improves the performance of PLMs by encouraging attention toward established goals. Specifically, we propose two kinds of attention guiding methods, i.e., attention map discrimination guiding (MDG) and attention pattern decorrelation guiding (PDG). The former explicitly encourages diversity among multiple self-attention heads so that they jointly attend to information from different representation subspaces, while the latter encourages self-attention to attend to as many different positions of the input as possible. We conduct experiments with multiple general pre-trained models (i.e., BERT, ALBERT, and RoBERTa) and domain-specific pre-trained models (i.e., BioBERT, Clinical-BERT, BlueBERT, and SciBERT) on three benchmark datasets (i.e., MultiNLI, MedNLI, and Cross-genre-IR). Extensive experimental results demonstrate that the proposed MDG and PDG bring consistent performance improvements on all datasets with high efficiency and low cost.
CCS CONCEPTS
• Information systems → Clustering and classification; Content analysis and feature selection; • Computing methodologies → Contrastive learning.
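To make the two guiding objectives concrete, the following is a minimal, illustrative sketch of how a head-diversity (MDG-style) regularizer and a position-coverage (PDG-style) regularizer could be computed from a Transformer's attention maps in PyTorch. The tensor layout, loss forms, and weighting coefficients are assumptions for illustration only and are not the paper's exact formulation.

```python
# Illustrative sketch (not the paper's exact formulation) of two attention-guiding
# regularizers: an MDG-style term that pushes the attention maps of different heads
# apart (diversity), and a PDG-style term that pushes attention to cover many input
# positions instead of collapsing onto a few tokens such as '[CLS]' or '[SEP]'.
import torch
import torch.nn.functional as F


def mdg_loss(attn):
    """Head-diversity (MDG-style) regularizer, sketch.

    attn: [batch, heads, seq_len, seq_len] attention probabilities.
    Penalizes pairwise cosine similarity between heads' attention maps.
    """
    b, h, s, _ = attn.shape
    flat = F.normalize(attn.reshape(b, h, -1), dim=-1)   # one vector per head
    sim = torch.matmul(flat, flat.transpose(1, 2))       # [b, h, h] similarities
    off_diag = sim - torch.diag_embed(torch.diagonal(sim, dim1=1, dim2=2))
    return off_diag.abs().sum(dim=(1, 2)).mean() / (h * (h - 1))


def pdg_loss(attn):
    """Position-coverage (PDG-style) regularizer, sketch.

    Encourages attention mass to spread over many key positions by maximizing
    the entropy of the per-position attention each head distributes.
    """
    received = attn.mean(dim=2)                           # [b, h, seq_len] mass per key
    received = received / received.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    entropy = -(received * (received + 1e-9).log()).sum(dim=-1)
    return -entropy.mean()                                # minimize negative entropy


# Example: combine with a task loss during fine-tuning (coefficients are assumptions).
attn = torch.softmax(torch.randn(2, 12, 16, 16), dim=-1)  # dummy attention maps
task_loss = torch.tensor(0.0)
total_loss = task_loss + 0.1 * mdg_loss(attn) + 0.1 * pdg_loss(attn)
```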