“…Recent studies indicate that pre-trained language models like BERT tend to exploit biases in the dataset for prediction rather than acquiring higher-level semantic understanding and reasoning (Niven and Kao, 2019; Du et al., 2021; McCoy et al., 2019a). Preliminary work on mitigating the bias of general pre-trained models includes product-of-experts (He et al., 2019; Sanh et al., 2021), reweighting (Schuster et al., 2019; Yaghoobzadeh et al., 2019; Utama et al., 2020), adversarial training (Stacey et al., 2020), and posterior regularization (Cheng et al., 2021). Recently, challenging benchmark datasets, e.g., CheckList (Ribeiro et al., 2020) and Robustness Gym (Goel et al., 2021), have been developed to facilitate the evaluation of the robustness of these models.…”
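To make the product-of-experts idea concrete, the following is a minimal PyTorch sketch, not the exact formulation of any cited paper: a frozen bias-only model and the main model are combined in log space during training, so the main model is penalized only for errors the bias model cannot already explain. The function name, tensor shapes, and toy data below are illustrative assumptions.

import torch
import torch.nn.functional as F

def product_of_experts_loss(main_logits, bias_logits, labels):
    """Product-of-experts debiasing loss (in the spirit of He et al., 2019).

    Combines the main model and a frozen bias-only model multiplicatively,
    i.e. log p_ensemble(y|x) = log p_main(y|x) + log p_bias(y|x) + const,
    and trains with cross-entropy on the renormalized ensemble. At test
    time only the main model is used.
    """
    # Sum of log-probabilities = product of probabilities (up to normalization).
    # detach() keeps gradients from flowing into the bias-only model.
    ensemble_log_probs = (F.log_softmax(main_logits, dim=-1)
                          + F.log_softmax(bias_logits.detach(), dim=-1))
    # Renormalize the combined scores and apply negative log-likelihood.
    return F.nll_loss(F.log_softmax(ensemble_log_probs, dim=-1), labels)

# Toy usage: batch of 4 examples, 3-way classification (e.g., NLI labels).
main_logits = torch.randn(4, 3, requires_grad=True)   # from the main model
bias_logits = torch.randn(4, 3)                       # from a frozen bias-only model
labels = torch.tensor([0, 2, 1, 0])
loss = product_of_experts_loss(main_logits, bias_logits, labels)
loss.backward()  # gradients reach the main model only

The key design choice is the detach on the bias logits: where the bias-only model is already confident in the gold label, the ensemble loss is small and the main model receives little gradient, which discourages it from relearning the same dataset shortcut.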