“…A number of additional evaluation approaches have been proposed, such as evaluating robustness to noise (Belinkov and Bisk, 2018; Rychalska et al., 2019) or adversarial changes (Ribeiro et al., 2018; Iyyer et al., 2018), fairness (Prabhakaran et al., 2019), logical consistency, explanations (Ribeiro et al., 2016), diagnostic datasets (Wang et al., 2019b), and interactive error analysis (Wu et al., 2019). However, these approaches focus either on individual tasks such as Question Answering or Natural Language Inference, or on a few capabilities (e.g.…”