“…Whether our models have learned to solve tasks in robust and generalizable ways has been a topic of much recent interest. Challenging test sets have shown that many state-of-the-art NLP models struggle with compositionality (Kim and Linzen, 2020; Yu and Ettinger, 2020; White et al., 2020), and find it difficult to pass the myriad stress tests for social (May et al., 2019; Nangia et al., 2020) and/or linguistic competencies (Geiger et al., 2018; Naik et al., 2018; Glockner et al., 2018; White et al., 2018; Warstadt et al., 2019; Gauthier et al., 2020; Hossain et al., 2020; Jeretic et al., 2020; Lewis et al., 2020; Saha et al., 2020; Schuster et al., 2020; Sugawara et al., 2020). Yet, challenge sets may suffer from performance instability (Liu et al., 2019a; Rozen et al., 2019) and often lack sufficient statistical power (Card et al., 2020), suggesting that, although they may be valuable assessment tools, they are not sufficient for ensuring that our models have achieved the learning targets we set for them.…”