“…On the one hand, even non‐fine‐tuned LLMs perform well on multiple tasks designed to probe world knowledge, such as the Winograd Schema Challenge (WSC; Levesque, Davis, & Morgenstern, 2012), the Situations With Adversarial Generations benchmark (SWAG; Zellers et al., 2018), and the Choice of Plausible Alternatives test (COPA; Roemmele, Bejan, & Gordon, 2011), so much so that some authors have proposed and evaluated their use as off‐the‐shelf knowledge bases (Kassner, Dufter, & Schütze, 2021; Petroni et al., 2019; Roberts et al., 2020; Tamborrino, Pellicanò, Pannier, Voitot, & Naudin, 2020). On the other hand, studies using more fine‐grained tests have shown that world knowledge in contemporary LLMs is often brittle and depends strongly on the specific way a problem is stated (Elazar et al., 2021a; 2021b; Ettinger, 2020; Kassner & Schütze, 2020; McCoy, Pavlick, & Linzen, 2019; Niven & Kao, 2019; Pedinotti et al., 2021; Ravichander, Hovy, Suleman, Trischler, & Cheung, 2020; Ribeiro, Wu, Guestrin, & Singh, 2020). For example, some authors have noted that, once low‐level co‐occurrence statistics are properly controlled for, LLMs that were considered highly accurate on world knowledge tasks begin to perform at chance (Elazar, Zhang, Goldberg, & Roth, 2021b; Sakaguchi, Bras, Bhagavatula, & Choi, 2021), highlighting a potential discrepancy between the word‐in‐context prediction objective (which benefits from tracking surface‐level statistics) and world knowledge acquisition (which should be invariant to surface‐level statistics).…”