“…And, as several psycholinguistic studies have demonstrated that the IC bias is not only highly reliable but also robust across different languages (Ferstl et al., 2011; Goikoetxea et al., 2008; Hartshorne et al., 2013; Bott and Solstad, 2014), it has become an intriguing domain for testing language models. Earlier studies, including those by Upadhye et al. (2020), Davis and van Schijndel (2020), Kementchedjhieva et al. (2021), and Zarrieß et al. (2022), have examined how well LLMs capture the IC coreference bias. That is, they concentrated on single-word prediction tasks and evaluated the models' ability to generate continuations of classic prompts such as examples (1) and (2), and predominantly found that LLMs display limited ability to systematically incorporate the IC coreference bias in their generations.…”