Many recent studies have treated neural LMs and contextualized word prediction models, primarily LSTM LMs (Sundermeyer et al., 2012), GPT-2 (Radford et al., 2019), and BERT (Devlin et al., 2019), as psycholinguistic subjects to be studied behaviorally (Linzen et al., 2016; Gulordava et al., 2018; Goldberg, 2019). Some have studied whether models prefer grammatical completions in subject-verb agreement contexts (Marvin and Linzen, 2018; van Schijndel et al., 2019; Goldberg, 2019; Mueller et al., 2020; Lakretz et al., 2021), as well as in filler-gap dependencies (Wilcox et al., 2018). These studies build on the approach of Linzen et al. (2016), in which a model's ability to generalize syntactically is measured by whether it chooses the correct inflection in difficult structural contexts instantiated by tokens the model has not seen together during training.
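To make this evaluation paradigm concrete, the sketch below scores a minimal pair of agreement completions with a pretrained LM, in the spirit of the approach just described. It assumes the Hugging Face transformers library and the public gpt2 checkpoint; the example sentences and the sentence_log_prob helper are illustrative assumptions, not materials from any of the cited studies.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_log_prob(sentence: str) -> float:
    """Total log-probability of a sentence under the LM (higher = preferred)."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=ids returns the mean cross-entropy over predicted tokens.
        loss = model(ids, labels=ids).loss
    # Undo the averaging; the causal shift drops one prediction position.
    return -loss.item() * (ids.size(1) - 1)

# Hypothetical minimal pair: an "attractor" noun ("cabinet") intervenes
# between the plural subject ("keys") and the verb.
grammatical   = "The keys to the cabinet are on the table."
ungrammatical = "The keys to the cabinet is on the table."

lp_good = sentence_log_prob(grammatical)
lp_bad  = sentence_log_prob(ungrammatical)
print("model prefers grammatical completion:", lp_good > lp_bad)

Comparing summed log-probabilities over whole-sentence minimal pairs is one common recipe; Linzen et al. (2016) instead compared the probabilities assigned to just the two candidate verb forms at the verb position, which isolates the agreement decision from the rest of the sentence.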