Following previous literature (e.g., Lazaridou et al., 2017), we extracted, for each image, the 4096-dimensional vector representation in the second-to-last layer (fc7), which is believed to capture complex, abstract, gestalt-level representations of objects (LeCun, Bengio, & Hinton, 2015; Smith, Pezzelle, Franzon, Zanini, & Bernardi, 2017; Zeiler & Fergus, 2014). For each word stimulus, a unique vision-based representation was then estimated as its prototypical visual activation, operationalized as the average vector across all pictures extracted for the given word (see Günther, Petilli, & Marelli, 2020; Lazaridou et al., 2017). Items whose cosine similarity to the mean activation within a category deviated by more than 1.5 interquartile ranges were excluded from the averaging process.
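The prototype computation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `prototype_vector` and the two-sided Tukey-style fence (quartile ± 1.5 × IQR) on the cosine similarities are assumptions, since the passage specifies only that items beyond 1.5 interquartile ranges from the mean activation were dropped.

```python
import numpy as np

def prototype_vector(vectors):
    """Average fc7 vectors for one word, excluding IQR outliers.

    vectors: array of shape (n_images, n_dims), one fc7 activation
    per picture retrieved for the word (hypothetical input format).
    """
    V = np.asarray(vectors, dtype=float)
    mean = V.mean(axis=0)

    # Cosine similarity of each image vector to the category mean.
    sims = (V @ mean) / (np.linalg.norm(V, axis=1) * np.linalg.norm(mean))

    # Tukey-style fence on the similarities: keep items within
    # 1.5 interquartile ranges of the quartiles (assumed two-sided).
    q1, q3 = np.percentile(sims, [25, 75])
    iqr = q3 - q1
    keep = (sims >= q1 - 1.5 * iqr) & (sims <= q3 + 1.5 * iqr)

    # The prototype is the mean of the retained vectors only.
    return V[keep].mean(axis=0)
```

In practice this downweights mislabeled or atypical images (e.g., a cartoon among photographs), so the prototype reflects the typical visual appearance of the word's referents.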