After hearing the words Little Red Riding Hood, most humans instantly visualize a girl wearing a red hood in the woods. However, whether nonhuman primates also evoke such visual imagery from sounds remains an open question. We explored this through direct behavioral measurements in two rhesus macaques trained in a delayed crossmodal equivalence task. In each trial, they listened to a sound, such as a monkey vocalization or a word, and three seconds later selected its visual equivalent from a pool of two to four pictures appearing on a touchscreen. We propose two potential mechanisms by which the brain could solve this task: acoustic memory or visual imagery. In the first, sound representations would remain in working memory for later discrimination against the pictures on the screen. In the second, listening to sounds would evoke visual representations of the pictures that appear on the touchscreen afterward. After analyzing the monkeys’ choice accuracies and reaction times in the task, we infer that they experience visual imagery when listening to sounds. The ability of rhesus monkeys to perceive crossmodal equivalences between learned categories therefore positions them as an ideal model organism for studying high-order cognitive processes, such as semantics and conceptual thinking, at the single-neuron level.