Object recognition is an important human ability that relies on distinguishing between similar objects, for example, deciding which kitchen utensil(s) to use at different stages of meal preparation. Recent work describes the fine-grained organization of knowledge about manipulable objects by studying the constituent dimensions that are most relevant to human behavior, for example, vision-, manipulation-, and function-based object properties. A logical extension of this work is to ask whether these dimensions are uniquely human or can be approximated by deep learning. Here, we show that these behavioral dimensions are well predicted by CLIP-ViT, a state-of-the-art multimodal network trained on a large and diverse set of image-text pairs, and that, for the most part, its predictions generalize to previously unseen objects. Moreover, this model vastly outperforms comparison networks pre-trained on smaller, image-only datasets. These results demonstrate the impressive capacity of CLIP-ViT to approximate fine-grained human object knowledge. We discuss the possible sources of this advantage relative to the other models tested (e.g., multimodal image-text pre-training vs. image-only pre-training, dataset size, and architecture).
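
To make the prediction setup concrete, the sketch below illustrates one way such an analysis could be implemented: frozen CLIP-ViT image embeddings are used to predict a single behavioral dimension with a cross-validated ridge readout, so that each object's predicted score comes from a fit that never saw that object. The checkpoint name (openai/clip-vit-base-patch32), the choice of ridge regression, and the stand-in data are illustrative assumptions, not the exact pipeline reported here.

```python
# Illustrative sketch (not the paper's exact pipeline): predict one human
# behavioral dimension from frozen CLIP-ViT image embeddings with a
# cross-validated linear (ridge) readout. The checkpoint, the ridge
# regressor, and the random stand-in data below are assumptions.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from scipy.stats import pearsonr

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Stand-in data: replace with real object images and human dimension ratings.
rng = np.random.default_rng(0)
images = [Image.fromarray(rng.integers(0, 255, (224, 224, 3), dtype=np.uint8))
          for _ in range(60)]
dimension_scores = rng.random(60)  # e.g., ratings on a manipulation dimension

# Extract frozen image embeddings (no fine-tuning of CLIP-ViT).
with torch.no_grad():
    inputs = processor(images=images, return_tensors="pt")
    embeddings = model.get_image_features(**inputs).numpy()

# Cross-validated readout: each object's predicted score comes from a ridge
# model fitted on the remaining objects, approximating a test on unseen objects.
predicted = cross_val_predict(Ridge(alpha=1.0), embeddings, dimension_scores, cv=10)
r, _ = pearsonr(predicted, dimension_scores)
print(f"Cross-validated correlation with human ratings: r = {r:.2f}")
```

In this framing, the correlation between cross-validated predictions and human ratings quantifies how well the network's representations approximate the corresponding behavioral dimension.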