In this paper, we investigate whether recent image-text foundation models can classify animal behavior without any fine-tuning. We evaluate zero-shot approaches with two types of models: image-text contrastive models such as CLIP and multimodal LLMs such as CogVLM. Using a new large dataset of European fauna, we demonstrate that some of these models already predict behavior well, allowing the estimation of behavior-specific activity patterns almost identical to those derived from participatory science annotations.
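To illustrate the contrastive zero-shot setting, the sketch below shows how a CLIP-style model can score an image against textual behavior prompts. It is a minimal example, not the paper's pipeline: the model checkpoint, the prompt wording, the behavior label set, and the input file name are all illustrative assumptions.

```python
# Minimal zero-shot behavior classification sketch with a CLIP-style model.
# Checkpoint, behavior labels, prompts, and file path are hypothetical.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical behavior classes expressed as natural-language prompts.
behaviors = [
    "a camera trap photo of an animal foraging",
    "a camera trap photo of an animal walking",
    "a camera trap photo of an animal standing vigilant",
    "a camera trap photo of an animal resting",
]

image = Image.open("camera_trap_frame.jpg")  # hypothetical input image
inputs = processor(text=behaviors, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity logits converted to a probability over behaviors.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
print(behaviors[probs.argmax().item()], probs.max().item())
```

Because the class set is defined purely through the text prompts, new behaviors can be scored without any retraining, which is the property exploited by the zero-shot approaches compared in this work.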