Animal behaviour is shaped to a large degree by internal cognitive states, but it is unknown whether these states are similar across species. To address this question, we developed a virtual reality setup in which mice and macaques engage in the same naturalistic visual foraging task. We extracted a wide range of facial features from video recordings made during the task and used them to train a Markov-Switching Linear Regression (MSLR) model. In this way, we identified, on a single-trial basis, a set of internal states that reliably predicted when the animals were going to react. At any one moment, one state was dominant, and mice transitioned through these dominant states faster than monkeys. The model could predict not only reaction time but also task outcome, supporting the behavioural relevance of the inferred states. The states characterised in this way were similar between mice and monkeys. We were thus able to flexibly and agnostically track the dynamics of several internal states, identifying general principles of naturalistic cognitive processing across species.