“…Generally speaking, multi-frame pose estimation approaches [9,26,29,37,46,65,66] show advantages over single-frame ones. Specifically, some works apply temporal models (e.g., GRUs [9,29,37], TCNs [46,65], and Transformers [59,69]) for feature extraction, so that the pose estimators receive temporally continuous inputs. Other methods employ smoothness regularizers or loss functions [26,41,43,54,57,67] to enforce temporal consistency across successive frames.…”
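
As an illustration of what a smoothness term across successive frames can look like, the sketch below penalizes finite differences of joint coordinates over time. It is a generic example, not the formulation of any of the cited works; the function name `smoothness_loss`, the `order` parameter, and the nested-list pose layout (frames, joints, coordinates) are all assumptions made for this example.

```python
from typing import List, Sequence

# Assumed layout: one pose is a list of joints, each joint a list of coordinates.
Pose = Sequence[Sequence[float]]


def smoothness_loss(poses: Sequence[Pose], order: int = 1) -> float:
    """Mean squared finite difference of joint coordinates over time.

    order=1 penalizes velocity (frame-to-frame change);
    order=2 penalizes acceleration (change of velocity).
    Hypothetical helper for illustration only.
    """
    if order not in (1, 2):
        raise ValueError("order must be 1 or 2")
    if len(poses) <= order:
        return 0.0  # too few frames to form a difference

    # Flatten each frame into a plain list of coordinates.
    frames: List[List[float]] = [
        [c for joint in pose for c in joint] for pose in poses
    ]

    # Apply the finite-difference operator `order` times along the time axis.
    for _ in range(order):
        frames = [
            [b - a for a, b in zip(prev, nxt)]
            for prev, nxt in zip(frames, frames[1:])
        ]

    total = sum(v * v for frame in frames for v in frame)
    count = sum(len(frame) for frame in frames)
    return total / count if count else 0.0


# One joint in 2D over three frames.
steady = [[[0.0, 0.0]], [[1.0, 1.0]], [[2.0, 2.0]]]  # constant velocity
jitter = [[[0.0, 0.0]], [[1.0, 1.0]], [[0.0, 0.0]]]  # direction reverses

steady_acc = smoothness_loss(steady, order=2)
jitter_acc = smoothness_loss(jitter, order=2)
```

Under these assumptions, the second-order term leaves constant-velocity motion unpenalized and penalizes the jittering sequence, which is why acceleration penalties are a common choice when the goal is to suppress jitter rather than motion itself. In training, a term like this would typically be added to the per-frame pose loss with a weighting factor.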