Temporal action segmentation (TAS) is a critical step toward long-term video understanding. Recent studies build models on pre-extracted features rather than on raw video frames. However, we argue that such models are complicated to train and restrict the range of application scenarios. It is hard for them to segment human actions in video in real time because they can only run after the features of the full video have been extracted. Because real-time action segmentation differs from the TAS task, we define it as the streaming video real-time temporal action segmentation (SVTAS) task. In this paper, we propose a real-time, end-to-end, multi-modality model for the SVTAS task. More specifically, without access to any future information, we segment the current human action of a streaming video chunk in real time. Furthermore, the proposed model combines the feature of the previous streaming video chunk, extracted by a language model, with the current image feature, extracted by an image model, to improve the quality of real-time temporal action segmentation. To the best of our knowledge, it is the first multi-modality real-time temporal action segmentation model. Under the same evaluation criteria as full-video temporal action segmentation, our model segments human actions in real time with less than 40% of the computation of the state-of-the-art model and achieves 90% of the accuracy of the full-video state-of-the-art model. Code is available at https://github.com/Thinksky5124/SVTAS.git.
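The streaming setting described above can be illustrated with a minimal sketch: frames arrive chunk by chunk, each chunk is labeled as soon as it is complete, and the only context carried forward is a summary of the previous chunk. This is not the SVTAS implementation. The names `iter_chunks`, `segment_stream`, and `toy_step` are hypothetical, frames are stood in for by plain floats, and the averaging and thresholding in `toy_step` are placeholders for the image model, the language model, and their fusion.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

Frame = float             # stand-in for an image frame (assumption)
Memory = Optional[float]  # stand-in for the previous chunk's feature (assumption)
Step = Callable[[List[Frame], Memory], Tuple[List[int], Memory]]


def iter_chunks(frames: Iterable[Frame], chunk_size: int) -> Iterator[List[Frame]]:
    """Yield consecutive chunks from a frame stream without reading ahead."""
    chunk: List[Frame] = []
    for frame in frames:
        chunk.append(frame)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly shorter chunk
        yield chunk


def segment_stream(frames: Iterable[Frame], chunk_size: int, step: Step) -> List[int]:
    """Label every frame using only the current chunk and the carried memory."""
    memory: Memory = None
    labels: List[int] = []
    for chunk in iter_chunks(frames, chunk_size):
        chunk_labels, memory = step(chunk, memory)
        labels.extend(chunk_labels)
    return labels


def toy_step(chunk: List[Frame], memory: Memory) -> Tuple[List[int], Memory]:
    """Hypothetical per-chunk model: fuse current and previous chunk summaries."""
    current = sum(chunk) / len(chunk)  # placeholder for the image feature
    fused = current if memory is None else 0.5 * (current + memory)
    label = 1 if fused > 0.5 else 0    # placeholder classifier
    return [label] * len(chunk), current  # current summary becomes the next memory


if __name__ == "__main__":
    stream = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
    print(segment_stream(stream, chunk_size=2, step=toy_step))
```

Because `segment_stream` never looks past the current chunk, its per-chunk cost does not depend on the total video length, which is the property that distinguishes the streaming task from full-video TAS.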
Temporal action segmentation (TAS) has achieved great success in many fields, such as exercise rehabilitation and movie editing. Currently, task-driven TAS is a central topic in human action analysis. However, motion-centered TAS, although an important topic, has received little research attention because suitable datasets are unavailable. To explore more models and practical applications of motion-centered TAS, we introduce the Motion-Centered Figure Skating (MCFS) dataset in this paper. Compared with existing temporal action segmentation datasets, the MCFS dataset is semantically fine-grained, specialized, and motion-centered. In addition, RGB-based and skeleton-based features are provided in the MCFS dataset. Experimental results show that existing state-of-the-art methods struggle to achieve excellent segmentation results (in terms of accuracy, edit score, and F1 score) on the MCFS dataset. This indicates that MCFS is a challenging dataset for motion-centered TAS. The latest dataset can be downloaded at https://shenglanliu.github.io/mcfs-dataset/.
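Two of the metrics named above can be sketched as follows, assuming the definitions commonly used in the TAS literature: frame-wise accuracy as the percentage of correctly labeled frames, and edit score as a normalized Levenshtein distance between the predicted and ground-truth segment sequences. The abstract does not spell out these formulas, so the function names and the label strings in the example are illustrative assumptions. The segmental F1 score is omitted here.

```python
from itertools import groupby
from typing import List, Sequence


def framewise_accuracy(pred: Sequence[str], gt: Sequence[str]) -> float:
    """Percentage of frames whose predicted label matches the ground truth."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of frames")
    if not gt:
        return 0.0
    correct = sum(p == g for p, g in zip(pred, gt))
    return 100.0 * correct / len(gt)


def to_segments(labels: Sequence[str]) -> List[str]:
    """Collapse runs of identical frame labels into an ordered segment list."""
    return [label for label, _ in groupby(labels)]


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution
        prev = curr
    return prev[-1]


def edit_score(pred: Sequence[str], gt: Sequence[str]) -> float:
    """Segment-level edit score in [0, 100]; ignores segment durations."""
    p, g = to_segments(pred), to_segments(gt)
    longest = max(len(p), len(g))
    if longest == 0:
        return 100.0
    return 100.0 * (1.0 - levenshtein(p, g) / longest)


if __name__ == "__main__":
    gt = ["spin"] * 3 + ["jump"] * 2 + ["spin"] * 3
    pred = ["spin"] * 4 + ["jump"] * 1 + ["spin"] * 3
    print(framewise_accuracy(pred, gt), edit_score(pred, gt))
```

The two metrics capture different failures: a prediction with slightly shifted boundaries loses frame-wise accuracy but keeps a perfect edit score, while a prediction that misses a short segment entirely can keep high accuracy and still be penalized by the edit score.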