“…For instance, step localization [59,86,106] and action segmentation [27,29,62,86] in instructional videos were widely studied in early work. With growing attention to this research topic, various tasks related to instructional videos have been proposed, e.g., video captioning [37,57,87,106], which generates a description of a video based on its actions and events; visual grounding [35,77], which locates the target in an image according to a language description; and procedure learning [2,22,27,72,73,106], which extracts key steps from videos.…”