2020
DOI: 10.1007/978-3-030-58621-8_20

Procedure Planning in Instructional Videos

Cited by 40 publications (59 citation statements) · References 28 publications
“…In the vision paper of [23], it was recognised that the vast amount of online resources such as videos can be used to drive ubiquitous computing. In fact, the study of [24] suggests harnessing videos to segment their frames (cooking steps) and transform them into a latent space where a Markov Decision Process algorithm is employed to learn the sequence of cooking steps; essentially a planning algorithm that can potentially narrate a cooking workflow.…”
Section: B. An Abstract IoT Cooking Workflow Should Respect (mentioning)
confidence: 99%
“…To plan in unknown environments, the agent needs to learn the environment dynamics from previous experiences. Recent model-based RL schemes have shown promise that deep networks can learn a transition model directly from low-dimensional observations and plan with the learned model [40,6,11]. A closely related method is Universal Planning Networks (UPN) [32] that learns a plannable latent space with gradient descent by minimizing an imitation loss, i.e., learned from an expert planner.…”
Section: Related Work (mentioning)
confidence: 99%
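The statement above describes planning with a dynamics model learned in a latent space. As a minimal illustrative sketch (not the cited authors' method; all names, dimensions, and the stubbed "learned" model are assumptions), one can plan by greedy one-step lookahead with a transition function `f(state, action)`, picking at each step the action whose predicted next latent state lies closest to the goal embedding:

```python
import random

random.seed(0)

DIM, N_ACTIONS, HORIZON = 4, 3, 5

# Hypothetical per-action offsets standing in for a learned dynamics model.
offsets = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(N_ACTIONS)]

def transition(state, action):
    """Stub for a learned model f(s, a) -> s': here, add a fixed offset."""
    return [s + o for s, o in zip(state, offsets[action])]

def dist(a, b):
    """Euclidean distance between two latent vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def plan(start, goal, horizon=HORIZON):
    """Greedy lookahead: at each step choose the action whose predicted
    next latent state is closest to the goal embedding."""
    state, actions = start, []
    for _ in range(horizon):
        best = min(range(N_ACTIONS),
                   key=lambda a: dist(transition(state, a), goal))
        actions.append(best)
        state = transition(state, best)
    return actions

print(plan([0.0] * DIM, [1.0] * DIM))  # a sequence of discrete action indices
```

In practice the transition model is a deep network trained on observed step sequences, and the greedy search is replaced by beam search or gradient-based planning, as in the Universal Planning Networks line of work mentioned above.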
“…In this paper, we focus on learning goal-directed actions from instructional videos. Recently, Chang et al. [6] proposed a new problem known as procedure planning in instructional videos. It requires a model to 1) plan a sequence of verb-argument actions and 2) retrieve the intermediate steps for achieving a given visual goal in real-life tasks such as making a strawberry cake (see Fig.…”
Section: Introduction (mentioning)
confidence: 99%
“…Instructional video understanding. Beyond image semantics (Yatskar et al., 2016), unlike existing tasks for learning from instructional video (Zhou et al., 2018c; Tang et al., 2019; Alayrac et al., 2016; Song et al., 2015; Sener et al., 2015; Huang et al., 2016; Sun et al., 2019b,a; Plummer et al., 2017; Palaskar et al., 2019), combining video & text information in procedures (Yagcioglu et al., 2018; Fried et al., 2020), visual-linguistic reference resolution (Huang et al., 2018, 2017), visual planning (Chang et al., 2019), joint learning of objects and actions (Richard et al., 2018; Gao et al., 2017; Damen et al., 2018b), and pretraining joint embeddings of high-level sentences with video clips (Sun et al., 2019b; Miech et al., 2019), our task proposal requires explicit structured knowledge tuple extraction.…”
Section: Related Work (mentioning)
confidence: 99%