2017 IEEE International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2017.234
Joint Discovery of Object States and Manipulation Actions

Abstract: Many human activities involve object manipulations aiming to modify the object state. Examples of common state changes include full/empty bottle, open/closed door, and attached/detached car wheel. In this work, we seek to automatically discover the states of objects and the associated manipulation actions. Given a set of videos for a particular task, we propose a joint model that learns to identify object states and to localize state-modifying actions. Our model is formulated as a discriminative clustering cost…
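The abstract frames the problem as minimizing a discriminative clustering cost over candidate state labels. As a rough illustration of that family of objectives, here is a minimal NumPy sketch of a DIFFRAC-style cost (a ridge-regression classifier solved in closed form, scored over candidate label matrices); the function name, toy data, and regularization choice are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def diffrac_cost(X, Y, lam=1e-3):
    """DIFFRAC-style discriminative clustering cost:
    min over a linear classifier W (ridge-regularized) of
    ||Y - XW - b||^2, evaluated in closed form for a given labeling Y.
    X: (n, d) per-frame features; Y: (n, k) candidate state labels.
    Lower cost = labels better explained by a linear classifier on X.
    """
    n, d = X.shape
    Xc = X - X.mean(axis=0)            # centering absorbs the bias term b
    # Closed-form ridge 'hat' matrix: X (X^T X + n*lam*I)^{-1} X^T
    A = Xc @ np.linalg.solve(Xc.T @ Xc + n * lam * np.eye(d), Xc.T)
    B = np.eye(n) - A                   # residual projector
    return np.trace(Y.T @ B @ Y) / n

# Toy usage: compare two candidate state assignments for 6 frames
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
Y_flip = np.array([[1, 0]] * 3 + [[0, 1]] * 3, dtype=float)  # state changes mid-video
Y_alt = np.array([[1, 0], [0, 1]] * 3, dtype=float)          # states alternate
print(diffrac_cost(X, Y_flip), diffrac_cost(X, Y_alt))
```

In the paper's full model, additional constraints restrict which label matrices are admissible (e.g., the initial state precedes the manipulating action, which precedes the final state); the sketch above omits those constraints and only scores a given labeling.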

Cited by 67 publications (66 citation statements).
References 38 publications (92 reference statements).
“…Learning from instructional videos. Instructional videos are rising in popularity in the context of learning steps of complex tasks [2,16,41,42,46,68], visual-linguistic reference resolution [17,18], action segmentation in long untrimmed videos [66] and joint learning of object states and actions [3]. Related to our work, [2,30,62] also consider automatically generated transcription of narrated instructional videos as a source of supervision.…”
Section: Related Work
confidence: 99%
“…In Drawing, the model attends to specific parts of the sketch such as the head and mouth. The high activations in the Chopstick-Using task occur on the hand position (3,4), chopstick position (2) and the bean locations (1,2,3). Further qualitative results are shown in the supplementary video.…”
Section: Visualizing Performance Ranking
confidence: 83%
“…From Fig.7 we can see that the trained model is picking details that correspond to what a human would attend to. In Dough-Rolling high activations occur on holes in the dough (1,3), curved or rolled edges (4) and when using a spoon (2). High activations occur in Surgery when strain is put on the material (1, 2), with abnormal needle passes (3) and when there is loose stitching (4).…”
Section: Visualizing Performance Ranking
confidence: 99%
“…In liquid pouring sequences, the container and liquid state can be estimated from RGB inputs. Alayrac et al [6] model the interaction between actions and objects in a discrete manner. Some methods further demonstrate that the liquid amount can be estimated by combining a semantic segmentation CNN and an LSTM [34,7].…”
Section: Related Work
confidence: 99%
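The last statement points to methods that estimate liquid amount by combining a semantic segmentation CNN with an LSTM [34,7]. Below is a hedged sketch of that per-frame-encoder-plus-recurrence pattern; it is not the cited papers' architecture, and the tiny convolutional encoder is a hypothetical stand-in for a real segmentation network.

```python
import torch
import torch.nn as nn

class LiquidAmountEstimator(nn.Module):
    """Per-frame visual features -> LSTM -> liquid amount per frame.
    The toy conv encoder below is a placeholder for segmentation features."""
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # per-frame amount in [0, 1]

    def forward(self, frames):             # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)           # temporal smoothing over frames
        return torch.sigmoid(self.head(out)).squeeze(-1)  # (B, T)

# Toy usage: 2 clips of 8 frames at 64x64 -> per-frame amount estimates
model = LiquidAmountEstimator()
amounts = model(torch.randn(2, 8, 3, 64, 64))  # shape (2, 8)
```

A real system would replace the toy encoder with features from the segmentation network and train against per-frame ground-truth liquid amounts.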