“…In activity recognition, most works [13,20,53] study short-range or trimmed videos. Our work is closest to [18,19,54], which focus on recognizing minutes-long activities. Unlike them, however, our paper addresses instructional videos and how recognition can aid segmentation, so it relies on hierarchical activity labels (a top-level task, with lower-level attributes as targets for segmentation).…”