RhyRNN: Rhythmic RNN for Recognizing Events in Long and Complex Videos

Yu, Tianshu; Li, Yikang; Li, Baoxin

doi:10.1007/978-3-030-58607-2_8

Cited by 5 publications

(6 citation statements)

References 51 publications


“…Timeception [18], VideoGraph [19] and RhyRNN [54]. Furthermore, Table 10 shows the original reported results of Timeception and VideoGraph, which are lower than our re-implemented versions in both cases. Contrary to the standard splitting rule of the Breakfast dataset, both works have used the last 15% of subjects in the dataset (8 subjects) to test their performance.…”

Section: Task Classification Results On 10 Classes Of the Breakfast D... (mentioning)

confidence: 90%

“…In the scope of activity recognition, most works [13,20,53] study short-range or trimmed videos. Our work is closest to [18,19,54], where the focus is recognizing minutes-long activities. However, unlike them, our paper is on instructional videos, and on how recognition can aid segmentation, so it relies on hierarchical activity labels (top-level task, lower-level attributes as targets for segmentation).…”

Section: Related Work (mentioning)

confidence: 99%

“…Furthermore, we applied PCA to the extracted I3D features to reduce the dimensionality of RGB and optical flow channels from 1024 to 128. We fed the same features to all competitors except [54] in Table 10 whose code is not publicly available, so we compare with their reported result on ResNet101 [16] features.…”

Section: I3d and Idt Feature Comparison In Task (mentioning)

confidence: 99%


Hierarchical Modeling for Task Recognition and Action Segmentation in Weakly-Labeled Instructional Videos

Ghoddoosian, Sayed, Athitsos

2021

Preprint


This paper focuses on task recognition and action segmentation in weakly-labeled instructional videos, where only the ordered sequence of video-level actions is available during training. We propose a two-stream framework, which exploits semantic and temporal hierarchies to recognize top-level tasks in instructional videos. Further, we present a novel top-down weakly-supervised action segmentation approach, where the predicted task is used to constrain the inference of fine-grained action sequences. Experimental results on the popular Breakfast and Cooking 2 datasets show that our two-stream hierarchical task modeling significantly outperforms existing methods in top-level task recognition for all datasets and metrics. Additionally, using our task recognition framework in the proposed top-down action segmentation approach consistently improves the state of the art, while also reducing segmentation inference time by 80-90 percent.


“…Independence or modular property serves as strong regularization or prior in some learning tasks under static setting (Wang et al, 2020; Liu et al, 2020). In the sequential case, some early attempts over RNNs emphasized implicit "independence" in the feature space between dimensions or channels (Li et al, 2018; Yu et al, 2020). As independence assumption commonly holds in vision tasks (with distinguishable objects), Pang et al (2020); Li et al (2020b) proposed video understanding schemes by decoupling the spatiotemporal patterns.…”

Section: Related Work (mentioning)

confidence: 99%

“…Latent entities may not have exact physical meanings, but learning procedures can greatly benefit from such decoupling, as this assumption can be viewed as strong regularization to the system. This assumption has been successfully incorporated in several models for learning from regularly sampled sequential data by emphasizing "independence" to some extent between channels or groups in the feature space (Li et al, 2018; Yu et al, 2020; Goyal et al, 2021; Madan et al, 2021). Another successful counterpart in parallel benefiting from this assumption is transformer (Vaswani et al, 2017) which stacks multiple layers of self-attention and point-wise feedforward networks.…”

Section: Introduction (mentioning)

confidence: 99%

Representing and Learning Complex Object Interactions

Zhou, Konidaris

Robotics: Science and Systems XII


We present a framework for representing scenarios with complex object interactions, in which a robot cannot directly interact with the object it wishes to control, but must instead do so via intermediate objects. For example, a robot learning to drive a car can only indirectly change its pose, by rotating the steering wheel. We formalize such complex interactions as chains of Markov decision processes and show how they can be learned and used for control. We describe two systems in which a robot uses learning from demonstration to achieve indirect control: playing a computer game, and using a hot water dispenser to heat a cup of water.
