A Multi-stream Bi-directional Recurrent Neural Network for Fine-Grained Action Detection

Singh, Bharat; Marks, Tim K.; Jones, M. G. K.; Tuzel, Oncel; Shao, Ming

doi:10.1109/cvpr.2016.216

Cited by 395 publications

(307 citation statements)

References 21 publications

Statement types: Supporting, Mentioning (306), Contrasting, Unclassified

“…Fully-supervised Learning Approaches: In the third category, the action segmentation task has been explored by numbers of works by developing various types of network architectures. For example, multi-stream bi-directional recurrent neural network (MSB-RNN) [62], temporal deformable residual network [footnote 1: The language signal should not be treated as supervision since the steps are not directly given, but need to be further explored in an unsupervised manner.]…”

Section: Methods For Instructional Video Analysis (mentioning)

confidence: 99%

Comprehensive Instructional Video Analysis: The COIN Dataset and Performance Evaluation

Tang

Lu

Zhou

2021

IEEE Trans. Pattern Anal. Mach. Intell.

Thanks to the substantial and explosive increase in instructional videos on the Internet, novices are able to acquire knowledge for completing various tasks. Over the past decade, growing efforts have been devoted to investigating the problem of instructional video analysis. However, most existing datasets in this area have limitations in diversity and scale, which makes them far from many real-world applications where more diverse activities occur. To address this, we present a large-scale dataset named "COIN" for COmprehensive INstructional video analysis. Organized with a hierarchical structure, the COIN dataset contains 11,827 videos of 180 tasks in 12 domains (e.g., vehicles, gadgets, etc.) related to our daily life. With a newly developed toolbox, all the videos are annotated efficiently with a series of step labels and the corresponding temporal boundaries. In order to provide a benchmark for instructional video analysis, we evaluate plenty of approaches on the COIN dataset under five different settings. Furthermore, we exploit two important characteristics (i.e., task-consistency and ordering-dependency) for localizing important steps in instructional videos. Accordingly, we propose two simple yet effective methods, which can be easily plugged into conventional proposal-based action detection models. We believe the introduction of the COIN dataset will promote future in-depth research on instructional video analysis for the community. Our dataset, annotation toolbox and source code are available at http://coin-dataset.github.io.

[Figure: example of the Domain / Task / Step hierarchy. Vehicles: Change the Car Tire {unscrew the screws, jack up the car, remove the tire, put on the tire, tighten the screws}. Household Items: Replace the Door Knob {remove the door knob, remove bolt and pin board, install new pin board, install new bolt, install new door knob}.]

“…Aside from using handcrafted features, approaches have been introduced using deep networks. Singh et al [45] introduced a multi-stream bi-directional recurrent neural network utilising both spatial and temporal information; while Lea et al [24] incorporates a spatio-temporal CNN with a constrained segmental model. In [23], the authors have introduced temporal convolutional networks (TCN) for fine grained action detection and segmentation.…”

Section: Related Work (mentioning)

confidence: 99%

“…Here the second GAN is coupled as an auxiliary network, which takes supplementary information. This supplementary information may vary across datasets; for instance we use depth information for the 50 salads dataset [47] and optical flow for MERL shopping [45] and Georgia Tech Egocentric activity [9] datasets. Both GANs aim to generate realistic action codes to fool their respective discriminators using their differing inputs, and the coupled adversarial loss can be defined as,…”

Section: Coupling Multi-model Information (mentioning)

confidence: 99%

Coupled Generative Adversarial Network for Continuous Fine-Grained Action Segmentation

Gammulle

Fernando

Denman

et al. 2019

2019 IEEE Winter Conference on Applications of Computer Vision (WACV)

We propose a novel conditional GAN (cGAN) model for continuous fine-grained human action segmentation that utilises multi-modal data and learned scene context information. The proposed approach utilises two GANs, termed Action GAN and Auxiliary GAN, where the Action GAN is trained to operate over the current RGB frame while the Auxiliary GAN utilises supplementary information such as depth or optical flow. The goal of both GANs is to generate similar 'action codes', a vector representation of the current action. To facilitate this process, a context extractor that incorporates data and recent outputs from both modes is used to extract context information to aid recognition. The result is a recurrent GAN architecture which learns a task-specific loss function from multiple feature modalities. Extensive evaluations on variants of the proposed model show the importance of utilising different information streams such as context and auxiliary information in the proposed network, and show that our model is capable of outperforming state-of-the-art methods on three widely used datasets: 50 Salads, MERL Shopping and Georgia Tech Egocentric Activities, comprising both static and dynamic camera settings.

“…RNNs, and in particular Long Short-Term Memorys (LSTMs, which are explained in detail in Sec. 3.2.1) have demonstrated potential in computer vision for analysis of dynamic systems [15,16,17,18,19]. In this study we utilize LSTMs to carefully model the growth patterns of plants.…”

Section: Introduction (mentioning)

confidence: 99%

“…RNNs (and LSTMs in particular) are able to grasp and learn long-range and complex dynamics and have recently become very popular for the task of activity recognition. More specifically, [15,16,17,18,19] used LSTM in conjunction with CNN for action and activity recognition were shown to provide a significant improvement in performance over previous studies of video data. In this paper, we treat the growth and development of plants as an action recognition problem, and use CNN for extracting discriminative features, and LSTM for encoding the growth behavior of the plants.…”

Section: Introduction (mentioning)

confidence: 99%

Deep Phenotyping: Deep Learning for Temporal Phenotype/Genotype Classification

Namin

Esmaeilzadeh

Najafi

et al. 2017

Preprint

High-resolution, high-throughput genotype-to-phenotype studies in plants are underway to accelerate breeding of climate-ready crops. Complex developmental phenotypes are observed by imaging a variety of accessions in different environmental conditions; however, extracting the genetically heritable traits is challenging. In recent years, deep learning techniques, and in particular Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs), have shown great success in visual data recognition, classification, and sequence learning tasks. In this paper, we propose a CNN-LSTM framework for plant classification of various genotypes. Here, we exploit the power of deep CNNs for joint feature and classifier learning, within an automatic phenotyping scheme for genotype classification. Further, plant growth variation over time is also important in phenotyping their dynamic behavior. This was fed into the deep learning framework using LSTMs to model these temporal cues for different plant accessions. We generated a replicated dataset of four accessions of Arabidopsis and carried out automated phenotyping experiments. The results provide evidence of the benefits of our approach over using traditional hand-crafted image analysis features and other genotype classification frameworks. We also demonstrate that temporal information further improves the performance of the phenotype classification system.
