2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2017
DOI: 10.1109/iros.2017.8206288

Two-stream RNN/CNN for action recognition in 3D videos

Abstract: The recognition of actions from video sequences has many applications in health monitoring, assisted living, surveillance, and smart homes. Despite advances in sensing, in particular related to 3D video, the methodologies to process the data are still subject to research. We demonstrate superior results by a system which combines recurrent neural networks with convolutional neural networks in a voting approach. The gated-recurrent-unit-based neural networks are particularly well-suited to distinguish …
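The abstract does not say how the voting between the two networks is carried out. As a rough illustration only, the sketch below shows one common form of late fusion: a weighted average of the per-class scores from each stream, followed by picking the highest-scoring class. The function name `fuse_by_voting`, the `rnn_weight` parameter, and the example scores are all hypothetical and are not taken from the paper.

```python
from typing import Sequence


def fuse_by_voting(
    rnn_scores: Sequence[float],
    cnn_scores: Sequence[float],
    rnn_weight: float = 0.5,
) -> int:
    """Return the index of the class with the highest fused score.

    Assumption: each stream outputs one score per action class, and the
    fused score is a weighted average of the two. This is a generic
    late-fusion scheme, not the paper's confirmed method.
    """
    if len(rnn_scores) != len(cnn_scores):
        raise ValueError("both streams must score the same set of classes")
    if len(rnn_scores) == 0:
        raise ValueError("at least one class score is required")
    if not 0.0 <= rnn_weight <= 1.0:
        raise ValueError("rnn_weight must lie in [0, 1]")

    cnn_weight = 1.0 - rnn_weight
    fused = [
        rnn_weight * r + cnn_weight * c
        for r, c in zip(rnn_scores, cnn_scores)
    ]
    # Pick the class whose fused score is largest (first one wins on ties).
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical scores for three action classes.
rnn = [0.6, 0.3, 0.1]
cnn = [0.2, 0.7, 0.1]
equal_vote = fuse_by_voting(rnn, cnn)
rnn_heavy_vote = fuse_by_voting(rnn, cnn, rnn_weight=0.9)
```

With equal weights the CNN stream's strong preference for the second class wins; weighting the RNN stream heavily flips the decision to the first class, which shows how the weight controls which stream dominates.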

Cited by 86 publications (57 citation statements)
References 64 publications (65 reference statements)
“…Usually, a first stage of convolutional layers extract features from the raw data and generate high-level representations in deeper layers, then a second stage of recurrent layers uses the features yielded by the convolutional layers to learn time dependencies. Examples of applications are action recognition in videos (Sainath et al, 2015; Donahue et al, 2017; Zhao et al, 2017a) and speech recognition (Zhao et al, 2017b).…”
Section: Introduction (mentioning)
confidence: 99%
“…In such architectures, spatial information is extracted through CNNs and is then passed to recurrent networks for learning the temporal characteristics of each interaction class [6,27]. Zhao et al [170] proposed an approach based on the normalization of each layer of the network with batch normalization [57]. The created architecture is combined with a 3-dimensional ConvNet by using a two-stream fusion of the RNN and ConvNet, with an SVM.…”
Section: Recurrent Network (mentioning)
confidence: 99%
“…There are two methods for recognizing specific user states such as falling and tripping using the obtained joint information. First, a study on action recognition in 3D video [30] and a study on skeleton extraction [31] implemented a deep-learning model that can recognize falling. However, to recognize motions, it can be challenging to crop only the target motion and use it as an input.…”
Section: (A) (mentioning)
confidence: 99%