2018
DOI: 10.1109/tpami.2017.2691321

Deep Multimodal Feature Analysis for Action Recognition in RGB+D Videos

Abstract: Single modality action recognition on RGB or depth sequences has been extensively explored recently. It is generally accepted that each of these two modalities has different strengths and limitations for the task of action recognition. Therefore, analysis of the RGB+D videos can help us to better study the complementary properties of these two types of modalities and achieve higher levels of performance. In this paper, we propose a new deep autoencoder based shared-specific feature factorization network to sep…

Cited by 210 publications (120 citation statements) | References 73 publications
“…As previously shown [30], deep Conv-Deconv structures are hard to train to convergence. To improve convergence, we adopted the “U-Net” design [27], adding extra connections between the associated convolutional and de-convolutional layers.…”
Section: Methods
mentioning confidence: 85%
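A minimal sketch, in PyTorch, of the skip-connection idea this statement describes: each encoder feature map is concatenated with the matching decoder input so gradients have a shorter path. The class name ConvDeconvWithSkips and all layer sizes are assumptions for illustration, not details from [27] or [30].

import torch
import torch.nn as nn

class ConvDeconvWithSkips(nn.Module):
    # Toy Conv-Deconv network with a U-Net-style skip connection; the
    # concatenation shortens the gradient path and, in practice, eases
    # the convergence problem the statement mentions.
    def __init__(self, in_ch=3):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU())
        # 16 + 16 input channels: the decoder features plus the skipped enc1 features.
        self.dec1 = nn.ConvTranspose2d(16 + 16, in_ch, 4, stride=2, padding=1)

    def forward(self, x):
        e1 = self.enc1(x)                 # B x 16 x H/2 x W/2
        e2 = self.enc2(e1)                # B x 32 x H/4 x W/4
        d2 = self.dec2(e2)                # B x 16 x H/2 x W/2
        d2 = torch.cat([d2, e1], dim=1)   # skip connection
        return self.dec1(d2)              # B x in_ch x H x W

For a 64x64 input, ConvDeconvWithSkips()(torch.randn(1, 3, 64, 64)) returns a tensor of the same shape; without the torch.cat skip, all information would have to pass through the e2 bottleneck.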
“…Recurrent neural networks and long short-term memory (LSTM) networks have also been used to model long-range temporal associations. The ConvNet-LSTM structure was used for activity recognition with different types of input (RGB video, mobile sensor data) [16, 30, 35].…”
Section: Related Work
mentioning confidence: 99%
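A hedged PyTorch sketch of the ConvNet-LSTM pattern referenced above: a 2D CNN embeds each frame, an LSTM aggregates the per-frame embeddings over time, and a linear head classifies the sequence. The class name, layer sizes, and the 60-class output are illustrative assumptions, not details from [16, 30, 35].

import torch
import torch.nn as nn

class ConvLSTMClassifier(nn.Module):
    # Per-frame CNN embedding followed by temporal aggregation with an LSTM.
    def __init__(self, num_classes=60, feat_dim=128, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clip):                   # clip: B x T x 3 x H x W
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1))   # (B*T) x feat_dim
        feats = feats.view(b, t, -1)           # B x T x feat_dim
        _, (h_n, _) = self.lstm(feats)         # h_n: 1 x B x hidden
        return self.head(h_n[-1])              # B x num_classes

ConvLSTMClassifier()(torch.randn(2, 16, 3, 112, 112)) yields logits of shape (2, 60) for a batch of two 16-frame clips.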
“…An intuitive way to combine multimodal features is to directly concatenate them. To mine more useful information from the multimodal features and obtain better performance, researchers have proposed explicitly learning shared-specific structures among the features.…”
Section: Related Work
mentioning confidence: 99%
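The direct-concatenation baseline mentioned above amounts to a single tensor operation; a tiny PyTorch illustration, with made-up feature dimensions:

import torch

# Pre-extracted per-modality features (dimensions are made up for the example).
rgb_feat   = torch.randn(8, 512)   # B x D_rgb
depth_feat = torch.randn(8, 256)   # B x D_depth

# Direct concatenation: the baseline fusion the statement describes.
fused = torch.cat([rgb_feat, depth_feat], dim=1)   # B x (512 + 256)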
“…During encoding, the informative parts of the inputs can be merged into a more distinctive representation, which has been proven effective by many methods and applications [11, 62]. To speed up the fusion operation, we argue that “prefused” weights can be used directly as initializations for the AE network, owing to the consistency of goals, i.e., assigning labels to human actions, between the earlier steps and the fusion step. Specifically, we adopt a pretrained fully connected network together with a small data set D, which not only assigns the initial weights… In fact, the idea of adopting a fully connected network to set the initial parameters is similar in spirit to the fully connected layer of an LSTM, which transforms the initial weighting process into a single fully connected layer.…”
Section: Fusion of Heterogeneous Features by AE Network
mentioning confidence: 99%
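A speculative PyTorch sketch of the initialization scheme this statement appears to describe: pretrain a small fully connected classifier on the labelled subset D, then copy its first-layer weights into the autoencoder's encoder as the “prefused” starting point. All names and dimensions are hypothetical.

import torch
import torch.nn as nn

in_dim, code_dim, num_classes = 768, 128, 60   # hypothetical sizes

# Small fully connected classifier, assumed pretrained on the subset D.
pretrain_net = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU(),
                             nn.Linear(code_dim, num_classes))
# ... train pretrain_net on the small labelled subset D ...

autoencoder = nn.Sequential(
    nn.Linear(in_dim, code_dim), nn.ReLU(),   # encoder
    nn.Linear(code_dim, in_dim),              # decoder
)

# Reuse the pretrained first-layer weights as the encoder's initialization,
# rather than training the AE from scratch.
with torch.no_grad():
    autoencoder[0].weight.copy_(pretrain_net[0].weight)
    autoencoder[0].bias.copy_(pretrain_net[0].bias)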
“…Shahroudy et al. [7] proposed a shared-specific feature factorization network to separate the input multimodal signals into a hierarchy of components. This network achieved much higher accuracy for action recognition in RGB+D videos, but the results are still limited by the poor performance of the RGB-based features on the cross-view task.…”
Section: Introduction
mentioning confidence: 99%
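To make the shared-specific idea concrete, here is an illustrative PyTorch sketch, not the architecture of [7]: each modality is projected into a modality-specific component and a shared component, and an alignment penalty encourages the shared components of the two modalities to agree.

import torch
import torch.nn as nn

class SharedSpecificFactorization(nn.Module):
    # Each modality gets a shared and a specific linear projection; the
    # alignment term pulls the shared parts of the two modalities together.
    def __init__(self, rgb_dim=512, depth_dim=256, k=128):
        super().__init__()
        self.rgb_shared     = nn.Linear(rgb_dim, k)
        self.rgb_specific   = nn.Linear(rgb_dim, k)
        self.depth_shared   = nn.Linear(depth_dim, k)
        self.depth_specific = nn.Linear(depth_dim, k)

    def forward(self, rgb, depth):
        s_r, p_r = self.rgb_shared(rgb), self.rgb_specific(rgb)
        s_d, p_d = self.depth_shared(depth), self.depth_specific(depth)
        align = (s_r - s_d).pow(2).mean()           # shared components should agree
        fused = torch.cat([s_r + s_d, p_r, p_d], dim=1)
        return fused, align

In training, the align term would be added to the task loss, while the fused vector (3k-dimensional here) feeds the downstream action classifier.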