2017 IEEE International Conference on Computer Vision Workshops (ICCVW)
DOI: 10.1109/iccvw.2017.360
Multimodal Gesture Recognition Based on the ResC3D Network

Cited by 136 publications (95 citation statements)
References 35 publications
“…As a method of fusing both modalities, CorrNet should be evaluated against other fusion methods. Thus, we compared it with the averaging (ava), maximum (max), multiply [1] [13], and Canonical Correlation Analysis (CCA) [14] fusion methods using the same BN-Inception network. As shown in Table IV, CorrNet outperforms ava, max, multiply, and CCA by 0.7%, 2.9%, 0.9%, and 0.9%, respectively, on UCF101 split 1.…”
Section: Methods
confidence: 99%
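The fusion baselines named in the statement above (averaging, maximum, elementwise multiply) are all forms of late score fusion over per-modality class scores. The sketch below illustrates the idea only; the score vectors, class count, and modality names are illustrative assumptions, not values from the cited paper.

```python
import numpy as np

# Hypothetical per-class softmax scores from two modalities (e.g. RGB and
# depth); three classes chosen purely for illustration.
rgb = np.array([0.10, 0.60, 0.30])
depth = np.array([0.20, 0.50, 0.30])

def fuse(a, b, method="avg"):
    """Late score fusion of two modality score vectors."""
    if method == "avg":
        return (a + b) / 2.0
    if method == "max":
        return np.maximum(a, b)
    if method == "multiply":
        p = a * b
        return p / p.sum()  # renormalize the elementwise product
    raise ValueError(f"unknown fusion method: {method}")

for m in ("avg", "max", "multiply"):
    scores = fuse(rgb, depth, m)
    print(m, scores, "-> predicted class", int(np.argmax(scores)))
```

CCA-based fusion, by contrast, first projects both modalities into a shared correlated subspace before classification, which is why it is listed separately from these pointwise rules.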
“…Camgoz et al. [6] suggested a user-independent system based on the spatiotemporal encoding of 3D-CNNs. Miao et al. proposed ResC3D [23], a 3D-CNN architecture that combines multimodal data and exploits an attention model. Furthermore, some CNN-based models also use recurrent architectures to capture temporal information [50,8,11,52].…”
Section: Related Work
confidence: 99%
“…Similarly, Wang et al. [150] used a two-stream semantic region based CNN (SR-CNN) as an extension of Faster R-CNN [105]. The idea of using multiple independent or dependent regions for various cues, and using separate streams to encode the input, also allows the network to focus on discriminative regions such as the motion of a body part [124,87,142,153]. Typically, the regions complement each other, which provides efficient foreground extraction and localization of per-frame motion.…”
Section: Motion-based and Stream Networks
confidence: 99%
“…Figure 4: Video classification networks: (i) 3D convolution [58], (ii) 2D-convolutional LSTM over a sequence of frames [29], (iii) 3D LSTM [5], (iv) slow fusion [62], (v) two-/multi-stream CNN [122,150,87,142] and (vi) two-stream 3D-Conv network [15].…”
Section: 2D Convolution, 3D Convolution, Classification, LSTM
confidence: 99%
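The 3D convolution listed as variant (i) above slides a spatio-temporal kernel jointly over time and space, so each output value mixes appearance and motion cues. A minimal single-channel sketch of that operation, with illustrative (not paper-specific) clip and kernel sizes:

```python
import numpy as np

# Toy single-channel video clip: (time, height, width). Sizes are
# illustrative assumptions, not taken from any cited architecture.
rng = np.random.default_rng(0)
clip = rng.random((8, 16, 16))
kernel = rng.random((3, 3, 3))  # spatio-temporal kernel (t, h, w)

T, H, W = clip.shape
kt, kh, kw = kernel.shape
out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))

# "Valid" 3D convolution: every output value is a weighted sum over a
# small temporal window as well as a spatial patch.
for t in range(out.shape[0]):
    for y in range(out.shape[1]):
        for x in range(out.shape[2]):
            out[t, y, x] = np.sum(clip[t:t + kt, y:y + kh, x:x + kw] * kernel)

print(out.shape)  # (6, 14, 14): the temporal axis shrinks too
```

Because the kernel also spans the temporal axis, the output length shrinks from 8 to 6 frames, unlike a per-frame 2D convolution, which would leave the temporal dimension untouched.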