2017 IEEE International Conference on Computer Vision Workshops (ICCVW)
DOI: 10.1109/iccvw.2017.360
Multimodal Gesture Recognition Based on the ResC3D Network

Cited by 136 publications (95 citation statements)
References 35 publications
“…As a method of fusing both modalities, CorrNet should be evaluated against other fusion methods. Thus, we compared it with the averaging (ava), maximum (max), multiply [1] [13], and Canonical Correlation Analysis (CCA) [14] fusion methods using the same BN-Inception network. As shown in Table IV, CorrNet outperforms ava, max, multiply, and CCA by 0.7%, 2.9%, 0.9%, and 0.9%, respectively, on UCF101 split 1.…”
Section: Methods
confidence: 99%
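The fusion baselines named in the statement above (averaging, maximum, elementwise multiply) are all forms of late score fusion over per-modality class scores. The sketch below illustrates the idea only; the score vectors, class count, and modality names are illustrative assumptions, not values from the cited paper.

```python
import numpy as np

# Hypothetical per-class softmax scores from two modalities (e.g. RGB and
# depth); three classes chosen purely for illustration.
rgb = np.array([0.10, 0.60, 0.30])
depth = np.array([0.20, 0.50, 0.30])

def fuse(a, b, method="avg"):
    """Late score fusion of two modality score vectors."""
    if method == "avg":
        return (a + b) / 2.0
    if method == "max":
        return np.maximum(a, b)
    if method == "multiply":
        p = a * b
        return p / p.sum()  # renormalize the elementwise product
    raise ValueError(f"unknown fusion method: {method}")

for m in ("avg", "max", "multiply"):
    scores = fuse(rgb, depth, m)
    print(m, scores, "-> predicted class", int(np.argmax(scores)))
```

CCA-based fusion, by contrast, first projects both modalities into a shared correlated subspace before classification, which is why it is listed separately from these pointwise rules.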
“…Camgoz et al. [6] suggested a user-independent system based on the spatiotemporal encoding of 3D-CNNs. Miao et al. proposed ResC3D [23], a 3D-CNN architecture that combines multimodal data and exploits an attention model. Furthermore, some CNN-based models also use recurrent architectures to capture temporal information [50,8,11,52].…”
Section: Related Work
confidence: 99%
“…Similarly, Wang et al. [150] used a two-stream semantic region based CNN (SR-CNN) as an extension of Faster R-CNN [105]. The idea of using multiple independent or dependent regions for various cues, and using separate streams to encode the input, also allows the network to focus on discriminative regions such as the motion of a body part [124,87,142,153]. Typically, the regions complement each other, which provides efficient foreground extraction and localization of per-frame motion.…”
Section: Motion-based and Stream Networks
confidence: 99%
“…Figure 4: Video classification networks: (i) 3D convolution [58], (ii) 2D-convolutional LSTM over a sequence of frames [29], (iii) 3D LSTM [5], (iv) slow fusion [62], (v) two-/multi-stream CNN [122,150,87,142] and (vi) two-stream 3D-Conv network [15].…”
Section: 2D Convolution, 3D Convolution, Classification, LSTM
confidence: 99%
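The 3D convolution listed as variant (i) above slides a spatio-temporal kernel jointly over time and space, so each output value mixes appearance and motion cues. A minimal single-channel sketch of that operation, with illustrative (not paper-specific) clip and kernel sizes:

```python
import numpy as np

# Toy single-channel video clip: (time, height, width). Sizes are
# illustrative assumptions, not taken from any cited architecture.
rng = np.random.default_rng(0)
clip = rng.random((8, 16, 16))
kernel = rng.random((3, 3, 3))  # spatio-temporal kernel (t, h, w)

T, H, W = clip.shape
kt, kh, kw = kernel.shape
out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))

# "Valid" 3D convolution: every output value is a weighted sum over a
# small temporal window as well as a spatial patch.
for t in range(out.shape[0]):
    for y in range(out.shape[1]):
        for x in range(out.shape[2]):
            out[t, y, x] = np.sum(clip[t:t + kt, y:y + kh, x:x + kw] * kernel)

print(out.shape)  # (6, 14, 14): the temporal axis shrinks too
```

Because the kernel also spans the temporal axis, the output length shrinks from 8 to 6 frames, unlike a per-frame 2D convolution, which would leave the temporal dimension untouched.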