2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00126

Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition With Multimodal Training

Abstract: We present an efficient approach for leveraging the knowledge from multiple modalities in training unimodal 3D convolutional neural networks (3D-CNNs) for the task of dynamic hand gesture recognition. Instead of explicitly combining multimodal information, which is commonplace in many state-of-the-art methods, we propose a different framework in which we embed the knowledge of multiple modalities in individual networks so that each unimodal network can achieve an improved performance. In particular, we dedicat…
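The abstract's core idea can be illustrated with a minimal sketch: train one small network per modality, add a term that pulls their internal representations together, and deploy each network alone at test time. The toy one-layer "networks" and the simple L2 alignment term below are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, w):
    """Toy one-layer 'network': feature = tanh(x @ w)."""
    return np.tanh(x @ w)

# Two modalities (e.g. RGB and depth) observing the same gesture clips.
x_rgb   = rng.normal(size=(4, 8))   # batch of 4 clips, 8-dim RGB features
x_depth = rng.normal(size=(4, 8))   # same clips, depth features

w_rgb   = rng.normal(size=(8, 5))   # per-modality network weights
w_depth = rng.normal(size=(8, 5))

f_rgb   = forward(x_rgb, w_rgb)
f_depth = forward(x_depth, w_depth)

# Training-time alignment term: pulling the two unimodal representations
# together lets each network absorb knowledge from the other modality.
align_loss = np.mean((f_rgb - f_depth) ** 2)

# At test time only one branch is evaluated -- inference stays unimodal.
prediction = forward(x_rgb, w_rgb).argmax(axis=1)
```

In a real system each branch would be a full 3D-CNN and the alignment term would be combined with per-modality classification losses; the sketch only shows where multimodal knowledge enters a unimodal network.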


Cited by 150 publications (78 citation statements)
References 47 publications
“…Therefore, λ was set to 1.0 in this work. A recently proposed study reported that the value of the focal regularization parameter (similar to λ in the present study), used to regularize the networks for different modalities, could be adaptively adjusted based on the difference between the loss value of each modality and the output loss value. It can be anticipated that adaptively using a different λ for each DW image will yield improved performance.…”
Section: Discussion
confidence: 99%
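The adaptive weighting described in this citation statement can be sketched as follows: the regularization weight for each modality is derived from the gap between that modality's loss and the output loss, so a lagging modality is regularized more strongly. The ReLU-style mapping, the `beta` scale, and the b-value labels for the DW images are assumptions for illustration, not the cited method's exact formula.

```python
def adaptive_lambda(modality_loss, output_loss, beta=1.0):
    """Weight grows with how far the modality lags behind the output loss;
    a modality already at or below the output loss gets zero weight."""
    gap = modality_loss - output_loss
    return beta * max(gap, 0.0)

# Hypothetical per-DW-image losses (b-value labels are illustrative).
losses = {"b1000": 0.9, "b2000": 0.6}
output_loss = 0.5

lambdas = {name: adaptive_lambda(l, output_loss) for name, l in losses.items()}
# The worse-performing DW image (b1000) receives the larger weight.
```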
“…Color information is used by most high-performance hand detection models [27]. However, color information cannot be used in the tabletop holographic display environment.…”
Section: Proposed Gesture Interaction
confidence: 99%
“…Early fusion and late fusion are currently the most common fusion techniques for multimodal data, so we use early- and late-fusion strategies with a ResNet-18 backbone as baselines. We also compare with ResNet-18 with a channel attention mechanism (CAM), and with the state-of-the-art multimodal methods MTUT [1] and TEMT-Net [26]. MTUT is designed to improve testing performance on the hand gesture recognition task by encouraging the networks to learn a common understanding across different modalities while avoiding negative transfer.…”
Section: Comparison With Multimodal-Based Methods
confidence: 99%
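The two baseline strategies contrasted in this statement can be sketched in a few lines: early fusion concatenates the modality inputs before a single shared model, while late fusion runs independent per-modality models and averages their predictions. The tiny linear "backbones" below stand in for ResNet-18 and are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
x_rgb   = rng.normal(size=(4, 8))   # batch of 4, 8-dim RGB features
x_depth = rng.normal(size=(4, 8))   # same samples, depth features

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Early fusion: one model over the concatenated multimodal input.
w_early = rng.normal(size=(16, 5))
p_early = softmax(np.concatenate([x_rgb, x_depth], axis=1) @ w_early)

# Late fusion: independent per-modality models; predictions averaged.
w_rgb   = rng.normal(size=(8, 5))
w_depth = rng.normal(size=(8, 5))
p_late  = 0.5 * (softmax(x_rgb @ w_rgb) + softmax(x_depth @ w_depth))
```

Both variants require all modalities at test time, which is exactly what MTUT avoids by embedding cross-modal knowledge into each unimodal network during training.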
“…Multimodal machine learning aims to build models that can process, correlate, and integrate information from multiple modalities [2]. Its success has been demonstrated in a wide range of applications, e.g., human action analysis [1, 4, 37, 38], person/object localization and tracking [15, 34, 47], and image segmentation [14, 51].…”
Section: Multimodal Machine Learning
confidence: 99%