2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00126

Improving the Performance of Unimodal Dynamic Hand-Gesture Recognition With Multimodal Training

Abstract: We present an efficient approach for leveraging the knowledge from multiple modalities in training unimodal 3D convolutional neural networks (3D-CNNs) for the task of dynamic hand gesture recognition. Instead of explicitly combining multimodal information, which is commonplace in many state-of-the-art methods, we propose a different framework in which we embed the knowledge of multiple modalities in individual networks so that each unimodal network can achieve an improved performance. In particular, we dedicat…
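The abstract's core idea can be illustrated with a minimal sketch: train one small network per modality, add a term that pulls their internal representations together, and deploy each network alone at test time. The toy one-layer "networks" and the simple L2 alignment term below are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, w):
    """Toy one-layer 'network': feature = tanh(x @ w)."""
    return np.tanh(x @ w)

# Two modalities (e.g. RGB and depth) observing the same gesture clips.
x_rgb   = rng.normal(size=(4, 8))   # batch of 4 clips, 8-dim RGB features
x_depth = rng.normal(size=(4, 8))   # same clips, depth features

w_rgb   = rng.normal(size=(8, 5))   # per-modality network weights
w_depth = rng.normal(size=(8, 5))

f_rgb   = forward(x_rgb, w_rgb)
f_depth = forward(x_depth, w_depth)

# Training-time alignment term: pulling the two unimodal representations
# together lets each network absorb knowledge from the other modality.
align_loss = np.mean((f_rgb - f_depth) ** 2)

# At test time only one branch is evaluated -- inference stays unimodal.
prediction = forward(x_rgb, w_rgb).argmax(axis=1)
```

In a real system each branch would be a full 3D-CNN and the alignment term would be combined with per-modality classification losses; the sketch only shows where multimodal knowledge enters a unimodal network.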


Cited by 150 publications (78 citation statements)
References 47 publications
“…Therefore, λ was set to 1.0 in this work. A recently proposed study reported that the value of the focal regularization parameter (similar to λ in the present study), used to regularize the networks for different modalities, could be adaptively adjusted based on the difference between the loss value of each modality and the output loss value. It can be anticipated that adaptively using a different λ for each DW image will yield improved performance.…”
Section: Discussion
confidence: 99%
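The adaptive weighting described in this citation statement can be sketched as follows: the regularization weight for each modality is derived from the gap between that modality's loss and the output loss, so a lagging modality is regularized more strongly. The ReLU-style mapping, the `beta` scale, and the b-value labels for the DW images are assumptions for illustration, not the cited method's exact formula.

```python
def adaptive_lambda(modality_loss, output_loss, beta=1.0):
    """Weight grows with how far the modality lags behind the output loss;
    a modality already at or below the output loss gets zero weight."""
    gap = modality_loss - output_loss
    return beta * max(gap, 0.0)

# Hypothetical per-DW-image losses (b-value labels are illustrative).
losses = {"b1000": 0.9, "b2000": 0.6}
output_loss = 0.5

lambdas = {name: adaptive_lambda(l, output_loss) for name, l in losses.items()}
# The worse-performing DW image (b1000) receives the larger weight.
```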
“…Color information is used by most high-performance hand detection models [27]. However, color information cannot be used in the tabletop holographic display environment.…”
Section: Proposed Gesture Interaction
confidence: 99%
“…Early fusion and late fusion are currently the most common fusion techniques for multimodal data, so we use early- and late-fusion strategies with a ResNet-18 backbone as baselines. We also compare with ResNet-18 with a channel attention mechanism (CAM), and with the state-of-the-art multimodal methods MTUT [1] and TEMT-Net [26]. MTUT is designed to improve testing performance on the hand gesture recognition task by encouraging the networks to learn a common understanding across different modalities while avoiding negative transfer.…”
Section: Comparison With Multimodal-Based Methods
confidence: 99%
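The two baseline strategies contrasted in this statement can be sketched in a few lines: early fusion concatenates the modality inputs before a single shared model, while late fusion runs independent per-modality models and averages their predictions. The tiny linear "backbones" below stand in for ResNet-18 and are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
x_rgb   = rng.normal(size=(4, 8))   # batch of 4, 8-dim RGB features
x_depth = rng.normal(size=(4, 8))   # same samples, depth features

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Early fusion: one model over the concatenated multimodal input.
w_early = rng.normal(size=(16, 5))
p_early = softmax(np.concatenate([x_rgb, x_depth], axis=1) @ w_early)

# Late fusion: independent per-modality models; predictions averaged.
w_rgb   = rng.normal(size=(8, 5))
w_depth = rng.normal(size=(8, 5))
p_late  = 0.5 * (softmax(x_rgb @ w_rgb) + softmax(x_depth @ w_depth))
```

Both variants require all modalities at test time, which is exactly what MTUT avoids by embedding cross-modal knowledge into each unimodal network during training.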
“…Multimodal machine learning aims to build models that can process, correlate, and integrate information from multiple modalities [2]. Its success has been demonstrated in a wide range of applications, e.g., human action analysis [1, 4, 37, 38], person/object localization and tracking [15, 34, 47], and image segmentation [14, 51].…”
Section: Multimodal Machine Learning
confidence: 99%