We present an efficient approach for leveraging the knowledge from multiple modalities in training unimodal 3D convolutional neural networks (3D-CNNs) for the task of dynamic hand gesture recognition. Instead of explicitly combining multimodal information, which is commonplace in many state-of-the-art methods, we propose a different framework in which we embed the knowledge of multiple modalities in individual networks so that each unimodal network achieves improved performance. In particular, we dedicate a separate network to each available modality and encourage the networks to collaborate so that they develop common semantics and better representations. We introduce a "spatiotemporal semantic alignment" (SSA) loss to align the content of the features from different networks. In addition, we regularize this loss with our proposed "focal regularization parameter" to avoid negative knowledge transfer. Experimental results show that our framework improves the test-time recognition accuracy of unimodal networks and achieves state-of-the-art performance on various dynamic hand gesture recognition datasets.
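The abstract does not give the form of the SSA loss or of the focal regularization parameter, so the following is only a rough sketch of one way such an alignment term could look. It assumes (these are illustrative assumptions, not details from the text) that "content" is compared through correlation matrices over spatiotemporal positions, and that the regularizer is an exponential of the classification-loss gap that is zero whenever the receiving network already performs better. The names `correlation_matrix`, `focal_weight`, `ssa_loss`, and the parameter `beta` are hypothetical.

```python
import numpy as np

def correlation_matrix(features):
    """Correlate every spatiotemporal position with every other one.

    features: array of shape (C, T, H, W) from one 3D-CNN layer.
    Returns an (N, N) matrix with N = T * H * W.
    """
    channels = features.shape[0]
    x = features.reshape(channels, -1).T           # (N, C): one vector per position
    x = x - x.mean(axis=1, keepdims=True)          # center each position's vector
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x = x / np.maximum(norms, 1e-8)                # unit length, guarded against zero
    return x @ x.T

def focal_weight(loss_student, loss_teacher, beta=2.0):
    """Assumed regularizer: positive only when the other network is doing better."""
    gap = loss_student - loss_teacher
    return float(np.expm1(beta * gap)) if gap > 0 else 0.0

def ssa_loss(feat_student, feat_teacher, loss_student, loss_teacher, beta=2.0):
    """Weighted squared Frobenius distance between the two correlation matrices."""
    rho = focal_weight(loss_student, loss_teacher, beta)
    diff = correlation_matrix(feat_student) - correlation_matrix(feat_teacher)
    return rho * float(np.sum(diff ** 2))
```

With this weighting, a network that currently has the lower classification loss receives no alignment signal from the weaker one, which is one simple way to read "avoid negative knowledge transfer".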
Exemplar-based learning, or equivalently nearest-neighbor methods, has recently gained interest from researchers in a variety of computer science domains because of the prevalence of large amounts of accessible data and storage capacity. In computer vision, these techniques have been successful in several problems such as scene recognition, shape matching, image parsing, character recognition, and object detection. Applying the concept of exemplar-based learning to the problem of color constancy seems odd at first glance: first, similar nearest-neighbor images are not usually affected by precisely similar illuminants and, second, gathering a dataset consisting of all possible real-world images, including indoor and outdoor scenes and all possible illuminant colors and intensities, is impossible. In this paper, we instead focus on surfaces in the image and address the color constancy problem by unsupervised learning of an appropriate model for each surface in the training images. We find nearest-neighbor models for each surface in a test image and estimate its illumination by comparing the statistics of pixels belonging to the nearest-neighbor surfaces and the target surface. The final illumination estimate results from combining these per-surface estimates into a single estimate. We show that the method performs very well on standard datasets compared to current color constancy algorithms, including when a model learned on one image dataset is tested on a different dataset. The proposed method has the advantage of handling multi-illuminant situations, which is not possible for most current methods since they assume the color of the illuminant is constant over the whole image.
We show a technique to handle the multiple-illuminant situation using the proposed method and test it on images with two distinct sources of illumination using a multiple-illuminant color constancy dataset. The concept proposed here is a completely new approach to the color constancy problem and provides a simple learning-based framework.
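The per-surface estimation and combination steps described above can be illustrated with a small sketch. The abstract does not say which pixel statistics are compared or how the estimates are combined, so this example assumes a diagonal (per-channel) illumination model, mean RGB as the statistic, nearest-neighbor surfaces that are already white-balanced, and a per-channel median as the combination rule. The function names `estimate_surface_illuminant` and `combine_estimates` are hypothetical.

```python
import numpy as np

def estimate_surface_illuminant(target_pixels, neighbor_pixels):
    """Estimate the illuminant color of one test surface.

    target_pixels:   (N, 3) RGB pixels of the surface in the test image.
    neighbor_pixels: (M, 3) RGB pixels of its nearest-neighbor training
                     surface, assumed here to be under a canonical white light.
    Returns a unit-length RGB illuminant estimate.
    """
    target_mean = target_pixels.mean(axis=0)
    neighbor_mean = neighbor_pixels.mean(axis=0)
    # Diagonal model: each channel is scaled independently by the illuminant.
    estimate = target_mean / np.maximum(neighbor_mean, 1e-8)
    return estimate / np.linalg.norm(estimate)

def combine_estimates(estimates):
    """Merge per-surface estimates into one, using a per-channel median
    so that a single badly matched surface has limited influence."""
    median = np.median(np.asarray(estimates), axis=0)
    return median / np.linalg.norm(median)
```

For a scene with more than one light source, the per-surface estimates could instead be kept separate or grouped by image region rather than collapsed into a single value. That is a natural reading of the multi-illuminant claim, though the exact procedure is not specified in the text.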