Proceedings of the 15th ACM on International Conference on Multimodal Interaction 2013
DOI: 10.1145/2522848.2532593

A multi-modal gesture recognition system using audio, video, and skeletal joint data

Abstract: This paper describes the gesture recognition system developed by the Institute for Infocomm Research (I2R) for the 2013 ICMI CHALEARN Multi-modal Gesture Recognition Challenge. The proposed system adopts a multi-modal approach for detecting as well as recognizing the gestures. Automated gesture detection is performed using both audio signals and information about hand joints obtained from the Kinect sensor to segment a sample into individual gestures. Once the gestures are detected and segmented, features ex…
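The abstract describes segmenting each recording into individual gestures by combining audio activity with Kinect hand-joint motion. A minimal sketch of that idea in Python, assuming per-frame audio energy and hand-joint positions are already extracted (the thresholds and function name below are illustrative assumptions, not the authors' actual detector):

```python
import numpy as np

def segment_gestures(audio_energy, hand_positions, energy_thresh=0.1,
                     motion_thresh=0.02, min_len=10):
    """Return (start, end) frame indices of candidate gesture segments.

    audio_energy   : (T,) per-frame short-time audio energy, normalized to [0, 1]
    hand_positions : (T, 3) dominant-hand joint coordinates from the Kinect skeleton
    Thresholds are illustrative; the paper's actual detector is more involved.
    """
    # Per-frame hand speed (Euclidean displacement between consecutive frames).
    speed = np.linalg.norm(np.diff(hand_positions, axis=0), axis=1)
    speed = np.concatenate([[0.0], speed])

    # A frame is "active" if either modality indicates activity.
    active = (audio_energy > energy_thresh) | (speed > motion_thresh)

    # Group consecutive active frames into segments, dropping very short ones.
    segments, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t
        elif not a and start is not None:
            if t - start >= min_len:
                segments.append((start, t))
            start = None
    if start is not None and len(active) - start >= min_len:
        segments.append((start, len(active)))
    return segments
```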

Cited by 23 publications (16 citation statements)
References 12 publications
“…In [27], skeletal information was integrated in two ways for extracting HoG features from RGB and depth images: either from global bounding boxes containing a whole body or from regions containing an arm, a torso and a head. Similarly, [28], [29], [30] fused skeletal information with HoG features extracted from either RGB or depth, while [31] proposed a combination of a covariance descriptor representing skeletal joint data with spatio-temporal interest points extracted from RGB augmented with audio.…”
Section: Gesture Recognition (mentioning, confidence: 99%)
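The fusion described in this snippet (HoG descriptors computed on joint-centered regions of the RGB or depth image, combined with skeletal data) can be sketched roughly as follows; the crop size, joint selection, and helper names are assumptions for illustration rather than the cited papers' exact settings:

```python
import numpy as np
from skimage.feature import hog

def joint_centered_hog(frame_gray, joint_xy, crop=64):
    """HoG descriptor of a square patch centered on a projected joint position."""
    h, w = frame_gray.shape
    x, y = int(joint_xy[0]), int(joint_xy[1])
    half = crop // 2
    # Clamp the crop window to the image borders and pad with zeros if needed.
    x0, x1 = max(0, x - half), min(w, x + half)
    y0, y1 = max(0, y - half), min(h, y + half)
    patch = np.zeros((crop, crop), dtype=frame_gray.dtype)
    patch[: y1 - y0, : x1 - x0] = frame_gray[y0:y1, x0:x1]
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)

def fused_descriptor(frame_gray, joints_xy, joints_xyz):
    """Concatenate per-joint HoG descriptors with the raw skeletal coordinates."""
    hogs = [joint_centered_hog(frame_gray, j) for j in joints_xy]
    return np.concatenate(hogs + [joints_xyz.ravel()])
```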
“…Supervised learning uses ground truth data for optimizing the training process. Unsupervised learning includes extracting the invariant spatio-temporal features from the videos using independent subspace analysis (ISA), autoencoders, or some other variant network [3,39,35]. Convolutional Restricted Boltzmann Machines (RBMs) have also been used for generating feature representations of the video frames [9].…”
Section: Related Work (mentioning, confidence: 99%)
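As a rough illustration of the unsupervised route mentioned above, a small fully connected autoencoder over flattened space-time patches might look like the sketch below; the architecture and patch size are assumptions, and the ISA and convolutional RBM models used in the cited work are different approaches:

```python
import torch
import torch.nn as nn

# Flattened space-time video patches, e.g. 10 frames of 16x16 pixels each.
PATCH_DIM = 10 * 16 * 16

class PatchAutoencoder(nn.Module):
    """Tiny dense autoencoder; the bottleneck activations serve as features."""
    def __init__(self, dim=PATCH_DIM, code=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(),
                                     nn.Linear(512, code))
        self.decoder = nn.Sequential(nn.Linear(code, 512), nn.ReLU(),
                                     nn.Linear(512, dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train(patches, epochs=10, lr=1e-3):
    """patches: (N, PATCH_DIM) float tensor of unlabeled video patches."""
    model = PatchAutoencoder()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(patches), patches)  # reconstruction objective
        loss.backward()
        opt.step()
    return model
```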
“…Laptev et al [18] use non-linear SVMs for the task of recognizing daily activities of small temporal length (answer the phone, sit down/up, kiss, hug, get out of car). Similarly, the authors in [29] use SVMs on temporal and time-weighted variances, and the authors in [21] employ SVMs on RGB and depth data to recover gestures, and then apply a fusion scheme using inferred motion and audio, in a multi-modal environment. The authors in [14] have also utilized SVMs for activity feature classification, on joint orientation angles and their forward differences, while view-invariant features (normalized between-joint distances, orientations, and velocities) have been employed in [28].…”
Section: Related Work (mentioning, confidence: 99%)
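A minimal sketch of the feature/classifier pairing described in this snippet, using normalized between-joint distances and per-joint velocities fed to an SVM (the helper name and exact feature set are illustrative assumptions, not those of the cited papers):

```python
import numpy as np
from sklearn.svm import SVC

def skeleton_features(joints):
    """joints: (T, J, 3) array of 3D joint positions for one segmented gesture.

    Builds a fixed-length descriptor from normalized pairwise joint distances
    (roughly view- and scale-invariant) and mean per-joint speeds.
    """
    T, J, _ = joints.shape
    # Pairwise joint distances, averaged over the sequence.
    diffs = joints[:, :, None, :] - joints[:, None, :, :]           # (T, J, J, 3)
    dists = np.linalg.norm(diffs, axis=-1).mean(axis=0)             # (J, J)
    pairwise = dists[np.triu_indices(J, k=1)]
    pairwise = pairwise / (pairwise.max() + 1e-8)                   # body-scale normalization
    # Mean per-joint speed from forward differences between consecutive frames.
    speeds = np.linalg.norm(np.diff(joints, axis=0), axis=-1).mean(axis=0)  # (J,)
    return np.concatenate([pairwise, speeds])

# Usage (hypothetical data): one descriptor per segmented gesture, then an RBF SVM.
# X = np.stack([skeleton_features(seq) for seq in gesture_sequences])
# clf = SVC(kernel="rbf", C=10.0).fit(X, labels)
```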