2016
DOI: 10.1007/s11263-016-0957-7

Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video

Abstract: Recent studies have demonstrated the power of recurrent neural networks for machine translation, image captioning and speech recognition. For the task of capturing temporal structure in video, however, there still remain numerous open research questions. Current research suggests using a simple temporal feature pooling strategy to take into account the temporal aspect of video. We demonstrate that this method is not sufficient for gesture recognition, where temporal information is more discriminative compared …

Cited by 213 publications (128 citation statements)
References 24 publications (31 reference statements)
“…As often done in gesture recognition [28] and in NN-based AV speech recognition [23], we consider observations over a short interval (0.2s as in [28,23]) to capture short-term temporal dynamics. Here, a block of 5 visual frames are grouped together (675-dim vector), which corresponds to 20 audio frames (840-dim vector).…”
Section: Multimodal Processing (mentioning)
confidence: 99%
“…The emergence of modern deep learning methods [30,13,34] has removed the need for such tailored representations and enabled systems to implicitly learn both the spatial and the temporal features. However, the disadvantage of deep learning is that it can be difficult to encode expert knowledge (such as suitable subunits or intermediate representations).…”
Section: Introduction (mentioning)
confidence: 99%
“…With the emergence of consumer depth cameras [5], researchers quickly incorporated depth sensors into their systems, as depth simplifies the task of human pose estimation [6]. Many state-of-the-art gesture recognition systems today use depth images as a modality or as a means of preprocessing their data before recognizing gestures [2], [7], [8].…”
Section: Introduction (mentioning)
confidence: 99%
“…Since the pioneering work of Starner and Pentland [3], Hidden Markov Models have often been used for gesture recognition [13], [14], [15]. Other graphical models such as Hidden Conditional Random Fields [16], Autoregressive Models [17] and Recurrent Neural Networks [8], [18] have also been deployed for the gesture recognition task.…”
Section: Introduction (mentioning)
confidence: 99%