Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video

Pigou, Lionel; Oord, Aäron van den; Dieleman, Sander; Herreweghe, Mieke Van; Dambre, Joni

doi:10.1007/s11263-016-0957-7

Cited by 213 publications (128 citation statements)

References 24 publications (31 reference statements)

Mentioning: 118

“…As often done in gesture recognition [28] and in NN-based AV speech recognition [23], we consider observations over a short interval (0.2s as in [28,23]) to capture short-term temporal dynamics. Here, a block of 5 visual frames are grouped together (675-dim vector), which corresponds to 20 audio frames (840-dim vector).…”

Section: Multimodal Processing (mentioning)

confidence: 99%

Learning Multimodal Temporal Representation for Dubbing Detection in Broadcast Media

Odobez

2016

Proceedings of the 24th ACM International Conference on Multimedia


Person discovery in the absence of prior identity knowledge requires accurate association of visual and auditory cues. In broadcast data, multimodal analysis faces additional challenges due to narrated voices over muted scenes or dubbing in different languages. To address these challenges, we define and analyze the problem of dubbing detection in broadcast data, which has not been explored before. We propose a method to represent the temporal relationship between the auditory and visual streams. This method consists of canonical correlation analysis to learn a joint multimodal space, and long short-term memory (LSTM) networks to model cross-modality temporal dependencies. Our contributions also include the introduction of a newly acquired dataset of face-speech segments from TV data, which we have made publicly available. The proposed method achieves promising performance on this real-world dataset as compared to several baselines.
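The citation statement above groups 5 visual frames into a 675-dimensional vector and 20 audio frames into an 840-dimensional vector over the same 0.2 s window. A minimal sketch of that kind of block stacking follows, in plain Python. The per-frame sizes (135 and 42) are inferred only by dividing the quoted totals by the block lengths, and the function name `stack_frames`, the non-overlapping blocking, and the dummy inputs are assumptions for illustration, not the cited authors' implementation.

```python
def stack_frames(frames, block):
    """Concatenate each run of `block` consecutive frames into one flat vector.

    Blocks are non-overlapping (an assumption); trailing frames that do not
    fill a complete block are dropped.
    """
    n_blocks = len(frames) // block
    return [
        [value for frame in frames[i * block:(i + 1) * block] for value in frame]
        for i in range(n_blocks)
    ]


# Dummy features: per-frame sizes are inferred from 675 / 5 and 840 / 20.
visual = [[0.0] * 135 for _ in range(50)]   # 50 visual frames
audio = [[0.0] * 42 for _ in range(200)]    # 200 audio frames

visual_blocks = stack_frames(visual, 5)     # 5 visual frames per block
audio_blocks = stack_frames(audio, 20)      # 20 audio frames per block

print(len(visual_blocks), len(visual_blocks[0]))
print(len(audio_blocks), len(audio_blocks[0]))
```

With these inputs both streams yield the same number of blocks, so each visual block can be paired with the audio block covering the same interval.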


“…The emergence of modern deep learning methods [30,13,34] has removed the need for such tailored representations and enabled systems to implicitly learn both the spatial and the temporal features. However, the disadvantage of deep learning is that it can be difficult to encode expert knowledge (such as suitable subunits or intermediate representations).…”

Section: Introduction (mentioning)

confidence: 99%

SubUNets: End-to-End Hand Shape and Continuous Sign Language Recognition

Camgöz, Hadfield, Koller, et al. 2017

2017 IEEE International Conference on Computer Vision (ICCV)


We propose a novel deep learning approach to solve simultaneous alignment and recognition problems (referred to as "Sequence-to-sequence" learning). We decompose the problem into a series of specialised expert systems referred to as SubUNets. The spatio-temporal relationships between these SubUNets are then modelled to solve the task, while remaining trainable end-to-end. The approach mimics human learning and educational techniques, and has a number of significant advantages. SubUNets allow us to inject domain-specific expert knowledge into the system regarding suitable intermediate representations. They also allow us to implicitly perform transfer learning between different interrelated tasks, which also allows us to exploit a wider range of more varied data sources. In our experiments we demonstrate that each of these properties serves to significantly improve the performance of the overarching recognition system, by better constraining the learning problem. The proposed techniques are demonstrated in the challenging domain of sign language recognition. We demonstrate state-of-the-art performance on hand-shape recognition (outperforming previous techniques by more than 30%). Furthermore, we are able to obtain comparable sign recognition rates to previous research, without the need for an alignment step to segment out the signs for recognition.


“…With the emergence of consumer depth cameras [5], researchers quickly incorporated depth sensors into their systems, as depth simplifies the task of human pose estimation [6]. Many state-of-the-art gesture recognition systems today use depth images as a modality or as a means of preprocessing their data before recognizing gestures [2], [7], [8].…”

Section: Introduction (mentioning)

confidence: 99%

“…Since the pioneering work of Starner and Pentland [3], Hidden Markov Models have often been used for gesture recognition [13], [14], [15]. Other graphical models such as Hidden Conditional Random Fields [16], Autoregressive Models [17] and Recurrent Neural Networks [8], [18] have also been deployed for the gesture recognition task.…”

Section: Introduction (mentioning)

confidence: 99%

“…In recent years, Convolutional Neural Network (CNN) based approaches have achieved state-of-the-art performance in gesture recognition challenges [7], [8]. In [7], Neverova et al proposed a multi-scale and multi-modal deep learning architecture to spot and recognize continuous gestures, and achieved state-of-the-art performance in the ChaLearn 2014 Gesture Recognition challenge [22].…”

Section: Introduction (mentioning)

confidence: 99%


Using Convolutional 3D Neural Networks for User-independent continuous gesture recognition

Camgöz, Hadfield, Koller, et al. 2016

2016 23rd International Conference on Pattern Recognition (ICPR)


In this paper, we propose using 3D Convolutional Neural Networks for large scale user-independent continuous gesture recognition. We have trained an end-to-end deep network for continuous gesture recognition (jointly learning both the feature representation and the classifier). The network performs three-dimensional (i.e. space-time) convolutions to extract features related to both the appearance and motion from volumes of color frames. Space-time invariance of the extracted features is encoded via pooling layers. The earlier stages of the network are partially initialized using the work of Tran et al. before being adapted to the task of gesture recognition. An earlier version of the proposed method, which was trained for 11,250 iterations, was submitted to ChaLearn 2016 Continuous Gesture Recognition Challenge and ranked 2nd with the Mean Jaccard Index Score of 0.269235. When the proposed method was further trained for 28,750 iterations, it achieved state-of-the-art performance on the same dataset, yielding a 0.314779 Mean Jaccard Index Score.
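The abstract above reports its results as a Mean Jaccard Index. As a rough illustration, the Jaccard index of a predicted and a ground-truth gesture segment is the number of frames they share divided by the number of frames covered by either. The sketch below assumes segments are given as frame ranges and averages over already-matched pairs; the function names and the pairing are hypothetical, and the challenge's exact evaluation protocol is not described in this listing.

```python
def jaccard_index(pred_frames, true_frames):
    """Intersection over union of two collections of frame indices."""
    pred, true = set(pred_frames), set(true_frames)
    union = pred | true
    if not union:
        return 0.0  # convention chosen here for two empty segments
    return len(pred & true) / len(union)


def mean_jaccard(pairs):
    """Average Jaccard index over (predicted, ground-truth) segment pairs."""
    scores = [jaccard_index(pred, true) for pred, true in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Two hypothetical gestures: one partially overlapping, one matched exactly.
pairs = [
    (range(10, 20), range(15, 25)),  # 5 shared frames out of 15 covered
    (range(0, 8), range(0, 8)),      # identical segments
]
score = mean_jaccard(pairs)
print(score)
```

A score of 1 would mean every predicted segment coincides frame for frame with its ground truth, so a prediction that is correct in class but shifted in time is still penalised.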
