2018
DOI: 10.1145/3131343

A Unified Framework for Multi-Modal Isolated Gesture Recognition

Abstract: In this article, we focus on isolated gesture recognition and explore different modalities by involving RGB stream, depth stream, and saliency stream for inspection. Our goal is to push the boundary of this realm even further by proposing a unified framework that exploits the advantages of multi-modality fusion. Specifically, a spatial-temporal network architecture based on consensus-voting has been proposed to explicitly model the long-term structure of the video sequence and to reduce estimation variance whe…
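The consensus-voting idea in the abstract can be illustrated with a minimal sketch: class scores from several snippets sampled across a video are averaged, so the prediction reflects the whole sequence rather than any single frame, which is what reduces estimation variance. The snippet count, class count, and the `consensus_vote` helper below are hypothetical illustrations, not the paper's actual network.

```python
import numpy as np

def consensus_vote(snippet_logits: np.ndarray) -> int:
    """Average per-snippet class scores and return the winning class.

    snippet_logits: shape (num_snippets, num_classes), one row of class
    scores per sampled video snippet (values here are illustrative).
    """
    # Softmax each snippet's logits so the vote is over probabilities.
    shifted = snippet_logits - snippet_logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    # Consensus: average the per-snippet distributions, then take argmax.
    return int(probs.mean(axis=0).argmax())

# Toy example: 5 snippets of one video, 10 gesture classes.
rng = np.random.default_rng(0)
print(consensus_vote(rng.normal(size=(5, 10))))
```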

Cited by 58 publications (27 citation statements)
References 25 publications

Citation statements (ordered by relevance):
“…Compared with the performances of the first round, the best recognition rate r obtained in round 2 improved considerably (from 56.90% to 67.71% on the test set). We notice that the new baseline [10] also achieved the second best performance. This baseline uses multiple modalities (RGB, depth, optical flow and saliency streams) and a spatio-temporal network architecture, with a consensus-voting strategy (see [10] for details). Table 2 shows a brief summary of each participant's/team's methodology.…”
Section: Results and Methods (mentioning; confidence: 82%)
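As a rough illustration of the multi-stream fusion this statement describes (not the authors' exact scheme, whose details are in [10]), a simple late fusion averages per-modality class probabilities; the equal default weights below are a placeholder assumption, not the paper's values.

```python
import numpy as np

def late_fusion(stream_probs: dict, weights: dict | None = None) -> int:
    """Fuse per-modality class probabilities by (weighted) averaging.

    stream_probs maps a modality name to a (num_classes,) probability
    vector. Equal weights are a placeholder, not the paper's values.
    """
    weights = weights or {name: 1.0 for name in stream_probs}
    total = sum(weights.values())
    fused = sum(w / total * stream_probs[name] for name, w in weights.items())
    return int(np.argmax(fused))

# Hypothetical per-stream outputs for a 10-class gesture problem.
rng = np.random.default_rng(1)
streams = {m: rng.dirichlet(np.ones(10))
           for m in ("rgb", "depth", "flow", "saliency")}
print(late_fusion(streams))
```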
“…For isolated recognition tasks such as Isolated Gesture [10] and Action Recognition [23], most datasets provide instance-level annotations, i.e., a single label for each video clip with no temporal localisation. To train deep networks using instance-level annotations, researchers [11,21,27,28] frequently assign the provided instance labels to all time steps and train neural networks using Cross Entropy Loss [17]. However, identifying every part of a sequence with the same label can cause class ambiguity, as different stages of a sequence can have different spatio-temporal features.…”
Section: Introduction (mentioning; confidence: 99%)
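The labelling convention this statement criticises, broadcasting one clip-level label to every time step and applying cross-entropy at each step, can be sketched in a few lines. The tensor shapes and the toy model outputs below are hypothetical and not tied to any of the cited works.

```python
import torch
import torch.nn.functional as F

# Hypothetical per-timestep logits from a temporal model:
# batch of 4 clips, 16 time steps, 10 gesture classes.
logits = torch.randn(4, 16, 10)
clip_labels = torch.tensor([3, 7, 1, 9])  # one instance-level label per clip

# Broadcast each clip's single label to all of its time steps ...
timestep_labels = clip_labels.unsqueeze(1).expand(-1, 16)

# ... and apply cross-entropy at every step, as described above.
loss = F.cross_entropy(logits.reshape(-1, 10), timestep_labels.reshape(-1))
print(loss.item())
```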