2017
DOI: 10.1109/tcyb.2016.2582918

Benchmarking a Multimodal and Multiview and Interactive Dataset for Human Action Recognition

Abstract: Human action recognition is an active research area in both the computer vision and machine learning communities. In the past decades, the machine learning problem has evolved from the conventional single-view learning problem to cross-view learning, cross-domain learning, and multitask learning, and a large number of algorithms have been proposed in the literature. Despite the large number of action recognition datasets, most of them are designed for a subset of these four learning problems, where the comparisons …

Cited by 87 publications (47 citation statements)
References 69 publications
“…We extract latent topics from human-labeled descriptions as semantic information and introduce an interpretive loss to guide the learning towards interpretable features, which is optimized jointly with the negative log-likelihood of training descriptions. The authors of [20,21] proposed an original method for joint human action modeling and grouping, which can provide comprehensive information for video caption modeling and explicitly benefits understanding of what happens in a given video. As a video is more than a set of static images, containing not only static objects but also temporal relationships and actions, video analysis often requires more complex network architectures.…”
Section: Overview
confidence: 99%
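The statement above describes a caption model trained with the negative log-likelihood of the descriptions plus an interpretive loss over latent topics, optimized jointly. A minimal PyTorch-style sketch of such a joint objective follows; the function name, the topic_targets tensor, and the weight lambda_int are illustrative assumptions, not the cited paper's actual formulation.

import torch.nn.functional as F

def joint_caption_loss(logits, targets, features, topic_targets, lambda_int=0.1):
    # Negative log-likelihood of the ground-truth description tokens.
    nll = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    # Interpretive loss (assumed form): pull learned features toward
    # latent-topic vectors extracted from human-labeled descriptions.
    interpretive = F.mse_loss(features, topic_targets)
    # Joint optimization of both terms, as the quoted statement describes.
    return nll + lambda_int * interpretive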
“…Table 2: Comparison on the M²I dataset for the single-task scenario (learning and testing in the same view):

Method (Accuracy)            SV       FV
iDT-Tra (BoW) [21]           69.8%    65.8%
iDT-COM (BoW) [21]           76.9%    75.3%
iDT-COM (FV) [21]            80.7%    79.5%
iDT-MBH (BoW) [21]           77.2%    79.6%
SFAM-D                       71.2%    83.0%
SFAM-S                       70.1%    75.0%
SFAM-RP                      79.9%    81.8%
SFAM-AMRP                    82.2%    78.0%
SFAM-LABRP                   72.0%    83.7%
Max-Score Fusion (All)       87.6%    88.8%
Average-Score Fusion (All)   88.2%    89.1%
Multiply-Score Fusion (All)  89.4%    91.2%
…”
Section: Methods
confidence: 99%
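The three fusion rows at the bottom of Table 2 combine the class scores of the individual models by taking their maximum, average, or product. A minimal NumPy sketch of these late-fusion rules, assuming each model outputs one normalized score per class, is given below; fuse_scores is a hypothetical helper, not code from the cited work.

import numpy as np

def fuse_scores(score_list, rule="multiply"):
    # Stack per-model class-score vectors into shape [n_models, n_classes].
    scores = np.stack(score_list, axis=0)
    if rule == "max":
        fused = scores.max(axis=0)        # Max-Score Fusion
    elif rule == "average":
        fused = scores.mean(axis=0)       # Average-Score Fusion
    elif rule == "multiply":
        fused = scores.prod(axis=0)       # Multiply-Score Fusion
    else:
        raise ValueError(f"unknown fusion rule: {rule}")
    return int(fused.argmax())            # predicted class index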
“…Accuracy for the cross-view scenario:

Method              SV → FV   FV → SV
iDT-Tra [21]        43.3%     39.2%
iDT-COM [21]        70.2%     67.7%
iDT-HOG+MBH [21]    75.8%     71.8%
iDT-HOG+HOF [21]    78…

…nel Transform Kernels, end-to-end, with ConvNets from data. Experiments on two benchmark datasets have demonstrated the effectiveness of the proposed method.…”
Section: Methods
confidence: 99%
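The truncated statement above refers to learning transform kernels end-to-end with ConvNets for the cross-view setting (SV → FV and FV → SV). Since the exact formulation is cut off, the following sketch only illustrates the general idea under an assumed design: a learnable linear transform that maps source-view features toward the target-view feature space before classification.

import torch.nn as nn

class CrossViewTransfer(nn.Module):
    # Illustrative only: a learnable transform kernel W maps source-view
    # features toward the target-view space; W and the classifier are
    # trained end-to-end, in the spirit of the quoted statement.
    def __init__(self, feat_dim, n_classes):
        super().__init__()
        self.transform = nn.Linear(feat_dim, feat_dim, bias=False)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, source_feats):
        target_like = self.transform(source_feats)
        return self.classifier(target_like)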
“…In the existing literature, many datasets are collected under a single camera view [15,34] or under multiple views with overlapping observations [29,30,50]. Hence, it is hard to systematically evaluate the robustness of action recognition algorithms across similar yet different backgrounds and capture environments.…”
Section: Introduction
confidence: 99%