2017
DOI: 10.1007/s00138-017-0871-1
Scale coding bag of deep features for human attribute and action recognition

Abstract: Most approaches to human attribute and action recognition in still images are based on image representations in which multi-scale local features are pooled across scale into a single, scale-invariant encoding. Both in bag-of-words and the recently popular representations based on convolutional neural networks, local features are computed at multiple scales. However, these multi-scale convolutional features are pooled into a single scale-invariant representation. We argue that entirely scale-invariant image repre…


Cited by 26 publications (12 citation statements)
References 64 publications
“…Yan, Smith and Zhang (2017) encode CNN features of image patches generated by a region proposal algorithm with VLAD (Vector of Locally Aggregated Descriptors) and subsequently represent an image by a compact code, which the authors claim captures more fine-grained properties of the images and also contains global contextual information. Khan et al (2018) investigate approaches to scale coding within a bag of deep features framework, demonstrating that both absolute and relative scale information can be encoded in the final image representation and that relaxing the scale invariance traditionally employed in image classification can lead to significant gains in recognition performance, achieving state-of-the-art results. Nevertheless, most of these approaches concentrate on the classification task and disregard the detection task, while real-world images often contain multiple actions being performed, where action detection plays a significant role.…”
Section: Still Image Action Recognition
confidence: 99%
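To make the VLAD aggregation mentioned in the statement above concrete, here is a minimal NumPy/scikit-learn sketch. It is not Yan, Smith and Zhang's actual pipeline: the patch features are random stand-ins for CNN features of region proposals, and the k-means codebook is fit on the same toy data rather than on a separate training set.

```python
# Minimal VLAD encoding sketch (not the cited authors' exact pipeline).
# `feats` stands in for CNN features of region-proposal patches.
import numpy as np
from sklearn.cluster import KMeans

def vlad_encode(patch_features, codebook):
    """Aggregate local descriptors (N x D) into a K*D VLAD vector."""
    centers = codebook.cluster_centers_             # (K, D)
    assignments = codebook.predict(patch_features)  # nearest center per patch
    K, D = centers.shape
    vlad = np.zeros((K, D))
    for k in range(K):
        members = patch_features[assignments == k]
        if len(members):
            vlad[k] = (members - centers[k]).sum(axis=0)  # residual sum
    vlad = vlad.ravel()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))    # power normalization
    return vlad / (np.linalg.norm(vlad) + 1e-12)    # L2 normalization

# Usage with random stand-in data:
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 64))                  # 200 patches, 64-D features
codebook = KMeans(n_clusters=16, n_init=10, random_state=0).fit(feats)
compact_code = vlad_encode(feats, codebook)         # 16 * 64 = 1024-D code
```

The resulting code has fixed length K·D (codewords times descriptor dimension), which is what lets a single compact vector summarize a variable number of region proposals.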
“…Our evaluation results are obtained on the test images through the "boxless action classification" track of the publicly available competition server, and mAP is computed for evaluation. Table 2 presents a comparison with state-of-the-art approaches (Gkioxari et al (2015), Qi et al (2017) and Khan et al (2018)) on the VOC Action 2012 test set. As shown, general person detectors perform worse on this dataset; we suspect this is because humans in this dataset interact with larger objects, such as bicycles and horses, and the loss of this information proved detrimental to recognition, supporting our hypothesis that human-object interaction is fundamental for action recognition.…”
Section: PASCAL VOC 2012 Dataset
confidence: 99%
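For readers unfamiliar with the evaluation protocol, the sketch below computes mean average precision (mAP) over action classes in the way multi-label classification benchmarks typically do. It is a simplified stand-in for the VOC server's official scoring, using hypothetical labels and scores.

```python
# Simplified mAP sketch for multi-label action classification; `y_true` and
# `y_score` are hypothetical per-class labels and classifier scores, not
# VOC data, and the VOC server's exact protocol may differ in details.
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_score):
    """y_true: (N, C) binary labels; y_score: (N, C) scores. Returns mAP."""
    aps = [average_precision_score(y_true[:, c], y_score[:, c])
           for c in range(y_true.shape[1])]
    return float(np.mean(aps)), aps

rng = np.random.default_rng(0)
y_true = (rng.random((100, 10)) > 0.7).astype(int)   # 10 action classes
y_score = rng.random((100, 10)) + 0.5 * y_true       # scores correlated w/ labels
mAP, per_class = mean_average_precision(y_true, y_score)
print(f"mAP = {mAP:.3f}")
```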
“…In the scale-coded image representation, feature scales are divided into several scale subgroups S_t that partition the whole set of extracted scales [21]. The multi-scale convolutional features are then pooled using the Fisher vector (FV) encoding scheme.…”
Section: B. Scale-Coded Two-Stream Deep Representation
confidence: 99%
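The following sketch illustrates the idea behind scale coding as described in the statement above: multi-scale descriptors are partitioned into scale subgroups, each subgroup is pooled separately with a Fisher vector, and the per-subgroup codes are concatenated. It is a simplification (first-order FV statistics only), and the subgroup boundaries, descriptor dimensions, and GMM size are hypothetical rather than those of [21].

```python
# Schematic scale-coding sketch: pool each scale subgroup S_t separately,
# then concatenate. First-order-only Fisher vector; all sizes are toy values.
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Simplified first-order FV for a diagonal-covariance GMM."""
    q = gmm.predict_proba(descriptors)                     # (N, K) posteriors
    means, sigmas = gmm.means_, np.sqrt(gmm.covariances_)  # (K, D) each
    fv = []
    for k in range(gmm.n_components):
        diff = (descriptors - means[k]) / sigmas[k]        # whitened residual
        fv.append((q[:, k:k+1] * diff).sum(axis=0)
                  / (len(descriptors) * np.sqrt(gmm.weights_[k])))
    fv = np.concatenate(fv)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                 # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)               # L2 normalization

def scale_coded_encoding(descriptors, scales, subgroups, gmm):
    """Pool each scale subgroup S_t separately and concatenate the codes."""
    codes = []
    for lo, hi in subgroups:                               # S_t = [lo, hi)
        mask = (scales >= lo) & (scales < hi)
        codes.append(fisher_vector(descriptors[mask], gmm))
    return np.concatenate(codes)

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 32))                             # local descriptors
s = rng.uniform(16, 256, size=600)                         # patch scales (px)
gmm = GaussianMixture(n_components=8, covariance_type='diag',
                      random_state=0).fit(X)
code = scale_coded_encoding(X, s, [(16, 64), (64, 128), (128, 256)], gmm)
```

Because each subgroup is encoded on its own, absolute scale information survives in the final concatenated representation instead of being averaged away, which is the relaxation of scale invariance the paper argues for.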
“…As discussed earlier, deep features extracted from activations of the fully connected (FC) layers of the deep CNNs are typically used for image representation. Instead of FC layers, activations from the last convolutional layer of the deep networks have been shown to provide excellent performance in recent works [19], [20], [21]. The convolutional layers are known to be discriminative and semantically meaningful.…”
Section: Introduction
confidence: 99%
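As an illustration of using the last convolutional layer rather than FC activations, the PyTorch sketch below pulls the conv feature map from a torchvision VGG-16. The choice of backbone is an assumption for illustration, not necessarily the network used in [19], [20], [21].

```python
# Illustrative sketch: take activations from the last convolutional layer
# instead of the FC layers. VGG-16 is an arbitrary example backbone.
import torch
from torchvision import models

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).eval()
conv_trunk = vgg.features                    # conv/pool layers only, no FC

image = torch.randn(1, 3, 224, 224)          # stand-in preprocessed image
with torch.no_grad():
    fmap = conv_trunk(image)                 # (1, 512, 7, 7) conv activations

# Each of the 7x7 spatial positions yields a 512-D local descriptor that can
# feed an encoding scheme (FV, VLAD, ...), unlike the single global vector
# produced by the FC layers.
local_descriptors = fmap.flatten(2).squeeze(0).T   # (49, 512)
print(local_descriptors.shape)
```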