VDARN: Video Disentangling Attentive Relation Network for Few-Shot and Zero-Shot Action Recognition

Su, Yong; Xing, Meng; An, Simin; Peng, Weilong; Feng, Zhiyong

doi:10.1016/j.adhoc.2020.102380

Cited by 15 publications

(17 citation statements)

References 31 publications

“…The former assumes that only the labeled videos from the seen categories are available during training while the latter can use the unlabeled data of the unseen categories for model training. Specifically, in this work, we focus on inductive ZSAR [12], [15], [26], [42] and do not discuss the transductive approach [9], [32].…”

Section: Methods

mentioning

confidence: 99%

“…For example, Jain et al [14] extract objects from action videos by a pretrained object classifier and embed object names to form a visual representation. In [15], the object information and pose information are simultaneously extracted to form visual feature. However, both of them utilize the pretrained object classifier at test time, which is questionable as illustrated in the introduction.…”

Section: A Zero-shot Action Recognition

mentioning

confidence: 99%

“…For instance, in [14], a pretrained object classifier is employed to recognize objects from action videos, and embedding of object names is treated as the visual representation without considering spatio-temporal information of the actions. In [15], objects are also extracted and the embedding of their names is concatenated to the feature representing poses to form the visual feature. Since such visual feature carries some amount of semantic information, the semantic gap is expected to be narrowed.…”

mentioning

confidence: 99%

“…It has to be pointed out that both [14] and [15] require an object classifier to extract objects and word-embed their names during testing. The classifier is pretrained on a large dataset such as ImageNet [16]; this practice has raised the question of whether their methods are truly ZSAR, because the large-scale dataset used to train the object classifier likely contains images highly related to unseen action classes (see Fig.…”

mentioning

confidence: 99%

“…Specifically, objects are annotated or extracted offline from seen actions and their names are word-embedded into a vector in the visual space as privileged information (PI) in training. Unlike the methods in [14] and [15], our method does not need the object classifier during testing phase, instead it uses a hallucination network to mimic the extraction of related semantic information. The output of the hallucination network is fused with the visual feature by a cross-attention module to narrow the semantic gap and assist the mapping from visual feature to the semantic space.…”

mentioning

confidence: 99%
