Reasoning about human-object interactions is a core problem in human-centric scene understanding, and detecting such relations poses a unique challenge to vision systems due to large variations in human-object configurations, multiple co-occurring relation instances, and subtle visual differences between relation categories. To address these challenges, we propose a multi-level relation detection strategy that utilizes human pose cues both to capture the global spatial configuration of relations and as an attention mechanism to dynamically zoom into relevant regions at the human part level. Specifically, we develop a multi-branch deep network to learn a pose-augmented relation representation at three semantic levels, incorporating interaction context, object features, and detailed semantic part cues. As a result, our approach is capable of generating robust predictions on fine-grained human-object interactions with interpretable outputs. Extensive experimental evaluations on public benchmarks show that our model outperforms prior methods by a considerable margin, demonstrating its efficacy in handling complex scenes. Code is available at https://github.com/bobwan1995/PMFNet.
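For concreteness, the following is a minimal sketch of a multi-branch relation head in the spirit of the description above. The module names, feature dimensions, and fusion scheme are illustrative assumptions and do not reproduce the exact PMFNet architecture.

```python
import torch
import torch.nn as nn

class MultiLevelRelationHead(nn.Module):
    """Hypothetical three-branch relation head: interaction context,
    object appearance, and pose-attended part cues."""
    def __init__(self, feat_dim=256, num_parts=17, num_relations=117):
        super().__init__()
        # Branch 1: holistic interaction context (e.g. union-box features).
        self.context_branch = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
        # Branch 2: object appearance features.
        self.object_branch = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
        # Branch 3: part-level cues, weighted by a pose-driven attention.
        self.part_branch = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
        self.pose_attention = nn.Sequential(
            nn.Linear(num_parts * 2, num_parts), nn.Sigmoid())  # from 2D keypoints
        self.classifier = nn.Linear(feat_dim * 3, num_relations)

    def forward(self, union_feat, obj_feat, part_feats, keypoints):
        # part_feats: (B, num_parts, feat_dim); keypoints: (B, num_parts, 2)
        attn = self.pose_attention(keypoints.flatten(1))      # (B, num_parts)
        parts = (part_feats * attn.unsqueeze(-1)).sum(dim=1)  # attended part cues
        fused = torch.cat([
            self.context_branch(union_feat),
            self.object_branch(obj_feat),
            self.part_branch(parts)], dim=-1)
        return self.classifier(fused)                         # relation logits
```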
The task of skeleton-based action recognition remains a core challenge in human-centred scene understanding due to the multiple granularities and large variation of human motion. Existing approaches typically employ a single neural representation for different motion patterns, which has difficulty capturing fine-grained action classes given limited training data. To address these problems, we propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification that jointly models coarse- and fine-grained skeleton motion patterns. To this end, we develop a dual-head graph network consisting of two interleaved branches, which enables us to extract features at two spatio-temporal resolutions in an effective and efficient manner. Moreover, our network utilises a cross-head communication strategy to mutually enhance the representations of both heads. We conducted extensive experiments on three large-scale datasets, namely NTU RGB+D 60, NTU RGB+D 120, and Kinetics-Skeleton, and our method achieves state-of-the-art performance on all three benchmarks, which validates its effectiveness.
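As an illustration of the dual-head design with cross-head communication, the sketch below extracts features at two temporal resolutions and exchanges them between heads. The joint count, pooling choice, and exchange operator are assumptions, and plain 1x1 convolutions stand in for the skeleton graph convolutions of the actual model.

```python
import torch
import torch.nn as nn

class DualHeadBlock(nn.Module):
    """One hypothetical dual-resolution block over skeleton features."""
    def __init__(self, channels=64):
        super().__init__()
        self.fine_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.coarse_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.down = nn.AvgPool2d(kernel_size=(2, 1))  # halve temporal length
        self.up = nn.Upsample(scale_factor=(2, 1))    # restore it for exchange
        # Cross-head communication: map each head's features into the other.
        self.fine_to_coarse = nn.Conv2d(channels, channels, kernel_size=1)
        self.coarse_to_fine = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        # x: (batch, channels, frames, joints)
        fine = torch.relu(self.fine_conv(x))                  # fine-grained head
        coarse = torch.relu(self.coarse_conv(self.down(x)))   # coarse head
        # Mutually enhance the two heads before the next block.
        fine = fine + self.coarse_to_fine(self.up(coarse))
        coarse = coarse + self.fine_to_coarse(self.down(fine))
        return fine, coarse

fine, coarse = DualHeadBlock()(torch.randn(8, 64, 32, 25))  # 32 frames, 25 joints
```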
Action Quality Assessment (AQA) is important for action understanding, and resolving the task poses unique challenges due to subtle visual differences between actions. Existing state-of-the-art methods typically rely on holistic video representations for score regression or ranking, which limits their ability to capture fine-grained intra-class variation. To overcome this limitation, we propose a temporal parsing transformer that decomposes the holistic feature into temporal part-level representations. Specifically, we utilize a set of learnable queries to represent the atomic temporal patterns of a specific action. Our decoding process converts the frame representations into a fixed number of temporally ordered part representations. To obtain the quality score, we adopt state-of-the-art contrastive regression based on the part representations. Since existing AQA datasets do not provide temporal part-level labels or partitions, we propose two novel loss functions on the cross-attention responses of the decoder: a ranking loss that enforces the temporal order of the learnable queries in cross attention, and a sparsity loss that encourages the part representations to be more discriminative. Extensive experiments show that our proposed method outperforms prior work on three public AQA benchmarks by a considerable margin.
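To make the two proposed losses concrete, the following hedged sketch computes a temporal-order ranking loss and an entropy-based sparsity loss on decoder cross-attention maps. The hinge form, the use of attention centers of mass, and the loss weights are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def temporal_ranking_loss(attn, margin=0.0):
    """attn: (B, Q, T) cross-attention over T frames for Q part queries,
    assumed ordered so that query q should attend earlier than query q+1."""
    t = torch.arange(attn.size(-1), dtype=attn.dtype, device=attn.device)
    centers = (attn * t).sum(-1) / attn.sum(-1).clamp(min=1e-6)  # (B, Q)
    # Hinge penalty whenever a later query's center precedes an earlier one's.
    gaps = centers[:, 1:] - centers[:, :-1]
    return F.relu(margin - gaps).mean()

def sparsity_loss(attn):
    # Encourage each query to focus on few frames (low attention entropy),
    # making the resulting part representations more discriminative.
    p = attn / attn.sum(-1, keepdim=True).clamp(min=1e-6)
    return -(p * p.clamp(min=1e-6).log()).sum(-1).mean()

attn = torch.softmax(torch.randn(4, 5, 96), dim=-1)  # 5 queries over 96 frames
loss = temporal_ranking_loss(attn, margin=1.0) + 0.1 * sparsity_loss(attn)
```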
Human instance segmentation is a core problem in human-centric scene understanding, and segmenting human instances poses a unique challenge to vision systems due to large intra-class variations in both appearance and shape, and complicated occlusion patterns. In this paper, we propose a new pose-aware human instance segmentation method. In contrast to previous pose-aware methods, which first predict bottom-up poses and then estimate instance segmentation on top of the predicted poses, our method integrates both top-down and bottom-up cues for each instance: it adopts detection results as human proposals and jointly estimates human pose and instance segmentation for each proposal. We develop a modular recurrent deep network that utilizes pose estimation to refine instance segmentation in an iterative manner. Our refinement modules exploit pose cues at two levels: as a coarse shape prior and as local part attention. We evaluate our approach on two public multi-person benchmarks: the OCHuman dataset and the COCOPersons dataset. The proposed method surpasses the state-of-the-art methods by 3.0 mAP on OCHuman and 6.4 mAP on COCOPersons, demonstrating the effectiveness of our approach.
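The following is a minimal sketch of one pose-guided refinement step, in which predicted keypoint heatmaps serve both as a coarse shape prior and as local part attention. The module shapes and the exact way pose enters the segmentation head are illustrative assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn

class PoseGuidedRefine(nn.Module):
    """Hypothetical recurrent module: pose estimation refines segmentation."""
    def __init__(self, channels=64, num_parts=17):
        super().__init__()
        self.pose_head = nn.Conv2d(channels, num_parts, 1)  # keypoint heatmaps
        self.seg_head = nn.Conv2d(channels + num_parts + 1, 1, 1)

    def forward(self, feat, seg_logit, iters=2):
        # feat: (B, C, H, W) proposal features; seg_logit: (B, 1, H, W)
        for _ in range(iters):
            heatmaps = self.pose_head(feat)                 # local part attention
            shape_prior = heatmaps.max(dim=1, keepdim=True).values  # coarse prior
            gated = feat * torch.sigmoid(shape_prior)       # attend to the person
            seg_logit = self.seg_head(
                torch.cat([gated, heatmaps, seg_logit], dim=1))
        return seg_logit, heatmaps
```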
In this paper we propose a novel regression-based RGB-D crowd counting method. In contrast to previous RGB-D crowd counting methods, which mainly exploit the depth cue to facilitate person/head detection, our approach adopts density map regression and is more robust to severe occlusion in densely crowded scenes. We develop a cascaded depth-aware counting network that jointly performs head segmentation and density map regression. Our network explicitly feeds the depth map into each stage so that depth cues are sufficiently exploited. This multi-task strategy allows the network to explicitly attend to foreground regions of a crowd scene and improves density regression. To generate the ground truth for head segmentation and density maps, we propose a head scale estimation method based on a basic geometric assumption and the camera projection function. Experiments on two public RGB-D crowd counting benchmarks, the ShanghaiTechRGBD and MICC datasets, show that the proposed method achieves new state-of-the-art results on both. Furthermore, our method can be easily extended to RGB datasets and achieves comparable performance on the WorldExpo'10 and UCF-QNRF datasets.
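To illustrate the geometric idea, the sketch below estimates the head scale in pixels from depth under a pinhole camera model (pixel size = focal length x physical size / depth) and uses it to set a scale-adaptive Gaussian when rendering the density-map ground truth. The focal length, assumed physical head size, and sigma parameterization are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def head_scale(depth_m, focal_px=1000.0, head_size_m=0.2):
    # Pinhole projection: pixel size = focal * physical size / depth.
    return focal_px * head_size_m / np.maximum(depth_m, 1e-3)

def density_map(shape, heads, depth):
    """heads: list of (row, col) point annotations; depth: (H, W) in metres."""
    dmap = np.zeros(shape, dtype=np.float32)
    for r, c in heads:
        scale = head_scale(depth[r, c])
        delta = np.zeros(shape, dtype=np.float32)
        delta[r, c] = 1.0
        dmap += gaussian_filter(delta, sigma=scale / 4.0)  # scale-adaptive kernel
    return dmap  # integrates to the total head count
```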