Multi-view dual attention network for 3D object recognition

Wang, Wenju; Cai, Yu; Wang, Tao

doi:10.1007/s00521-021-06588-1

Cited by 19 publications

(18 citation statements)

References 39 publications

(94 reference statements)


“…All the above methods only model local pose dynamics, ignoring global body translation and inter-individual body interaction. However, learning both local and global pose dynamics and modeling fine-grained human-human interaction are essential for comprehending human behavior in a complex 3D environment [2,39].…”

Section: Single-person Pose Forecasting (mentioning)

confidence: 99%

“…Guo et al [15] present a collaborative prediction task and use a two-branch attention network for the prediction of two interacted persons. Wang et al [39] present a Transformer-based framework to forecast multi-person motion in a scenario with more people. Furthermore, this method produces unrealistic poses since they solely concen-…”

Section: Multi-person Pose Forecasting (mentioning)

confidence: 99%

“…For simplicity, we omit subscript p when p only represents an arbitrary person, e.g., taking $x^p_{1:t}$ as $x_{1:t}$. Instead of absolute joint positions in the world coordinate, we use $y_i = x_{i+1} - x_i$ to obtain instantaneous pose displacement at time i, which provides more valuable dynamics information [34,39]. The whole displacement sequence is defined as $Y_{1:T} = \{y_1, y_2, \ldots, y_T\}$.…”

Section: Problem Definition (mentioning)

confidence: 99%
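
The displacement definition quoted above, y_i = x_{i+1} − x_i, can be illustrated with a minimal sketch. It is not the citing authors' code: the function names `displacements` and `reconstruct` and the flattened-list pose layout are assumptions made for illustration only. Note that a sequence of n poses yields n − 1 displacements.

```python
from typing import List

# Assumption: one pose is a flat list of joint coordinates for a single frame.
Pose = List[float]


def displacements(poses: List[Pose]) -> List[Pose]:
    """Return y_i = x_{i+1} - x_i for every pair of consecutive frames."""
    return [
        [nxt_c - cur_c for cur_c, nxt_c in zip(cur, nxt)]
        for cur, nxt in zip(poses, poses[1:])
    ]


def reconstruct(first: Pose, disp: List[Pose]) -> List[Pose]:
    """Invert the transform: accumulate displacements onto the first pose."""
    poses = [list(first)]
    for y in disp:
        poses.append([p + d for p, d in zip(poses[-1], y)])
    return poses


# Toy example: three frames, each with a single 2D joint.
x = [[0.0, 0.0], [1.0, 0.5], [3.0, 1.0]]
y = displacements(x)
print(y)
print(reconstruct(x[0], y) == x)
```

Because the first pose plus the displacement sequence recovers the original positions, the representation loses no information while making frame-to-frame motion explicit.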

“…For example, Guo et al [15] propose a collaborative prediction task and perform future motion prediction for only two interacted dancers, which inevitably ignores low interaction influence on one's future behavior. Wang et al [39] use local and global Transformers to learn individual motion and social interactions separately in a crowd scene. The aforementioned methods ignore the interactive influences of body parts and only learn temporal and social relationships without modeling fine-grained body interaction, which makes it difficult to capture complex interaction dependencies.…”

Section: Introduction (mentioning)

confidence: 99%


Trajectory-Aware Body Interaction Transformer for Multi-Person Pose Forecasting

Peng, Mao, Wu. 2023. Preprint.

Figure 1. (a) In complex crowd scenarios, different people may interact with one another at varying levels (low and high interactions) and at different positions (i.e., between near and far distances). (b) Illustration of our main idea on body part interactions. We divide the body joints into 5 parts; the Intra-Individual branch explores part relationships within each individual, and the Inter-Individual branch captures interaction dependencies of body parts between individuals. Our TBIFormer models intra- and inter-individual body part interactions simultaneously.


“…An attention mechanism adaptively weighs the keys of different key-value pairs based on their relative importance to a given query to predict the most suitable responses to the query [45]. Depending on the data paradigm of the key, the value, and the query, attention mechanisms are used in a wide variety of tasks, including tasks in natural language understanding [9], text-based image and video retrieval [4], object and action recognition in images and videos [46,40], and visual question answering [57]. In the case of user-specific highlight detection, the key, value, and query need to be based on the video contents, i.e., follow the paradigm of content-based highlight detection [42,37,2] to perform meaningful retrieval of the highlightable clips per user.…”

Section: Introduction (mentioning)

confidence: 99%

Show Me What I Like: Detecting User-Specific Video Highlights Using Content-Based Multi-Head Attention

Bhattacharya, Wu, Petrangeli et al. 2022. Preprint.

We propose a method to detect individualized highlights for users on given target videos based on their preferred highlight clips marked on previous videos they have watched. Our method explicitly leverages the contents of both the preferred clips and the target videos using pretrained features for the objects and the human activities. We design a multi-head attention mechanism to adaptively weigh the preferred clips based on their object- and human-activity-based contents, and fuse them using these weights into a single feature representation for each user. We compute similarities between these per-user feature representations and the per-frame features computed from the desired target videos to estimate the user-specific highlight clips from the target videos. We test our method on a large-scale highlight detection dataset containing the annotated highlights of individual users. Compared to current baselines, we observe an absolute improvement of 2-4% in the mean average precision of the detected highlights. We also perform extensive ablation experiments on the number of preferred highlight clips associated with each user as well as on the object- and human-activity-based feature representations to validate that our method is indeed both content-based and user-specific.
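
The general idea in this abstract (weigh preferred clips with attention, fuse them into one per-user feature, then score target frames by similarity) can be sketched roughly as follows. This is a simplified illustration, not the authors' method: it uses a single attention head instead of multi-head attention, cosine similarity as the similarity measure, and toy 2D features. The `query` vector and the names `attend` and `cosine` are hypothetical, since the abstract does not specify them.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Single-head scaled dot-product attention: a weighted sum of the values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Toy data (assumed): two preferred-clip features and two target-frame features.
clips = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.0]  # hypothetical query vector
user = attend(query, clips, clips)  # fused per-user representation

frames = [[1.0, 0.0], [0.0, 1.0]]
scores = [cosine(user, f) for f in frames]
print(scores)  # the frame resembling the higher-weighted clip scores higher
```

In this toy setup the first clip aligns with the query, so it receives the larger attention weight and the first target frame gets the higher similarity score.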
