Almost all visual transformers, such as ViT [14] and DeiT [41], rely on predefined positional encodings to incorporate the order of each input token. These encodings are often implemented as learnable fixed-dimension vectors or as sinusoidal functions of different frequencies, neither of which can accommodate variable-length input sequences. This inevitably limits a wider application of transformers in vision, where many tasks require changing the input size on-the-fly. In this paper, we propose an implicit conditional positional encoding scheme that is conditioned on the local neighborhood of the input tokens. It is effortlessly implemented as what we call a Position Encoding Generator (PEG), which can be seamlessly incorporated into the current transformer framework. Our new model with PEG is named Conditional Position encoding Vision Transformer (CPVT) and can naturally process input sequences of arbitrary length. We demonstrate that CPVT produces visually similar attention maps and even better performance than models with predefined positional encodings. We obtain state-of-the-art results on the ImageNet classification task compared with visual transformers to date. Our code will be made available at https://github.com/Meituan-AutoML/CPVT.
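To make the PEG idea concrete, here is a minimal PyTorch sketch, assuming a plain ViT-style token layout without a class token. The depthwise convolution, kernel size, and residual connection follow common descriptions of the method but may differ from the released implementation.

```python
# Minimal sketch of a Position Encoding Generator (PEG): positional
# information is produced by a convolution over the tokens' 2D
# neighborhood, so it adapts to any input resolution.
import torch
import torch.nn as nn

class PEG(nn.Module):
    def __init__(self, dim: int, k: int = 3):
        super().__init__()
        # Depthwise convolution; zero padding leaks absolute position
        # information at the borders, conditioning the encoding on the
        # local neighborhood of each token.
        self.proj = nn.Conv2d(dim, dim, k, stride=1, padding=k // 2, groups=dim)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (batch, h*w, dim); no class token for simplicity.
        b, n, c = tokens.shape
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        # Residual connection: the conv output acts as a conditional
        # positional encoding added to the original tokens.
        feat = self.proj(feat) + feat
        return feat.flatten(2).transpose(1, 2)

# Usage: typically inserted after the first transformer encoder block.
peg = PEG(dim=192)
x = torch.randn(2, 14 * 14, 192)   # 14x14 patch grid
print(peg(x, 14, 14).shape)        # torch.Size([2, 196, 192])
```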
Figure 1: Our system automatically and accurately reconstructs 3D skeletal poses in real time using monocular depth data obtained from a single camera. (top) Reference image data; (bottom) the reconstructed poses overlaid on depth data.

We present a fast, automatic method for accurately capturing full-body motion data using a single depth camera. At the core of our system lies a real-time registration process that accurately reconstructs 3D human poses from single monocular depth images, even in the case of significant occlusions. The idea is to formulate the registration problem in a Maximum A Posteriori (MAP) framework and iteratively register a 3D articulated human body model with monocular depth cues via linear system solvers. We integrate depth data, silhouette information, full-body geometry, temporal pose priors, and occlusion reasoning into a unified MAP estimation framework. Our 3D tracking process, however, requires manual initialization and recovery from failures. We address this challenge by combining 3D tracking with 3D pose detection. This combination not only automates the whole process but also significantly improves the robustness and accuracy of the system. Our whole algorithm is highly parallel and is therefore easily implemented on a GPU. We demonstrate the power of our approach by capturing a wide range of human movements in real time and achieve state-of-the-art accuracy in our comparison against alternative systems such as Kinect [2012].
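The iterative MAP registration described above can be sketched as a damped Gauss-Newton loop in which each cue (depth, silhouette, pose prior, and so on) contributes residuals and Jacobians to one linear system per iteration. The residual interface and the toy prior term below are illustrative assumptions, not the paper's actual energy terms.

```python
# Sketch of MAP registration via iterated linear system solves: each
# energy term supplies residuals r(theta) and a Jacobian J(theta), and
# a damped Gauss-Newton step aggregates them into one normal equation.
import numpy as np

def gauss_newton_map(theta, terms, n_iters=10, damping=1e-3):
    """terms: list of (weight, residual_fn) where residual_fn(theta)
    returns (r, J) with r: (m,) residuals and J: (m, dof) Jacobian."""
    for _ in range(n_iters):
        H = damping * np.eye(theta.size)   # Levenberg-style damping
        g = np.zeros(theta.size)
        for w, fn in terms:
            r, J = fn(theta)
            H += w * J.T @ J               # Gauss-Newton Hessian approx.
            g += w * J.T @ r
        theta = theta - np.linalg.solve(H, g)  # linear system solve
    return theta

# Toy usage: a single quadratic prior pulling the pose toward a rest pose.
rest = np.zeros(3)
prior = (1.0, lambda th: (th - rest, np.eye(3)))
print(gauss_newton_map(np.array([1.0, -2.0, 0.5]), [prior]))
```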
This paper introduces an approach to performance animation that employs a small number of motion sensors to create an easy-to-use system for interactive control of a full-body human character. Our key idea is to construct a series of online local dynamic models from a prerecorded motion database and utilize them to construct full-body human motion in a maximum a posteriori (MAP) framework. We have demonstrated the effectiveness of our system by controlling a variety of human actions, such as boxing, golf swinging, and table tennis, in real time. Given an appropriate motion capture database, the results are comparable in quality to those obtained from a commercial motion capture system with a full set of motion sensors (e.g., XSens [2009]); however, our performance animation system is far less intrusive and expensive because it requires only a small number of motion sensors for full-body control. We have also evaluated the performance of our system through leave-one-out experiments and by comparing it with two baseline algorithms.
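A minimal sketch of the online local dynamic model idea, under simplifying assumptions: the k nearest database examples are used to fit a local linear dynamics prior, which is then combined with linear (Gaussian) sensor observations in closed form. Function names and the least-squares formulation are hypothetical stand-ins for the paper's formulation.

```python
# Online local dynamics: fit x_{t+1} ~ A x_t + b from nearby database
# examples, then use it as a prior in a MAP pose estimate.
import numpy as np

def fit_local_dynamics(db_prev, db_next, query, k=50):
    """Fit a local linear model from the k database poses nearest to
    the query; db_prev/db_next hold consecutive pose pairs."""
    d = np.linalg.norm(db_prev - query, axis=1)
    idx = np.argsort(d)[:k]
    X = np.hstack([db_prev[idx], np.ones((k, 1))])   # append bias column
    W, *_ = np.linalg.lstsq(X, db_next[idx], rcond=None)
    A, b = W[:-1].T, W[-1]
    return A, b

def map_pose(x_prev, A, b, C, z, w_prior=1.0, w_obs=10.0):
    """MAP estimate combining the dynamics prior A x_prev + b with
    linear sensor observations z ~ C x (Gaussian assumptions)."""
    H = w_prior * np.eye(x_prev.size) + w_obs * C.T @ C
    g = w_prior * (A @ x_prev + b) + w_obs * C.T @ z
    return np.linalg.solve(H, g)
```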
This paper introduces an efficient algorithm that reconstructs 3D human poses as well as camera parameters from a small number of 2D point correspondences obtained from uncalibrated monocular images. This problem is challenging because 2D image constraints (e.g., 2D point correspondences) are often not sufficient to determine the 3D pose of an articulated object. The key idea of this paper is to identify a set of new constraints and use them to eliminate the ambiguity of 3D pose reconstruction. We also develop an efficient optimization process to simultaneously reconstruct both human poses and camera parameters from various forms of reconstruction constraints. We demonstrate the power and effectiveness of our system by evaluating the performance of the algorithm on both real and synthetic data. We show that the algorithm can accurately reconstruct 3D poses and camera parameters from a wide variety of real images, including internet photos and key frames extracted from monocular video sequences.
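To illustrate the joint optimization described above, the sketch below solves for 3D joint positions and a weak-perspective camera scale from 2D correspondences plus bone-length constraints. The parameterization and weights are assumptions for illustration; with only these two constraint types the depth ambiguity noted above remains, and it is the paper's additional constraints that resolve it.

```python
# Joint pose-and-camera optimization as a nonlinear least-squares
# problem over reprojection and bone-length residuals.
import numpy as np
from scipy.optimize import least_squares

def residuals(params, p2d, bones, bone_len, w_bone=1.0):
    s = params[0]                        # weak-perspective scale
    X = params[1:].reshape(-1, 3)        # 3D joint positions
    r_proj = (s * X[:, :2] - p2d).ravel()  # 2D reprojection error
    r_bone = [w_bone * (np.linalg.norm(X[i] - X[j]) - l)
              for (i, j), l in zip(bones, bone_len)]  # rigid bone lengths
    return np.concatenate([r_proj, np.array(r_bone)])

# Toy usage with a 3-joint chain; initialize depths at zero.
p2d = np.array([[0.0, 0.0], [0.5, 0.1], [1.0, 0.3]])
bones, bone_len = [(0, 1), (1, 2)], [0.6, 0.6]
x0 = np.concatenate([[1.0], np.hstack([p2d, np.zeros((3, 1))]).ravel()])
sol = least_squares(residuals, x0, args=(p2d, bones, bone_len))
X_rec = sol.x[1:].reshape(-1, 3)         # recovered 3D joints
```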