This paper presents the 6D vision transformer (6D-ViT), a transformer-based instance representation learning network for highly accurate category-level object pose estimation from RGB-D images. Specifically, a novel two-stream encoder-decoder framework is designed to extract rich instance representations from RGB images, point clouds, and categorical shape priors. The framework consists of two main branches, named Pixelformer and Pointformer. Pixelformer combines a pyramid transformer encoder with an all-multilayer-perceptron (MLP) decoder to extract pixelwise appearance representations from RGB images, while Pointformer relies on a cascaded transformer encoder and an all-MLP decoder to acquire pointwise geometric features from point clouds. Dense instance representations (i.e., a correspondence matrix and a deformation field) are then obtained from a multisource aggregation (MSA) network that takes the shape prior, appearance, and geometric information as input. Finally, the instance 6D pose is computed by leveraging the correspondences among the dense representations, the shape prior, and the instance point cloud. Extensive experiments on both synthetic and real-world datasets demonstrate that the proposed instance representation learning framework achieves state-of-the-art performance and significantly outperforms existing methods. Our source code will be made publicly available.
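
The following is a minimal sketch of the pipeline described above, assuming PyTorch. The module names Pixelformer, Pointformer, and MSA come from the abstract, but their internals, all tensor shapes, and the Umeyama similarity solver are illustrative placeholders chosen by us (Umeyama alignment is the standard pose-recovery step in shape-prior methods), not the authors' implementation:

```python
# Illustrative sketch only: interfaces and shapes are assumptions,
# not the authors' code.
import torch
import torch.nn as nn


class SixDViT(nn.Module):
    def __init__(self, pixelformer: nn.Module, pointformer: nn.Module, msa: nn.Module):
        super().__init__()
        self.pixelformer = pixelformer  # pyramid transformer encoder + all-MLP decoder (RGB stream)
        self.pointformer = pointformer  # cascaded transformer encoder + all-MLP decoder (point stream)
        self.msa = msa                  # multisource aggregation network

    def forward(self, rgb, points, prior):
        # rgb:    (B, 3, H, W) cropped instance image
        # points: (B, N, 3) back-projected instance point cloud
        # prior:  (B, M, 3) categorical shape prior
        appearance = self.pixelformer(rgb)    # (B, N, Ca) pixelwise appearance features
        geometry = self.pointformer(points)   # (B, N, Cg) pointwise geometric features
        # MSA fuses both streams with the shape prior into the two
        # dense instance representations named in the abstract.
        corr, deform = self.msa(appearance, geometry, prior)
        # corr:   (B, N, M) correspondence matrix between points and prior
        # deform: (B, M, 3) deformation field warping the prior to the instance
        return corr, deform


def recover_pose(corr, deform, prior, points):
    """Recover a similarity transform for one instance (no batch dim)."""
    model = prior + deform           # deformed prior approximates the instance model
    matched = corr @ model           # (N, 3) model points matched to each observed point
    return umeyama(matched, points)  # align model-frame points to camera-frame points


def umeyama(src, dst):
    """Least-squares similarity transform with dst ~ s * R @ src + t."""
    n = src.shape[0]
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / n
    U, S, Vh = torch.linalg.svd(cov)
    d = torch.sign(torch.linalg.det(U) * torch.linalg.det(Vh))
    sign = torch.tensor([1.0, 1.0, float(d)])
    R = U @ torch.diag(sign) @ Vh                 # proper rotation (det = +1)
    s = (S * sign).sum() / (xs.pow(2).sum() / n)  # scale from source variance
    t = mu_d - s * (R @ mu_s)
    return s, R, t
```

In this sketch, pose recovery is deliberately decoupled from the network: the learned correspondence matrix and deformation field fully determine the dense 3D-3D matches, so the 6D pose (plus scale) follows from a closed-form alignment rather than direct regression.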