Virtual reality technology, with its continuous development, is gradually applied to healthcare, education, business, and other fields. In the application of the technology, position and attitude estimation, as a space positioning technology, is indispensable. Traditional pose estimation has the problems of high dependence on environment and great complexity. But convolutional neural network (CNN) and other technologies with computational intelligence provide a strong guarantee for the progress of pose estimation. This article, based on the theory of CNN in deep learning, as well as monocular vision system and target sample set with markers, proposes a method for estimation of target position and attitude, and at the same time, describes in detail a general way of making dataset with markers based on simulation environment. In this article, the comparative experiments of different network structures show that this measurement method can avoid manual extraction of complex image features, and realize fast, arbitrary and accurate measurement, which plays a key role in pose and attitude measurement. Moreover, the visual correspondence between the world coordinate system and the pixel coordinate system is proved effectively by quaternion.