Photovoltaic (PV) power generation is highly uncertain because of the randomness and variability of solar energy and meteorological parameters. Accurate PV power forecasts are therefore essential for short-term dispatch and generation scheduling in PV power plants (PVPP). In this paper, a new deep neural network structure based on the vision transformer (ViT) is proposed that applies Tokens-To-Token (T2T) tokenization to sky images for PV power prediction. The method uses an incremental tokenization module to aggregate neighboring image patches into tokens, which capture the local structural information of the clouds. An efficient T2T-ViT backbone network then extracts the global attention relationships among the tokens for power prediction. To evaluate its performance, the proposed model was compared with several deep learning architectures, such as ResNet and GoogLeNet, on a dataset collected by the National Renewable Energy Laboratory in Colorado, USA. The power prediction results were analysed using training loss, prediction error, and linear regression. They show that the proposed method achieves higher prediction accuracy and lower error than the existing methods, especially in short- and ultra-short-term prediction. The paper demonstrates the potential of applying Transformer models to computer vision tasks for renewable energy forecasting.
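The aggregation of neighboring image patches into tokens can be illustrated with a minimal NumPy sketch of a single overlapping "soft split" step. It is an illustration under stated assumptions, not the implementation used in this paper: the function name `soft_split` and the kernel, stride, and padding values are hypothetical, and the transformer layer that T2T applies between successive splits is omitted.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def soft_split(feature_map, kernel=3, stride=2, padding=1):
    """Aggregate overlapping kernel x kernel neighborhoods into tokens.

    feature_map: array of shape (H, W, C), e.g. a sky image or a token grid.
    Returns (tokens, (h_out, w_out)) where tokens has shape
    (h_out * w_out, C * kernel * kernel).
    Kernel, stride, and padding defaults are illustrative assumptions.
    """
    # Zero-pad the two spatial axes only.
    padded = np.pad(
        feature_map, ((padding, padding), (padding, padding), (0, 0))
    )
    # All kernel x kernel windows: shape (H', W', C, kernel, kernel).
    windows = sliding_window_view(padded, (kernel, kernel), axis=(0, 1))
    # Keep every `stride`-th window so that neighboring windows overlap.
    windows = windows[::stride, ::stride]
    h_out, w_out = windows.shape[:2]
    # Concatenate each neighborhood into one token vector.
    tokens = windows.reshape(h_out * w_out, -1)
    return tokens, (h_out, w_out)


# Toy 8 x 8 RGB "sky image" (synthetic values, for shape checking only).
image = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
tokens, grid = soft_split(image)
print(tokens.shape, grid)
```

Because the stride is smaller than the kernel, adjacent tokens share pixels, which is how local structure such as cloud edges is carried into each token while the token count shrinks.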