Video prediction has advanced rapidly with the rise of deep learning. As an important part of unsupervised representation learning, it plays a key role in anomalous behavior detection, autonomous driving, video games, and other fields. However, prediction methods based on optical flow estimation are susceptible to brightness changes and camera shake, and they struggle to predict occluded objects, while prediction methods based on pixel generation have difficulty fitting ambiguous and complex scenes, which leads to blurry predictions. In this work, we propose an end-to-end video prediction framework that combines an optical flow estimation module with a pixel generation module through a learnable mask weight to predict high-fidelity videos. To further improve prediction quality, we introduce adversarial training into the framework: a frame discriminator and a sequence discriminator ensure that the spatial and temporal distributions of predicted video frames are consistent with those of real video frames. Experiments on challenging datasets demonstrate the practicability and effectiveness of the proposed framework. On the one hand, our framework achieves quality comparable to the current state-of-the-art model while requiring fewer parameters and offering faster prediction. On the other hand, ablation experiments demonstrate the benefit of fusing the two modules and the effectiveness of adversarial training.

INDEX TERMS adversarial training, convolutional neural network, optical flow prediction, pixel generation, video prediction.
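To make the mask-based fusion concrete, the following is a minimal sketch of how a flow-warped frame and a generated frame can be blended by a learnable per-pixel mask. The module name, layer sizes, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MaskFusion(nn.Module):
    """Hypothetical sketch: fuse a flow-warped frame and a generated frame
    with a learnable soft mask (not the paper's exact architecture)."""

    def __init__(self, in_channels: int = 6):
        super().__init__()
        # Predict a per-pixel mask in [0, 1] from the two candidate frames.
        self.mask_head = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, warped_frame: torch.Tensor, generated_frame: torch.Tensor) -> torch.Tensor:
        # warped_frame:    frame warped by the estimated optical flow, shape (B, 3, H, W)
        # generated_frame: frame synthesized by the pixel generation module, shape (B, 3, H, W)
        mask = self.mask_head(torch.cat([warped_frame, generated_frame], dim=1))
        # Per-pixel convex combination of the two predictions.
        return mask * warped_frame + (1.0 - mask) * generated_frame


if __name__ == "__main__":
    fusion = MaskFusion()
    warped = torch.rand(1, 3, 64, 64)
    generated = torch.rand(1, 3, 64, 64)
    print(fusion(warped, generated).shape)  # torch.Size([1, 3, 64, 64])
```

In such a scheme, regions where optical flow warping is reliable (e.g., smooth motion) can receive a mask value near 1, while occluded or ambiguous regions fall back to the generated pixels.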