With the development of deep learning and the mobile Internet, more and more image-based artificial intelligence applications are appearing in everyday life. Within image information, understanding the people in a picture has long been a focus of research and application, and it is also a basis of human-computer interaction. Human keypoint detection locates the joint positions of a target person in an image, providing basic information for subsequent human-computer interaction applications and functions. The main purpose of this paper is to study gymnastics action recognition (AR) based on deep convolutional neural networks (CNNs). The paper first describes the mechanism of each component of a CNN. It then analyzes the characteristics of CNNs: compared with a multi-layer BP neural network, the connections between CNN neurons combine local receptive fields with weight sharing, which reduces the number of weight parameters while maintaining network depth. The vanishing gradient problem can be avoided during training, and the network structure has good generalization ability. Performance comparison experiments show that, whether using single-stream RGB data, optical flow data, or two-stream fusion results, the recognition accuracy of deep neural networks is better than that of conventional networks, which illustrates the effectiveness of deep CNNs in gymnastics AR.
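The effect of local receptive fields and weight sharing on parameter count can be illustrated with a short calculation. The following is a minimal sketch, not the configuration used in this paper: the input size (32 x 32 RGB), the 3 x 3 kernel, the 16 output channels, and the function names `dense_params` and `conv_params` are all assumptions chosen for illustration.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer (as in a BP network)."""
    return n_in * n_out + n_out


def conv_params(k_h: int, k_w: int, c_in: int, c_out: int) -> int:
    """Weights plus biases of a 2D convolutional layer.

    Each output channel shares one k_h x k_w x c_in kernel across all
    spatial positions, so the count does not depend on the image size.
    """
    return k_h * k_w * c_in * c_out + c_out


# Hypothetical layer sizes, chosen only for illustration.
height, width, c_in, c_out = 32, 32, 3, 16

# Fully connected layer mapping the whole input to a same-resolution output.
dense = dense_params(height * width * c_in, height * width * c_out)

# Convolutional layer with a 3 x 3 local receptive field and shared weights.
conv = conv_params(3, 3, c_in, c_out)

print(f"fully connected: {dense} parameters")
print(f"convolutional:   {conv} parameters")
```

Under these assumed sizes, the convolutional layer needs several orders of magnitude fewer parameters than the fully connected layer producing an output of the same shape, which is the reduction described above.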