To address the shortcomings of traditional human motion gesture tracking and recognition methods, namely poor tracking performance, low recognition accuracy, high frame loss rate, and high time cost, a dynamic human motion gesture tracking and recognition algorithm based on multimodal deep learning is proposed. First, the collected human motion images are repaired in a three-dimensional (3D) environment, and a multimodal 3D human motion model is reconstructed from the processed images. Second, based on the reconstructed model, the camera pose and other keyframe parameters are used to construct a target tracking optimization function, which enables accurate tracking of human motion. Finally, a convolutional neural network (CNN) is developed for multimodal human motion gesture learning, and the trained CNN performs dynamic human motion recognition through convolution and pooling operations. The results show that the proposed algorithm tracks human motion gestures effectively: the average recognition accuracy is 96%, the average frame loss rate is 8.8%, and the time cost is low. The proposed algorithm also achieves a high F-measure and much lower power consumption than other algorithms.
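As an illustration of the convolution and pooling operations mentioned above, the following is a minimal sketch in plain Python. It is not the network described here: the function names (`conv2d_valid`, `relu`, `max_pool2d`), the single-channel input, the stride-1 unpadded convolution, the ReLU activation, and the 2×2 non-overlapping pooling window are all assumptions chosen for illustration.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Stride-1, unpadded ("valid") 2D cross-correlation, the operation
    commonly called convolution in CNN layers (hypothetical helper)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map: Matrix) -> Matrix:
    """Element-wise ReLU activation (an assumed choice of nonlinearity)."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool2d(feature_map: Matrix, size: int = 2) -> Matrix:
    """Non-overlapping max pooling; trailing rows or columns that do not
    fill a complete window are dropped."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[i * size + di][j * size + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]
```

Applying `max_pool2d(relu(conv2d_valid(image, kernel)))` to a 5×5 input with a 2×2 kernel yields a 4×4 feature map that is pooled down to 2×2. A practical CNN would stack several such stages over multi-channel inputs and learn the kernel weights during training.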