With the rapid development of sensor technology and artificial intelligence, the video gesture recognition technology under the background of big data makes human-computer interaction more natural and flexible, bringing the richer interactive experience to teaching, on-board control, electronic games etc. To perform robust recognition under the conditions of illumination change, background clutter, rapid movement, and partial occlusion, an algorithm based on multi-level feature fusion of two-stream convolutional neural network is proposed, which includes three main steps. Firstly, the Kinect sensor obtains redgreen-blue-depth (RGB-D) images to establish a gesture database. At the same time, data enhancement is performed on the training set and test set. Then, a model of multi-level feature fusion of a two-stream convolutional neural network is established and trained. Experiments show that the proposed network model can robustly track and recognise gestures under complex backgrounds (such as similar complexion, illumination changes, and occlusion), and compared with the single-channel model, the average detection accuracy is improved by 1.08%, and mean average precision is improved by 3.56%.