Emotional problems are common among contemporary college students. To improve their mental health, it is urgent to identify college students' negative emotions quickly and to guide them toward healthier emotional development. Students' emotions are expressed through multiple modalities, such as audio, facial expressions, and gestures, and exploiting the complementarity of multi-modal emotional information can improve the accuracy of emotion recognition. This paper proposes a deep-learning-based multi-modal emotion recognition method for voice and video images: (1) for voice-modality recognition, the voice signal is first preprocessed to extract acoustic emotional features, and an attention-based long short-term memory (LSTM) network is then adopted for emotion recognition; (2) for video-image-modality recognition, an extended local binary pattern (LBP) operator is used to compute image features, LBP block weighting is combined with multi-scale partitioning for feature extraction, principal component analysis (PCA) reduces the dimensionality of the feature vectors, and a VGG-16 network trained with a transfer learning strategy performs emotion recognition; (3) the voice and video-image emotion predictions produced by the single-modal recognizers are weighted and fused at the decision layer to classify multi-modal emotions. Experimental results on the test set of the CHEAVD 2.0 Chinese emotion database show that the recognition accuracy of the proposed multi-modal fusion algorithm is higher than that of the single-modal recognition methods.
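The LBP operator underlying the video-image pipeline can be illustrated with a minimal sketch. The basic 3x3 variant below thresholds the eight neighbors of a pixel against the center and packs the comparison bits into one byte; the paper's extended, block-weighted, multi-scale operator builds on this idea. The function name and neighbor ordering are illustrative, not the paper's exact formulation.

```python
import numpy as np

def lbp_code(patch):
    """Basic 3x3 LBP: threshold the 8 neighbors against the center pixel
    and pack the comparison bits into one byte, clockwise from top-left.
    A minimal illustration; the paper's extended operator adds block
    weighting and multi-scale partitioning on top of this."""
    center = patch[1, 1]
    # Neighbor coordinates, clockwise starting at the top-left corner.
    coords = [(0, 0), (0, 1), (0, 2), (1, 2),
              (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(coords):
        if patch[r, c] >= center:      # neighbor >= center -> bit 1
            code |= 1 << (7 - bit)     # most significant bit first
    return code

# Example: bright top row, dark bottom row around a center value of 3.
patch = np.array([[5, 5, 5],
                  [1, 3, 1],
                  [0, 0, 0]])
code = lbp_code(patch)  # top three neighbors set -> 0b11100000 = 224
```

In practice the codes of all pixels in a block are accumulated into a histogram, and the per-block histograms are concatenated (with weights) into the feature vector that PCA then reduces.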
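The decision-layer fusion step can likewise be sketched. Assuming each single-modal recognizer outputs a per-class probability distribution, a weighted sum of the two distributions selects the fused emotion label; the class set and the weight values below are illustrative assumptions, not the paper's tuned parameters.

```python
import numpy as np

# Illustrative emotion classes; the CHEAVD 2.0 label set may differ.
EMOTIONS = ["happy", "sad", "angry", "neutral"]

def fuse_decisions(p_audio, p_video, w_audio=0.4, w_video=0.6):
    """Decision-level fusion: weighted sum of the per-class probability
    vectors from the audio and video recognizers, then argmax.
    Weights are hypothetical and would be tuned on a validation set."""
    p_audio = np.asarray(p_audio, dtype=float)
    p_video = np.asarray(p_video, dtype=float)
    fused = w_audio * p_audio + w_video * p_video
    return EMOTIONS[int(np.argmax(fused))], fused

# Example: audio favors "sad", video favors "neutral";
# with the video modality weighted higher, "neutral" wins.
label, fused = fuse_decisions([0.1, 0.6, 0.1, 0.2],
                              [0.1, 0.2, 0.1, 0.6])
```

Because both inputs are probability distributions and the weights sum to one, the fused vector is itself a valid distribution, which keeps the decision rule well defined.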