Accurately parsing the multiple elements of a short video requires multimodal processing of three elements: text, image, and audio. Text information is the external label data presented with a short video, and it can currently be retrieved with web crawlers. Image information is the frame data extracted from the short video at set time intervals by frame processing; images come in many formats, and the frame structure and pixel composition differ from one image file type to another. Audio information is the stereo background music present in a short video; it also comes in different formats, each of which requires different processing. For short video recommendation, the processing results for text, image, and audio therefore need to be understood and analyzed jointly in order to produce the best recommendations. Based on multimodal content analysis techniques, this paper proposes a framework that processes the three elements of a short video together, forms a unified information representation, and handles the three elements of multimodal short videos at the semantic level.
Multimodal content analysis plays an important role in short video recommendation applications. This paper proposes a short video recommendation method based on multimodal data fusion, in which the key issue is how to combine information from different modalities effectively so that users are recommended the content they are looking for. The method considers not only user behavior features and short video label features, but also maps videos from a high-dimensional space to a low-dimensional dense space and extracts word-vector features of the short videos.
At the same time, we take into account the relationship between a user's clicking behavior and the cover image and profile of a short video. The image features and text features of the short video are fused with the structured features, and the fused features are trained together to complete the short video recommendation task. The method makes full use of the differences and similarities between different models to strengthen the model's understanding of user behavior and to improve the quality of short video recommendations.
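The fusion step described above can be illustrated with a minimal sketch. It assumes that each modality has already been reduced to a fixed-length vector, that fusion is plain concatenation, and that clicks are predicted by a single logistic unit trained with gradient descent. These are illustrative assumptions, not the model used in this paper, and all function names and feature values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def fuse_features(image_vec: Sequence[float],
                  text_vec: Sequence[float],
                  structured_vec: Sequence[float]) -> List[float]:
    """Fuse modalities by concatenating their vectors (assumed fusion scheme)."""
    return [*image_vec, *text_vec, *structured_vec]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_click(fused: Sequence[float],
                  weights: Sequence[float],
                  bias: float) -> float:
    """Predicted click probability from the fused feature vector."""
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    return sigmoid(z)


def train_step(fused: Sequence[float], label: int,
               weights: Sequence[float], bias: float,
               lr: float = 0.1) -> Tuple[List[float], float]:
    """One gradient-descent step on the log loss for a single example."""
    error = predict_click(fused, weights, bias) - label
    new_weights = [w - lr * error * x for w, x in zip(weights, fused)]
    new_bias = bias - lr * error
    return new_weights, new_bias


# Hypothetical per-modality features for one (user, video) pair.
image_vec = [0.2, 0.7]        # cover-image features
text_vec = [0.1, 0.4, 0.3]    # word-vector features of the video profile
structured_vec = [1.0, 0.0]   # user behavior and label features

fused = fuse_features(image_vec, text_vec, structured_vec)
weights, bias = [0.0] * len(fused), 0.0
for _ in range(50):
    # label = 1 means the user clicked the video
    weights, bias = train_step(fused, 1, weights, bias)
```

Because all three feature groups share one weight vector, the training signal from click behavior updates the image, text, and structured parts together, which is the sense in which the features are "trained together" in this sketch.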