Music teaching is an activity that involves both "teaching" and "learning". Whether teachers and students can communicate and cooperate closely is therefore an important factor in determining the quality and efficiency of music courses. The multimodal teaching model integrates multiple modes, engages several of the students' senses, holds their attention, and makes it easier for them to grasp and understand what they have learned, which leads to better teaching outcomes. In a multimodal music class, teachers use PPT courseware to play audio and video and to display pictures and tables; they use artistic lettering, font colors, and underlining; and they use rich body language, so that students learn language and music knowledge through vivid teaching activities. Human-computer interaction is a key technology in the field of virtual reality. Its main purpose is to improve the interaction between the user and the computer from the user's point of view, thereby increasing the immersion and realism of the system. Within human-computer interaction, human behavior recognition is a critical technology that helps make the interaction natural and harmonious. This paper studies a multimodal music teaching model based on Kinect somatosensory technology. Most indicators for the experimental class, which used the multimodal music teaching model, were better than those for the control class, and the average scores of the experimental class all exceeded 85.5. The results show that teaching with the multimodal model can enhance students' classroom participation, help students understand and memorize knowledge, and stimulate their interest in learning.