The automotive smart cockpit is an intelligent, connected in-vehicle consumer electronics product that provides a safe, efficient, comfortable, and enjoyable human-machine interaction experience. Emotion recognition technology can help the smart cockpit better understand the driver's needs and state, improving the driving experience and enhancing safety. However, driver emotion recognition currently faces challenges such as low accuracy and high latency. In this paper, we propose a multimodal driver emotion recognition model. To the best of our knowledge, this is the first work to improve the accuracy of driver emotion recognition by taking facial video and driving behavior (brake pedal force and vehicle Y-axis and Z-axis positions) as inputs and employing a multi-task training approach. For verification, the proposed scheme is compared with several mainstream state-of-the-art methods on the publicly available multimodal driver emotion dataset PPB-Emo.
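The core idea of fusing a facial-video representation with driving-behavior signals under multi-task training can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual architecture: the layer sizes, the GRU behavior encoder, the late-fusion concatenation, and the auxiliary regression head are all assumptions introduced for exposition.

```python
import torch
import torch.nn as nn

class MultimodalEmotionNet(nn.Module):
    """Hypothetical sketch: fuse a facial-video embedding with a
    driving-behavior sequence (brake pedal force, vehicle Y/Z position)
    and train two heads jointly (emotion label + an auxiliary task)."""

    def __init__(self, video_dim=512, behavior_dim=3, hidden=128, n_emotions=7):
        super().__init__()
        # Encoder for a pre-extracted facial-video feature vector.
        self.video_enc = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU())
        # GRU encoder for the driving-behavior time series (3 signals).
        self.behavior_enc = nn.GRU(behavior_dim, hidden, batch_first=True)
        # Both task heads share the fused representation (multi-task training).
        self.emotion_head = nn.Linear(2 * hidden, n_emotions)
        self.aux_head = nn.Linear(2 * hidden, 1)  # illustrative auxiliary target

    def forward(self, video_feat, behavior_seq):
        v = self.video_enc(video_feat)                # (B, hidden)
        _, h = self.behavior_enc(behavior_seq)        # h: (1, B, hidden)
        fused = torch.cat([v, h.squeeze(0)], dim=-1)  # (B, 2 * hidden)
        return self.emotion_head(fused), self.aux_head(fused)

model = MultimodalEmotionNet()
video = torch.randn(4, 512)       # batch of 4 facial-video embeddings
behavior = torch.randn(4, 50, 3)  # 50 timesteps of 3 behavior signals
emotion_logits, aux_pred = model(video, behavior)
print(emotion_logits.shape, aux_pred.shape)
```

In such a setup, the total loss would typically be a weighted sum of the emotion-classification loss and the auxiliary-task loss, so that the behavior branch receives a training signal even when emotion labels are ambiguous.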