Facial expression recognition is a key technology in robot vision that helps robots understand human emotions. However, real-world interference, such as illumination changes, face occlusion, and pose variation, reduces the recognition rate of existing models. To address these problems, this paper proposes a novel deep model that improves the classification accuracy of facial expressions. The proposed model has the following merits: 1) a pose-guided face alignment method is proposed to reduce intra-class differences and mitigate the impact of environmental noise; 2) a hybrid feature representation method is proposed to obtain high-level discriminative facial features that achieve better results in classification networks; 3) a lightweight fusion backbone is designed, which combines VGG-16 and ResNet so that training requires little data and low computational cost. Finally, to evaluate the proposed model, we conduct a series of experiments on four benchmark datasets: CK+, JAFFE, Oulu-CASIA, and AR. The results show that the proposed model achieves state-of-the-art recognition rates of 98.9%, 96.8%, 94.5%, and 98.7%, respectively. Compared with traditional methods and other advanced deep learning methods, the proposed model achieves comparable performance in a variety of tasks.