Facial expression recognition (FER) is advancing human-computer interaction, a task made more difficult now that facial masks are commonly worn due to the COVID-19 pandemic. Traditional unimodal techniques for facial expression recognition may be ineffective under these circumstances. To address this challenge, multimodal approaches that incorporate data from additional modalities, such as vocal expressions, have emerged as a promising solution. This paper proposes a novel deep-learning-based multimodal methodology for effectively recognizing facial expressions under masked conditions. The approach uses two standard datasets, M-LFW-F and CREMA-D, to capture facial and vocal emotional expressions, respectively. A multimodal neural network is then trained using fusion techniques, outperforming conventional unimodal methods in facial expression recognition. Experimental evaluations demonstrate that the proposed approach achieves an accuracy of 79.81%, a significant improvement over the 68.81% accuracy attained by the unimodal technique. These results highlight the superior performance of the proposed approach in facial expression recognition under masked conditions.

INDEX TERMS Deep learning, multimodal fusion techniques, neural network, facial expression under the mask.
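The fusion idea summarized above can be sketched in miniature. The following is an illustrative outline of feature-level fusion, not the paper's actual architecture: all function names, feature dimensions, and values below are hypothetical placeholders standing in for learned embedding networks.

```python
# Hypothetical sketch of feature-level (early) fusion for audio-visual
# emotion recognition. The paper does not specify its exact architecture;
# every name and dimension here is an illustrative assumption.

def extract_face_features(face_frame):
    # Placeholder for a CNN embedding of a (masked) face image.
    return [0.2, 0.7, 0.1]

def extract_voice_features(audio_clip):
    # Placeholder for an embedding of a vocal expression clip.
    return [0.5, 0.3]

def fuse(face_vec, voice_vec):
    # Feature-level fusion: concatenate the per-modality embeddings
    # so that shared classification layers can see both modalities.
    return face_vec + voice_vec

fused = fuse(extract_face_features(None), extract_voice_features(None))
print(len(fused))  # combined feature dimension
```

In practice the concatenated vector would feed a jointly trained classifier, which is what lets the voice modality compensate when the mask occludes the lower face.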