In online public opinion events, users increasingly express their opinions and emotions on social platforms by combining texts and images rather than using texts alone. Given this shift in emotional expression, it is vital to recognize users' sentiments in such events in an appropriate way. In this paper, we propose a novel Deep Neural Networks (DNNs) model that recognizes online users' sentiments in online public opinion events by jointly analyzing the sentiments of texts and their attached images. We also compare two fusion strategies, feature-level fusion and decision-level fusion, for combining the affective information from texts and images. In feature-level fusion, fine-tuned Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory networks (BiLSTMs) extract visual and textual features, respectively; the features are then concatenated and fed to a DNNs classifier. In decision-level fusion, a rule fuses the outputs of the unimodal models to generate the final predicted labels. Experimental results show that the proposed multimodal DNNs model outperforms unimodal sentiment recognition models, and that feature-level fusion performs better than decision-level fusion in our experiments.
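The two fusion strategies can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature values, class probabilities, label set, and the averaging rule used for decision-level fusion are all illustrative assumptions; in the paper the features would come from a fine-tuned CNN (images) and a BiLSTM (texts), and the actual fusion rule may differ.

```python
def feature_level_fusion(visual_features, textual_features):
    """Concatenate unimodal feature vectors into one joint vector,
    which would then be fed to a DNNs classifier."""
    return visual_features + textual_features


def decision_level_fusion(text_probs, image_probs, labels):
    """Fuse unimodal class probabilities with a simple averaging rule
    (an assumed rule for illustration) and return the predicted label."""
    fused = [(t + i) / 2 for t, i in zip(text_probs, image_probs)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return labels[best]


# Toy usage with made-up values.
visual = [0.12, 0.80, 0.33]      # e.g. CNN image features
textual = [0.55, 0.21]           # e.g. BiLSTM text features
joint = feature_level_fusion(visual, textual)  # 5-dimensional joint vector

label = decision_level_fusion(
    text_probs=[0.2, 0.7, 0.1],      # text model's class probabilities
    image_probs=[0.3, 0.5, 0.2],     # image model's class probabilities
    labels=["negative", "positive", "neutral"],
)
# label == "positive" (highest averaged probability)
```

Feature-level fusion lets the downstream classifier learn cross-modal interactions from the joint vector, whereas decision-level fusion keeps the unimodal models independent and only merges their predictions.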