Microblogging platforms such as Twitter have become indispensable for disseminating valuable information, especially during natural and man-made disasters. People often post multimedia content containing images and/or videos to report important information such as casualties, infrastructure damage, and the urgent needs of affected people. Such information can be very helpful to humanitarian organizations in planning an adequate response in a time-critical manner. However, identifying disaster information among a vast number of posts is an arduous task, which calls for an automatic system that can separate actionable from non-actionable disaster-related information on social media. While many studies have shown the effectiveness of combining text and image content for disaster identification, most previous work focused on analyzing only the textual modality and/or applied traditional recurrent neural networks (RNNs) or convolutional neural networks (CNNs), which may degrade performance on long input sequences. This paper presents a multimodal disaster identification system that uses visual and textual data synergistically by conjoining influential word features with visual features to classify tweets. Specifically, we use a pretrained convolutional neural network (e.g., ResNet50) to extract visual features and a bidirectional long short-term memory (BiLSTM) network with an attention mechanism to extract textual features. We then aggregate the visual and textual features through a feature fusion approach followed by a softmax classifier. The evaluation demonstrates that the proposed multimodal system outperforms existing baselines, both unimodal and multimodal, attaining performance improvements of approximately 1% and 7%, respectively.
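
The following is a minimal sketch of the kind of fusion architecture described above, written in PyTorch. The layer sizes, vocabulary size, attention form, and number of output classes are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch of a ResNet50 + attention-BiLSTM fusion classifier (PyTorch).
# Hyperparameters (hidden size, vocab size, number of classes) are assumptions.
import torch
import torch.nn as nn
from torchvision import models

class MultimodalDisasterClassifier(nn.Module):
    def __init__(self, vocab_size=20000, embed_dim=300, hidden_dim=128, num_classes=2):
        super().__init__()
        # Visual branch: pretrained ResNet50 with its classification head removed.
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])   # -> (B, 2048, 1, 1)
        # Textual branch: word embeddings + BiLSTM + additive attention over time steps.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)
        # Fusion: concatenate visual and attended textual features, then classify.
        self.classifier = nn.Linear(2048 + 2 * hidden_dim, num_classes)

    def forward(self, images, token_ids):
        img_feat = self.cnn(images).flatten(1)                 # (B, 2048)
        h, _ = self.bilstm(self.embedding(token_ids))          # (B, T, 2H)
        weights = torch.softmax(self.attn(h), dim=1)           # attention weights (B, T, 1)
        txt_feat = (weights * h).sum(dim=1)                    # attended text features (B, 2H)
        fused = torch.cat([img_feat, txt_feat], dim=1)         # feature-level fusion
        return torch.log_softmax(self.classifier(fused), dim=1)  # softmax classification
```

In this sketch, concatenation is used as the fusion step and the attended BiLSTM output serves as the "influential word features"; other fusion strategies (e.g., weighted or gated fusion) would fit the same structure.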