Multimodal affective computing, which learns to recognize and interpret human affect and subjective information from multiple data sources, remains challenging because: (i) it is hard to extract informative features that represent human affect from heterogeneous inputs; and (ii) current fusion strategies combine modalities only at abstract levels, ignoring time-dependent interactions between modalities. To address these issues, we introduce a hierarchical multimodal architecture with attention and word-level fusion that classifies utterance-level sentiment and emotion from text and audio data. The proposed model outperforms state-of-the-art approaches on published datasets, and we demonstrate that its synchronized attention over modalities offers visual interpretability.
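The word-level fusion described above can be sketched minimally in numpy: for each word, a scalar score is computed per modality, normalized with a softmax across modalities, and used to form an attention-weighted combination. This is an illustrative sketch only; the function names and the single scoring vector `score_w` are hypothetical stand-ins for the learned layers in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def word_level_fusion(text_feats, audio_feats, score_w):
    """Fuse word-aligned text and audio features with modality attention.

    text_feats, audio_feats: (num_words, dim) arrays, one row per word.
    score_w: (dim,) scoring vector (a stand-in for a learned layer).
    Returns fused (num_words, dim) features and (num_words, 2) attention.
    """
    # Stack the two modalities: (num_words, 2, dim)
    stacked = np.stack([text_feats, audio_feats], axis=1)
    # One scalar score per word per modality: (num_words, 2)
    scores = stacked @ score_w
    # Attention distribution over modalities, per word
    attn = softmax(scores, axis=1)
    # Attention-weighted sum across modalities: (num_words, dim)
    fused = (attn[..., None] * stacked).sum(axis=1)
    return fused, attn
```

The per-word attention weights returned here are also what makes the synchronized attention visualizable: for each word one can plot how much the model relied on text versus audio.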
Human conversation analysis is challenging because meaning can be expressed through words, intonation, or even body language and facial expression. We introduce a hierarchical encoder-decoder structure with an attention mechanism for conversation analysis. The hierarchical encoder learns word-level features from video, audio, and text data, which are then aggregated into conversation-level features. The corresponding hierarchical decoder can predict different attributes at given time instances. To integrate multiple sensory inputs, we introduce a novel fusion strategy with modality attention. We evaluated our system on published emotion recognition, sentiment analysis, and speaker trait analysis datasets. Our system outperformed previous state-of-the-art approaches in both classification and regression tasks on three datasets. We also outperformed previous approaches in generalization tests on two commonly used datasets, and achieved comparable performance in predicting co-existing labels with the proposed model.
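The hierarchy itself, word-level features pooled into utterance vectors and utterance vectors pooled into a conversation vector, can be illustrated with a simple attention-pooling sketch. This is an assumption-laden simplification: the paper's encoder is learned end to end, whereas here `w_word` and `w_utt` are fixed illustrative scoring vectors.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax for a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(feats, w):
    """Collapse (n, dim) features into one (dim,) vector via attention."""
    scores = feats @ w        # one scalar score per row
    attn = softmax(scores)    # attention distribution over rows
    return attn @ feats       # attention-weighted average

def hierarchical_encode(conversation, w_word, w_utt):
    """Two-level encoding: words -> utterances -> conversation.

    conversation: list of (num_words, dim) arrays, one per utterance.
    Returns a single (dim,) conversation-level vector.
    """
    utter_vecs = np.stack([attention_pool(u, w_word) for u in conversation])
    return attention_pool(utter_vecs, w_utt)
```

A hierarchical decoder would then condition on the conversation vector while attending back over the utterance vectors to predict attributes at specific time instances.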
Rural built-up areas are among the most important features of rural regions, and their rapid, accurate extraction is of great significance to rural planning and urbanization. In this paper, the spectral residual method is embedded into a deep neural network to accurately delineate rural built-up areas in large-scale satellite images. The proposed method consists of two stages: coarse localization and fine extraction. First, an improved Faster R-CNN (Regions with Convolutional Neural Network features) detector is trained to coarsely localize candidate built-up areas; then the spectral residual method is used to delineate the accurate boundary of each built-up area within its bounding box. In the experiments, we first explored the relationship between the sizes of built-up areas and the kernel sizes used in the spectral residual method. Comparative experiments then demonstrate that the proposed method performs better at extracting rural built-up areas.
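The spectral residual step in the second stage follows the classic saliency formulation: take the log amplitude spectrum of the image patch, subtract its local average (the kernel size studied in the experiments), and invert the residual with the original phase to obtain a saliency map. A minimal numpy sketch, assuming a grayscale patch cropped from a detected bounding box:

```python
import numpy as np

def spectral_residual_saliency(img, kernel=3):
    """Spectral residual saliency map for a 2-D grayscale image.

    kernel: side length of the box filter that estimates the local
    average of the log amplitude spectrum (the parameter whose relation
    to built-up-area size the experiments explore).
    """
    f = np.fft.fft2(img)
    log_amp = np.log(np.abs(f) + 1e-8)   # log amplitude spectrum
    phase = np.angle(f)                  # phase spectrum, kept as-is
    # Local average of log amplitude via a box filter with edge padding
    pad = kernel // 2
    padded = np.pad(log_amp, pad, mode='edge')
    avg = np.zeros_like(log_amp)
    h, w = log_amp.shape
    for i in range(kernel):
        for j in range(kernel):
            avg += padded[i:i + h, j:j + w]
    avg /= kernel ** 2
    residual = log_amp - avg             # the spectral residual
    # Back to the spatial domain: saliency = |IFFT(exp(R + i*phase))|^2
    sal = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    return sal / sal.max()               # normalize to [0, 1]
```

Thresholding the normalized map then yields the fine boundary inside each coarse Faster R-CNN box; in practice the saliency map is usually smoothed (e.g. with a Gaussian) before thresholding.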