2017
DOI: 10.1007/s11063-017-9719-y

A Temporal Dependency Based Multi-modal Active Learning Approach for Audiovisual Event Detection

Cited by 9 publications (5 citation statements)
References 50 publications
“…After extracting features from CNNs, [28] applied a three-layer deep neural network to fuse multimodal features. In addition, [29] used an RNN to extract features and proposed a new multi-objective method to focus on specific parts containing the strong emotional information of audio data.…”
Section: Related Studies (mentioning)
confidence: 99%
“…The semi-automatic labels are generated by our data driven active learning approach, presented in [62,63]. The basic assumption of this approach is the sparseness of emotional reactions in the audio and video modalities.…”
Section: Data Annotation (mentioning)
confidence: 99%
“…High technical quality: The technical quality of the data and related signals is also checked and demonstrated via different preliminary classifications conducted on various subsets of the database including: the video data [63], the gesture data [65], the audio data [66], the biophysiological data [67], the speech and the biophysiological data [68], and the multimodal data [69].…”
(mentioning)
confidence: 99%
“…Multi-modal approaches on the other hand, are designed to perform an aggregation of a set of information stemming from multiple and heterogeneous modalities by applying a specific information fusion technique, in order to improve both the performance as well as the robustness of an inference system. Rather than relying on a single channel, an effective and smart combination of complementary information stemming from multiple channels mitigates the drawbacks specific to each single channel, while improving the generalization ability of the optimized inference system in comparison to one based on a single modality (Kächele et al, 2016 ; Bellmann et al, 2018 ; Thiam et al, 2018 ).…”
Section: Introduction (mentioning)
confidence: 99%
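
To make the fusion idea in the statement above concrete, the following minimal Python sketch combines per-segment audio and video class scores by weighted averaging (decision-level fusion). It is only an illustration under assumed inputs: the feature dimensions, the fusion weights, and the score_audio / score_video stand-in classifiers are hypothetical and are not the fusion scheme of the cited works.

# Minimal late-fusion sketch (illustrative only; not the cited method).
import numpy as np

def score_audio(x_audio):
    # Stand-in audio classifier: random linear map + softmax over 3 classes.
    rng = np.random.default_rng(0)
    logits = x_audio @ rng.normal(size=(x_audio.shape[-1], 3))
    return np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

def score_video(x_video):
    # Stand-in video classifier: random linear map + softmax over 3 classes.
    rng = np.random.default_rng(1)
    logits = x_video @ rng.normal(size=(x_video.shape[-1], 3))
    return np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

def late_fusion(x_audio, x_video, w_audio=0.5, w_video=0.5):
    # Weighted average of the unimodal class posteriors, then argmax per segment.
    p = w_audio * score_audio(x_audio) + w_video * score_video(x_video)
    return p.argmax(-1)

# Example: 10 segments with 40-dim audio and 128-dim video features (assumed sizes).
x_a = np.random.default_rng(2).normal(size=(10, 40))
x_v = np.random.default_rng(3).normal(size=(10, 128))
print(late_fusion(x_a, x_v))

The weighted-average combination is one simple decision-level choice; the quoted passage notes that the specific information fusion technique (feature-level, decision-level, or learned fusion) is what a given multi-modal system selects to exploit complementary channels.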