2021
DOI: 10.1016/j.future.2021.05.029
Semi-supervised classification-aware cross-modal deep adversarial data augmentation

Cited by 10 publications
(7 citation statements)
References 32 publications
“…To conclude, although other works in the literature also perform multimodal emotion recognition on RAVDESS, such as Wang et al. [66], who used facial images to generate spectrograms that were then used as data augmentation to improve SER model performance on six emotions, to our knowledge our work is the first that evaluates a late fusion strategy using the visual information of RAVDESS for facial emotion recognition with the eight emotions of the dataset, a pre-trained STN, and the aural modality.…”
Section: Multimodal Emotion Recognitionmentioning
confidence: 96%
“…The Yolo v4 algorithm extracts image features through the CSPDarknet53 network, divides the image into an S × S grid, and detects a target via the grid cell containing its center; it up- and down-samples features with the residual network, stacks max-pooling outputs at different scales, and finally outputs the target category and position [28].…”
Section: Yolo V4mentioning
confidence: 99%
“…To sum up, despite the fact that other works in the literature also performed multimodal emotion recognition on RAVDESS, such as Wang et al. [ 33 ], who used facial images to generate spectrograms that were then used for data augmentation to improve SER model performance on six emotions, our work is the first that, to our knowledge, detects the stressed and relaxed state using the audio-visual information of RAVDESS by means of aural and facial emotion recognition using the eight emotions.…”
Section: Literature Reviewmentioning
confidence: 99%