Recent Transformer-based contextual word representations, including BERT and XLNet, have achieved state-of-the-art performance across many areas of NLP. Fine-tuning these pre-trained contextual models on task-specific datasets has been the key to superior downstream performance. While fine-tuning is straightforward for lexical applications (applications with only the language modality), it is not trivial for multimodal language (a growing area in NLP focused on modeling face-to-face communication), because pre-trained models lack the components needed to accept the two additional modalities of vision and acoustics. In this paper, we propose an attachment to BERT and XLNet called the Multimodal Adaptation Gate (MAG). MAG allows BERT and XLNet to accept multimodal nonverbal data during fine-tuning. It does so by generating a shift to the internal representations of BERT and XLNet, a shift conditioned on the visual and acoustic modalities. In our experiments, we study the commonly used CMU-MOSI and CMU-MOSEI datasets for multimodal sentiment analysis. Fine-tuning MAG-BERT and MAG-XLNet significantly boosts sentiment analysis performance over previous baselines as well as over language-only fine-tuning of BERT and XLNet. On the CMU-MOSI dataset, MAG-XLNet achieves human-level multimodal sentiment analysis performance for the first time in the NLP community.
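The gated, norm-scaled shift described above can be sketched in plain numpy. This is a minimal single-token illustration, not the released implementation: all weight matrices, dimensions, and the gating layout (one ReLU gate per nonverbal modality, a displacement vector, and a norm-capped scaling before layer normalization) are illustrative assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def layer_norm(x, eps=1e-5):
    # Normalize a single embedding vector to zero mean and unit variance.
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def mag_shift(h, v, a, W_gv, W_ga, W_v, W_a, beta=1.0, eps=1e-6):
    """Shift one lexical embedding h using visual (v) and acoustic (a) features."""
    # Gates conditioned on the lexical vector paired with each nonverbal modality
    g_v = relu(W_gv @ np.concatenate([h, v]))
    g_a = relu(W_ga @ np.concatenate([h, a]))
    # Nonverbal displacement vector in the lexical embedding space
    H = g_v * (W_v @ v) + g_a * (W_a @ a)
    # Cap the shift so the nonverbal signal cannot dominate the lexical one
    alpha = min(np.linalg.norm(h) / (np.linalg.norm(H) + eps) * beta, 1.0)
    return layer_norm(h + alpha * H)

rng = np.random.default_rng(0)
d, dv, da = 8, 4, 3  # illustrative embedding sizes
h, v, a = rng.normal(size=d), rng.normal(size=dv), rng.normal(size=da)
W_gv, W_ga = rng.normal(size=(d, d + dv)), rng.normal(size=(d, d + da))
W_v, W_a = rng.normal(size=(d, dv)), rng.normal(size=(d, da))
shifted = mag_shift(h, v, a, W_gv, W_ga, W_v, W_a)
print(shifted.shape)  # same dimensionality as the input token embedding
```

Because the output keeps the token's original dimensionality, such a gate can in principle be inserted between any two Transformer layers without changing the rest of the network.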
Humor is a unique and creative communicative behavior often displayed during social interactions. It is produced in a multimodal manner, through the use of words (text), gestures (visual), and prosodic cues (acoustic). Understanding humor from these three modalities falls within the boundaries of multimodal language, a recent research trend in natural language processing that models natural language as it happens in face-to-face communication. Although humor detection is an established research area in NLP, it has been understudied in a multimodal context. This paper presents a diverse multimodal dataset, called UR-FUNNY, to open the door to understanding the multimodal language used in expressing humor. The dataset and accompanying studies present a framework for multimodal humor detection for the natural language processing community. UR-FUNNY is publicly available for research.
Background: Access to neurological care for Parkinson disease (PD) is a rare privilege for millions of people worldwide, especially in resource-limited countries. In 2013, there were just 1200 neurologists in India for a population of 1.3 billion people; in Africa, the average population per neurologist exceeds 3.3 million. In contrast, 60,000 people receive a diagnosis of PD every year in the United States alone, and a similar pattern of rising PD cases, fueled mostly by environmental pollution and an aging population, can be seen worldwide. The current projection of more than 12 million patients with PD worldwide by 2040 is only part of the picture, given that more than 20% of patients with PD remain undiagnosed. Timely diagnosis and frequent assessment are key to ensuring timely and appropriate medical intervention, thus improving the quality of life of patients with PD. Objective: In this paper, we propose a web-based framework that can help anyone, anywhere in the world, record a short speech task and have the recording analyzed to screen for PD. Methods: We collected data from 726 unique participants (PD: 262/726, 36.1%; non-PD: 464/726, 63.9%; average age 61 years) from all over the United States and beyond. A small portion of the data (approximately 54/726, 7.4%) was collected in a laboratory setting to compare the performance of models trained on noisy home-environment data against high-quality laboratory-environment data. The participants were instructed to utter a popular pangram containing all the letters of the English alphabet: "the quick brown fox jumps over the lazy dog." We extracted both standard acoustic features (mel-frequency cepstral coefficients and jitter and shimmer variants) and deep learning-based embedding features from the speech data. Using these features, we trained several machine learning algorithms.
We also applied model interpretation techniques such as Shapley additive explanations (SHAP) to ascertain the importance of each feature in determining the model's output. Results: We achieved an area under the curve of 0.753 for detecting self-reported PD by modeling the standard acoustic features with XGBoost, a gradient-boosted decision tree model. Further analysis revealed that the widely used mel-frequency cepstral coefficient features and a subset of previously validated dysphonia features, designed for detecting PD from a verbal phonation task (pronouncing "ahh"), influence the model's decision the most. Conclusions: Our model performed equally well on data collected in a controlled laboratory environment and in the wild, across gender and age groups. Using this tool, we can collect data from almost anyone, anywhere, with an audio-enabled device and help participants screen for PD remotely, contributing to equity and access in neurological care.
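The jitter and shimmer variants mentioned in the pipeline above admit a compact numpy sketch. Assuming the pitch periods and cycle peak amplitudes have already been extracted by a standard toolchain (e.g. Praat), the "local" variants are simply the mean absolute cycle-to-cycle change relative to the mean; the exact variant set used in the study may differ.

```python
import numpy as np

def local_jitter(periods):
    """Relative cycle-to-cycle variation in glottal period length."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """Relative cycle-to-cycle variation in cycle peak amplitude."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# A perfectly periodic voice has zero jitter; dysphonic voices deviate more.
steady = [0.010, 0.010, 0.010, 0.010]      # pitch periods in seconds
perturbed = [0.010, 0.011, 0.009, 0.012]   # hypothetical perturbed cycles
print(local_jitter(steady), local_jitter(perturbed))
```

Scalar features like these sit alongside the MFCC vectors in one feature matrix, which is what lets a tree ensemble such as XGBoost (and SHAP on top of it) rank their relative importance.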
Recognizing humor from a video utterance requires understanding its verbal and nonverbal components as well as incorporating the appropriate context and external knowledge. In this paper, we propose the Humor Knowledge enriched Transformer (HKT), which can capture the gist of a multimodal humorous expression by integrating the preceding context and external knowledge. We incorporate humor-centric external knowledge into the model by capturing the ambiguity and sentiment present in the language. We encode the language, acoustic, vision, and humor-centric features separately using Transformer-based encoders, followed by a cross-attention layer to exchange information among them. Our model achieves 77.36% and 79.41% accuracy in humorous punchline detection on the UR-FUNNY and MUStARD datasets, setting a new state of the art on both with margins of 4.93% and 2.94%, respectively. Furthermore, we demonstrate that our model can capture interpretable, humor-inducing patterns from all modalities.
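The cross-attention step above can be sketched with plain numpy. This single-head version with no learned projection matrices is illustrative only; HKT's actual layer is parameterized and runs in both directions between modality pairs.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Scaled dot-product attention where one modality queries another."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (Tq, Tk) affinities
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ keys_values                    # queried-modality summary per token

rng = np.random.default_rng(1)
language = rng.normal(size=(5, 16))  # 5 language token embeddings
vision = rng.normal(size=(7, 16))    # 7 visual frame embeddings
fused = cross_attention(language, vision)
print(fused.shape)  # (5, 16): one vision-informed vector per language token
```

Because each output row is a convex combination of the other modality's vectors, the attention weights themselves are what make the exchanged information inspectable, which is how humor-inducing patterns can be read off per modality.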
Post-traumatic stress disorder (PTSD) negatively influences a person's ability to cope and increases psychiatric morbidity. Existing diagnostic tools for PTSD are often difficult to administer in marginalized communities because of language and cultural barriers, a lack of skilled clinicians, and the stigma around disclosing traumatic experiences. We present an initial proof of concept for a novel, low-cost, and creative method to screen for potential cases of PTSD based on free-hand sketches, studied within three different communities in Bangladesh: Rohingya refugees (n = 44), slum dwellers (n = 35), and engineering students (n = 85). Owing to the low overhead and nonverbal nature of sketching, the proposed method can potentially overcome communication and resource barriers. Using corner and edge detection algorithms, we extracted three features (number of corners, number of strokes, and average stroke length) from the images of the free-hand sketches. We used these features, along with the sketch theme and the participants' gender and group, to train multiple logistic regression models for potentially screening PTSD (accuracy: 82.9%-87.9%). We further improved the accuracy (99.29%) by integrating EEG data with the sketch features in a random forest model for the refugee population. Our proposed initial PTSD assessment method based on sketches could be integrated with phones and EEG headsets, making it widely accessible to underrepresented communities.
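The three sketch features above can be illustrated in a few lines of numpy. Note one assumption: the study extracts corners and edges from sketch images, whereas this sketch assumes the pen strokes are available as point sequences (as a phone touchscreen would record them) and detects corners as sharp turns in the pen direction, which is an illustrative stand-in rather than the study's image-based detector.

```python
import numpy as np

def stroke_length(stroke):
    """Total polyline length of one stroke (an (N, 2) array of pen points)."""
    return float(np.sum(np.linalg.norm(np.diff(stroke, axis=0), axis=1)))

def count_corners(stroke, angle_thresh_deg=45.0):
    """Count points where the pen direction turns more sharply than the threshold."""
    d = np.diff(stroke, axis=0)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)    # unit direction per segment
    cos_turn = np.sum(d[:-1] * d[1:], axis=1)           # cosine of each turning angle
    return int(np.sum(cos_turn < np.cos(np.radians(angle_thresh_deg))))

def sketch_features(strokes):
    lengths = [stroke_length(s) for s in strokes]
    return {"n_strokes": len(strokes),
            "avg_stroke_length": float(np.mean(lengths)),
            "n_corners": sum(count_corners(s) for s in strokes)}

# An L-shaped stroke (one 90-degree corner) and a straight stroke.
l_shape = np.array([[0, 0], [1, 0], [1, 1]], dtype=float)
line = np.array([[0, 0], [2, 0]], dtype=float)
print(sketch_features([l_shape, line]))
```

A feature dictionary of this shape, concatenated with demographic covariates, is exactly the kind of low-dimensional input a logistic regression screener can consume.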
The global refugee crisis has displaced millions of people from their homes. Although some adjust well, many suffer significant psychological distress, such as post-traumatic stress disorder (PTSD), owing to exposure to traumatic events and hardships. In these settings, diagnosis and access to psychological health care present particular challenges, raising a range of human-centered design issues. Analyzing the case of Rohingya refugees in Bangladesh, we therefore propose a two-way diagnosis of PTSD using (i) a short, inexpensive questionnaire to determine its prevalence and (ii) a low-cost, portable EEG headset to identify potential neurobiological markers of PTSD. To the best of our knowledge, this study is the first to use consumer-grade EEG devices in the scarce-resource settings of refugees. Moreover, we explored the underlying structure of PTSD and its symptoms by developing various hybrid models based on Bayesian inference, combining aspects of both reflective and formative models of PTSD, which is also a first of its kind. Our findings revealed several key components of PTSD and its neurobiological abnormalities. The challenges faced during our study can also inform the design of PTSD screening tools and treatments that incorporate the refugee experience more meaningfully during contemporary and future humanitarian crises.