Recently, generative adversarial networks and adversarial autoencoders have gained considerable attention in the machine learning community due to their exceptional performance on tasks such as digit classification and face recognition. Adversarial autoencoders map the autoencoder's bottleneck-layer outputs (termed code vectors) onto chosen noise probability distribution functions (PDFs), which can be further regularized to cluster based on class information. In addition, they allow the generation of synthetic samples by sampling code vectors from the imposed PDFs. Inspired by these properties, we investigate the application of adversarial autoencoders to the domain of emotion recognition. Specifically, we conduct experiments on two aspects: (i) their ability to encode high-dimensional feature representations of emotional utterances into a compressed space with minimal loss of emotion-class discriminability, and (ii) their ability to generate synthetic samples in the original feature space, which can later be used, for example, to train emotion recognition classifiers. We demonstrate the promise of adversarial autoencoders in both respects on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus and present our analysis.
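As a concrete illustration of the training scheme the abstract alludes to, the following is a minimal adversarial autoencoder loop written in PyTorch. All particulars here are assumptions for the sketch, not the authors' configuration: the feature and code dimensions, layer sizes, learning rates, and the standard-Gaussian prior are illustrative choices.

import torch
import torch.nn as nn

# Illustrative dimensions: a high-dimensional utterance-level feature
# vector compressed to a low-dimensional code.
FEAT_DIM, CODE_DIM = 1024, 2

encoder = nn.Sequential(nn.Linear(FEAT_DIM, 512), nn.ReLU(),
                        nn.Linear(512, CODE_DIM))
decoder = nn.Sequential(nn.Linear(CODE_DIM, 512), nn.ReLU(),
                        nn.Linear(512, FEAT_DIM))
# Discriminator tries to tell prior samples from encoder outputs.
disc = nn.Sequential(nn.Linear(CODE_DIM, 128), nn.ReLU(),
                     nn.Linear(128, 1))

bce = nn.BCEWithLogitsLoss()
mse = nn.MSELoss()
opt_ae = torch.optim.Adam(list(encoder.parameters()) +
                          list(decoder.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

def train_step(x):
    # 1) Reconstruction: make decoder(encoder(x)) match x.
    opt_ae.zero_grad()
    recon_loss = mse(decoder(encoder(x)), x)
    recon_loss.backward()
    opt_ae.step()

    # 2) Discriminator: "real" = draws from the imposed prior (here a
    #    standard Gaussian), "fake" = current code vectors.
    opt_d.zero_grad()
    prior = torch.randn(x.size(0), CODE_DIM)
    code = encoder(x).detach()
    d_loss = bce(disc(prior), torch.ones(x.size(0), 1)) + \
             bce(disc(code), torch.zeros(x.size(0), 1))
    d_loss.backward()
    opt_d.step()

    # 3) Regularization: push encoder outputs toward the prior by
    #    making them fool the discriminator.
    opt_ae.zero_grad()
    g_loss = bce(disc(encoder(x)), torch.ones(x.size(0), 1))
    g_loss.backward()
    opt_ae.step()
    return recon_loss.item(), d_loss.item(), g_loss.item()

Under this setup, synthetic samples in the original feature space would come from decoding draws from the prior, e.g. decoder(torch.randn(n, CODE_DIM)).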
Electromagnetic articulometry (EMA) was used to record the 720 phonetically balanced Harvard sentences (IEEE, 1969) from multiple speakers at normal and fast production rates. Participants produced each sentence twice, first at their preferred “normal” speaking rate and then at a “fast” rate (for a subset of the sentences, two normal-rate productions were elicited). They were instructed to produce the “fast” repetition as quickly as possible without making errors. EMA trajectories were obtained at 100 Hz from sensors placed on the tongue, lips, and mandible, corrected for head movement, and aligned to the occlusal plane. Synchronized audio was recorded at 22050 Hz. Comparison of acoustic durations for paired normal and fast utterances showed a mean length reduction of 67% and, as assessed using Mermelstein's (1975) method, an average of two fewer syllables. A comparison of inflections in vertical jaw movement between paired utterances showed an average of 2.3 fewer syllables. Cross-recurrence analysis of distance maps computed from paired sensor trajectories, comparing corresponding normal:normal and normal:fast utterance pairs, showed systematically lower determinism and entropy for the cross-rate comparisons, indicating that rate effects on articulator trajectories are not uniform. Examples of rate-related differences in gestural overlap that might account for these differences in predictability will be presented. [Work supported by NSF.]
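For readers unfamiliar with the recurrence measures named above, the sketch below shows one generic way to compute determinism and diagonal-line entropy from a cross-recurrence plot of two sensor trajectories. This is a standard cross-recurrence quantification recipe in NumPy, not necessarily the authors' exact pipeline; the threshold radius and minimum line length are illustrative parameters.

import numpy as np

def cross_recurrence_metrics(traj_a, traj_b, radius=0.1, lmin=2):
    """Determinism and diagonal-line entropy for two (time x dims)
    trajectories. radius and lmin are illustrative defaults."""
    # Distance map between every pair of time points across the signals.
    d = np.linalg.norm(traj_a[:, None, :] - traj_b[None, :, :], axis=-1)
    rec = d < radius * d.max()            # thresholded cross-recurrence plot

    # Collect lengths of diagonal runs of recurrent points.
    lengths = []
    n, m = rec.shape
    for k in range(-(n - 1), m):          # every diagonal of the plot
        run = 0
        for v in np.append(np.diagonal(rec, offset=k), False):  # sentinel
            if v:
                run += 1
            elif run:
                lengths.append(run)
                run = 0
    lengths = np.array(lengths, dtype=int)

    det_lines = lengths[lengths >= lmin]
    # DET: share of recurrent points lying on diagonal lines of length >= lmin.
    determinism = det_lines.sum() / max(rec.sum(), 1)
    # ENT: Shannon entropy of the diagonal line-length distribution.
    counts = np.bincount(det_lines)[lmin:].astype(float)
    p = counts[counts > 0] / counts.sum() if counts.sum() else np.array([1.0])
    entropy = -(p * np.log(p)).sum()
    return determinism, entropy

On this kind of analysis, lower determinism and entropy for normal:fast pairs than for matched normal:normal pairs means fewer and less varied stretches where the two trajectories shadow each other, i.e., the fast rendition is not simply a time-compressed copy of the normal one.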