Recently, generative adversarial networks and adversarial autoencoders have gained considerable attention in the machine learning community due to their exceptional performance on tasks such as digit classification and face recognition. Adversarial autoencoders map the autoencoder's bottleneck-layer outputs (termed code vectors) onto chosen noise probability distribution functions (PDFs), which can be further regularized to cluster based on class information. In addition, they allow the generation of synthetic samples by sampling code vectors from the imposed PDFs. Inspired by these properties, we investigate the application of adversarial autoencoders to the domain of emotion recognition. Specifically, we conduct experiments on two aspects: (i) their ability to encode high-dimensional feature representations of emotional utterances into a compressed space with minimal loss of emotion-class discriminability, and (ii) their ability to generate synthetic samples in the original feature space, which can later be used, for example, to train emotion recognition classifiers. We demonstrate the promise of adversarial autoencoders in both respects on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus and present our analysis.
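As a concrete illustration of the training scheme the abstract alludes to, the following is a minimal adversarial autoencoder loop written in PyTorch. All particulars here are assumptions for the sketch, not the authors' configuration: the feature and code dimensions, layer sizes, learning rates, and the standard-Gaussian prior are illustrative choices.

import torch
import torch.nn as nn

# Illustrative dimensions: a high-dimensional utterance-level feature
# vector compressed to a low-dimensional code.
FEAT_DIM, CODE_DIM = 1024, 2

encoder = nn.Sequential(nn.Linear(FEAT_DIM, 512), nn.ReLU(),
                        nn.Linear(512, CODE_DIM))
decoder = nn.Sequential(nn.Linear(CODE_DIM, 512), nn.ReLU(),
                        nn.Linear(512, FEAT_DIM))
# Discriminator tries to tell prior samples from encoder outputs.
disc = nn.Sequential(nn.Linear(CODE_DIM, 128), nn.ReLU(),
                     nn.Linear(128, 1))

bce = nn.BCEWithLogitsLoss()
mse = nn.MSELoss()
opt_ae = torch.optim.Adam(list(encoder.parameters()) +
                          list(decoder.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

def train_step(x):
    # 1) Reconstruction: make decoder(encoder(x)) match x.
    opt_ae.zero_grad()
    recon_loss = mse(decoder(encoder(x)), x)
    recon_loss.backward()
    opt_ae.step()

    # 2) Discriminator: "real" = draws from the imposed prior (here a
    #    standard Gaussian), "fake" = current code vectors.
    opt_d.zero_grad()
    prior = torch.randn(x.size(0), CODE_DIM)
    code = encoder(x).detach()
    d_loss = bce(disc(prior), torch.ones(x.size(0), 1)) + \
             bce(disc(code), torch.zeros(x.size(0), 1))
    d_loss.backward()
    opt_d.step()

    # 3) Regularization: push encoder outputs toward the prior by
    #    making them fool the discriminator.
    opt_ae.zero_grad()
    g_loss = bce(disc(encoder(x)), torch.ones(x.size(0), 1))
    g_loss.backward()
    opt_ae.step()
    return recon_loss.item(), d_loss.item(), g_loss.item()

Under this setup, synthetic samples in the original feature space would come from decoding draws from the prior, e.g. decoder(torch.randn(n, CODE_DIM)).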
Electromagnetic articulometry (EMA) was used to record the 720 phonetically balanced Harvard sentences (IEEE, 1969) from multiple speakers at normal and fast production rates. Participants produced each sentence twice, first at their preferred “normal” speaking rate and then at a “fast” rate (for a subset of the sentences, two normal-rate productions were elicited). They were instructed to produce the “fast” repetition as quickly as possible without making errors. EMA trajectories were obtained at 100 Hz from sensors placed on the tongue, lips, and mandible, corrected for head movement, and aligned to the occlusal plane. Synchronized audio was recorded at 22050 Hz. Comparison of acoustic durations for paired normal and fast utterances showed a mean length reduction of 67% and, as assessed using Mermelstein's (1975) method, an average of two fewer syllables. A comparison of inflections in vertical jaw movement between paired utterances showed an average of 2.3 fewer syllables. Cross-recurrence analysis of distance maps computed from paired sensor trajectories, comparing corresponding normal:normal and normal:fast utterance pairs, showed systematically lower determinism and entropy for the cross-rate comparisons, indicating that rate effects on articulator trajectories are not uniform. Examples of rate-related differences in gestural overlap that might account for these differences in predictability will be presented. [Work supported by NSF.]
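For readers unfamiliar with the recurrence measures named above, the sketch below shows one generic way to compute determinism and diagonal-line entropy from a cross-recurrence plot of two sensor trajectories. This is a standard cross-recurrence quantification recipe in NumPy, not necessarily the authors' exact pipeline; the threshold radius and minimum line length are illustrative parameters.

import numpy as np

def cross_recurrence_metrics(traj_a, traj_b, radius=0.1, lmin=2):
    """Determinism and diagonal-line entropy for two (time x dims)
    trajectories. radius and lmin are illustrative defaults."""
    # Distance map between every pair of time points across the signals.
    d = np.linalg.norm(traj_a[:, None, :] - traj_b[None, :, :], axis=-1)
    rec = d < radius * d.max()            # thresholded cross-recurrence plot

    # Collect lengths of diagonal runs of recurrent points.
    lengths = []
    n, m = rec.shape
    for k in range(-(n - 1), m):          # every diagonal of the plot
        run = 0
        for v in np.append(np.diagonal(rec, offset=k), False):  # sentinel
            if v:
                run += 1
            elif run:
                lengths.append(run)
                run = 0
    lengths = np.array(lengths, dtype=int)

    det_lines = lengths[lengths >= lmin]
    # DET: share of recurrent points lying on diagonal lines of length >= lmin.
    determinism = det_lines.sum() / max(rec.sum(), 1)
    # ENT: Shannon entropy of the diagonal line-length distribution.
    counts = np.bincount(det_lines)[lmin:].astype(float)
    p = counts[counts > 0] / counts.sum() if counts.sum() else np.array([1.0])
    entropy = -(p * np.log(p)).sum()
    return determinism, entropy

On this kind of analysis, lower determinism and entropy for normal:fast pairs than for matched normal:normal pairs means fewer and less varied stretches where the two trajectories shadow each other, i.e., the fast rendition is not simply a time-compressed copy of the normal one.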