Ashish Sardana scite author profile

2020

Sentiment Analysis and Emotion Detection in conversation are key in several real-world applications, with an increase in available modalities aiding a better understanding of the underlying emotions. Multi-modal Emotion Detection and Sentiment Analysis can be particularly useful, as applications will be able to use specific subsets of the available modalities, depending on the available data. Current systems dealing with multi-modal functionality fail to leverage and capture the context of the conversation through all modalities, the dependency between the listener(s) and speaker emotional states, and the relevance and relationship between the available modalities. In this paper, we propose an end-to-end RNN architecture that attempts to address all of the mentioned drawbacks. Our proposed model, at the time of writing, outperforms the state of the art on a benchmark dataset on a variety of accuracy and regression metrics.

* This work was pursued while the author was an intern at NVIDIA Graphics, Bengaluru.

Speech Prediction in Silent Videos Using Variational Autoencoders

Sardana², et al. 2021

Understanding the relationship between auditory and visual signals is crucial for many applications, ranging from computer-generated imagery (CGI) and video editing automation to assisting people with hearing or visual impairments. However, this is challenging because the distributions of both the audio and visual modalities are inherently multimodal. Most existing methods nevertheless ignore the multimodal aspect and assume that there exists only a deterministic one-to-one mapping between the two modalities. This can lead to low-quality predictions, as the model collapses to optimizing the average behavior rather than learning the full data distributions. In this paper, we present a stochastic model for generating speech in a silent video. The proposed model combines recurrent neural networks and variational deep generative models to learn the auditory signal's conditional distribution given the visual signal. We demonstrate the performance of our model on the GRID dataset based on standard benchmarks.
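The averaging failure described in this abstract can be illustrated with a toy example. The sketch below is not the paper's model: it assumes a hypothetical one-dimensional target that is -1 or +1 with equal probability for the same input, and the helper names `mse` and `best_constant_prediction` are made up for illustration. It shows that the single prediction minimizing mean squared error is the mean of the two modes, which matches neither of them, whereas a sampling-based predictor returns actual modes.

```python
import random


def mse(prediction, targets):
    """Mean squared error of one constant prediction against all targets."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


def best_constant_prediction(targets, candidates):
    """Return the candidate with the lowest MSE (brute-force search)."""
    return min(candidates, key=lambda c: mse(c, targets))


# Hypothetical bimodal targets: the same input maps to -1 or +1 equally often.
targets = [-1.0, 1.0] * 50

# Candidate predictions from -1.0 to 1.0 in steps of 0.1.
candidates = [i / 10 for i in range(-10, 11)]

# A deterministic one-to-one mapping trained with MSE lands on the mean.
deterministic = best_constant_prediction(targets, candidates)

# A stochastic predictor (assumed here to know the two modes) samples one
# of them instead of averaging.
rng = random.Random(0)
samples = [rng.choice([-1.0, 1.0]) for _ in range(5)]

print("MSE-optimal single prediction:", deterministic)
print("Samples from a stochastic predictor:", samples)
```

The deterministic prediction comes out as 0.0, a value that never occurs in the data. A variational model of the kind the abstract proposes aims for the second behavior by learning a conditional distribution to sample from rather than a single point estimate.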

Bridged Variational Autoencoders for Joint Modeling of Images and Attributes

et al. 2020

Generative models have recently shown the ability to generate realistic data and to model the underlying distribution accurately. However, joint modeling of an image with the attribute it is labeled with requires learning a cross-modal correspondence between image and attribute data. Though the information present in a set of images and in its attributes possesses completely different statistical properties, there exists an inherent correspondence that is challenging to capture. Various models have aimed at capturing this correspondence either through joint modeling with a variational autoencoder or through separate encoder networks whose outputs are then concatenated. We present an alternative by proposing a bridged variational autoencoder that allows for learning cross-modal correspondence by incorporating cross-modal hallucination losses in the latent space. In comparison to existing methods, we have found that by using a bridge connection in latent space we not only obtain better generation results, but also obtain a highly parameter-efficient model, which provides a 40% reduction in training parameters for a bimodal dataset and a nearly 70% reduction for a trimodal dataset. We validate the proposed method through comparison with state-of-the-art methods and benchmarking on standard datasets.

Learning to Predict Speech in Silent Videos Via Audiovisual Analogy

Sardana², et al. 2022

Speech Prediction in Silent Videos using Variational Autoencoders

Yadav¹, Sardana², Namboodiri³ et al. 2020

Preprint

Stochastic Talking Face Generation Using Latent Distribution Matching

et al. 2020

Evaluating advanced machine learning techniques for pulsar detection from HTRU survey

Punia, Subashini 2017

NDNet

Mandhana¹, Agrawal², Sardana³ 2021