Beyond Mono to Binaural: Generating Binaural Audio from Mono Audio with Depth and Cross Modal Attention

Parida, Kranti Kumar; Srivastava, Siddharth; Sharma, Gaurav

doi:10.1109/wacv51458.2022.00221

Cited by 12 publications

(2 citation statements)

References 24 publications


“…Researchers have used these to solve the problem in other domains, such as image, audio, and video. Various transformers are proposed to handle a set of modalities such as video with text, image with text, and image with depth [52]. These are famous as Multimodal transformers [53], [54].…”

Section: Previous Work (mentioning)

confidence: 99%

Learning Speaker-specific Lip-to-Speech Generation

Varshney, Yadav, Namboodiri et al. 2022

Preprint


Understanding lip movement and inferring speech from it is notoriously difficult for the average person. Accurate lip-reading is helped by various cues from the speaker and from the contextual or environmental setting. Every speaker has a different accent and speaking style, which can be inferred from their visual and speech features. This work aims to understand the correlation/mapping between speech and the sequence of lip movements of individual speakers in an unconstrained, large-vocabulary setting. We model the frame sequence as a prior to the transformer in an auto-encoder setting and learn a joint embedding that exploits temporal properties of both audio and video. We learn temporal synchronization using deep metric learning, which guides the decoder to generate speech in sync with the input lip movements. The predictive posterior thus gives us the generated speech in the speaker's speaking style. We trained our model on the GRID and Lip2Wav Chemistry lecture datasets to evaluate single-speaker natural speech generation from lip movement in an unconstrained natural setting. Extensive evaluation using various qualitative and quantitative metrics, together with human evaluation, shows that our method outperforms the state of the art on the Lip2Wav Chemistry dataset (large vocabulary in an unconstrained setting) by a good margin across almost all evaluation metrics and marginally outperforms the state of the art on the GRID dataset.



“…It would be nice if the two approaches can be connected together. This paper showed a method for generating binaural audio with a deep neural network instead of HRTFs [11]. They also included a system to extract positional information from images, and with the model, they were also able to estimate a depth map to aid the generation of the final binaural signal, which is something we have not implemented in our system.…”

Section: Related Work (mentioning)

confidence: 99%

An Context-Aware Intelligent System to Automate the Conversion of 2D Audio to 3D Audio using Signal Processing and Machine Learning

Gao, Sun 2022

Artificial Intelligence and Fuzzy Logic System


As virtual reality technologies emerge, the ability to create visually immersive experiences has drastically improved [1]. However, to accompany the visual immersion, audio must also become more immersive [2]. This is where 3D audio comes in. 3D audio allows for the simulation of sounds from specific directions, giving a more realistic feeling [3]. At present, there is a lack of sufficient tools for users to design immersive audio experiences that fully exploit the abilities of 3D audio. This paper proposes and implements the following systems [4]: (1) automatic separation of stems from the incoming audio file, or letting the user upload the stems themselves; (2) a simulated environment in which the separated stems are automatically placed; and (3) a user interface for manipulating the simulated positions of the separated stems. We applied our application to a few selected audio files in order to conduct a qualitative evaluation of our approach. The results show that our approach was able to successfully separate the stems and simulate a dimensional sound effect.


Foundation Models for Speech, Images, Videos, and Control

Paaß, Giesselbach 2023

Artificial Intelligence: Foundations, Theory, and Algorithms


Foundation Models are able to model not only tokens of natural language but also token elements of arbitrary sequences. For images, square image patches can be represented as tokens; for videos, we can define tubelets that span an image patch across multiple frames. Subsequently, the proven self-attention algorithms can be applied to these tokens. Most importantly, several modalities like text and images can be processed in the same sequence allowing, for instance, the generation of images from text and text descriptions from video. In addition, the models are scalable to very large networks and huge datasets. The following multimedia types are covered in the subsequent sections. Speech recognition and text-to-speech models describe the translation of spoken language into text and vice versa. Image processing aims to interpret images, describe them by captions, and generate new images according to textual descriptions. Video interpretation aims at recognizing actions in videos and describing them through text. Furthermore, new videos can be created according to a textual description. Dynamical system trajectories characterize sequential decision problems, which can be simulated and controlled. DNA and protein sequences can be analyzed with Foundation Models to predict the structure and properties of the corresponding molecules.

