2019 Digital Image Computing: Techniques and Applications (DICTA)
DOI: 10.1109/dicta47822.2019.8945863
Deep Latent Space Learning for Cross-Modal Mapping of Audio and Visual Signals

Abstract: We propose a novel deep training algorithm for joint representation of audio and visual information, consisting of a single stream network (SSNet) coupled with a novel loss function to learn a shared deep latent space representation of multimodal information. The proposed framework characterizes the shared latent space by leveraging class centers, which eliminates the need for pairwise or triplet supervision. We quantitatively and qualitatively evaluate the proposed approach on VoxCeleb, a benchma…
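The abstract's core idea, mapping both modalities toward shared per-identity class centers instead of mining pairs or triplets, can be illustrated with a minimal sketch. This is not the paper's exact loss; the function name `center_loss` and the plain squared-distance formulation are assumptions chosen for clarity.

```python
import numpy as np

def center_loss(audio_emb, face_emb, labels, centers):
    """Pull embeddings from both modalities toward their shared class center.

    audio_emb, face_emb : (N, D) embeddings from each modality network
    labels              : (N,) integer identity labels
    centers             : (C, D) learnable per-identity centers in the shared space

    Because each sample is compared only to its own class center, no
    pairwise or triplet sampling across modalities is required.
    """
    c = centers[labels]                        # (N, D) center for each sample
    la = np.sum((audio_emb - c) ** 2, axis=1)  # audio-to-center distances
    lf = np.sum((face_emb - c) ** 2, axis=1)   # face-to-center distances
    return float(np.mean(la + lf))
```

In training, the centers would be updated jointly with the network parameters; at the optimum both modalities' embeddings of the same identity collapse toward one point, giving the shared latent space used for cross-modal retrieval.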

Cited by 26 publications (13 citation statements)
References 24 publications
“…Cross-modal processing has been recently used in different combinations such as audio-video [15,14,16,17] and speech-text [18]. The common approach in these studies is to map inputs from different modalities into a shared space to achieve cross-modal retrieval.…”
Section: Related Work
confidence: 99%
“…In [16], same-different classification is performed on the cosine scores between face and voice embeddings to train the system. In [17], a novel loss function is proposed to learn the embeddings in a shared space. Their loss function tries preserving neighborhood constraints within and across modalities.…”
Section: Related Work
confidence: 99%