2019 Digital Image Computing: Techniques and Applications (DICTA)
DOI: 10.1109/dicta47822.2019.8945863
Deep Latent Space Learning for Cross-Modal Mapping of Audio and Visual Signals

Abstract: We propose a novel deep training algorithm for joint representation of audio and visual information, consisting of a single stream network (SSNet) coupled with a novel loss function to learn a shared deep latent space representation of multimodal information. The proposed framework characterizes the shared latent space by leveraging class centers, which eliminates the need for pairwise or triplet supervision. We quantitatively and qualitatively evaluate the proposed approach on VoxCeleb, a benchma…
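The abstract's core idea, mapping both modalities toward shared per-identity class centers instead of mining pairs or triplets, can be illustrated with a minimal sketch. This is not the paper's exact loss; the function name `center_loss` and the plain squared-distance formulation are assumptions chosen for clarity.

```python
import numpy as np

def center_loss(audio_emb, face_emb, labels, centers):
    """Pull embeddings from both modalities toward their shared class center.

    audio_emb, face_emb : (N, D) embeddings from each modality network
    labels              : (N,) integer identity labels
    centers             : (C, D) learnable per-identity centers in the shared space

    Because each sample is compared only to its own class center, no
    pairwise or triplet sampling across modalities is required.
    """
    c = centers[labels]                        # (N, D) center for each sample
    la = np.sum((audio_emb - c) ** 2, axis=1)  # audio-to-center distances
    lf = np.sum((face_emb - c) ** 2, axis=1)   # face-to-center distances
    return float(np.mean(la + lf))
```

In training, the centers would be updated jointly with the network parameters; at the optimum both modalities' embeddings of the same identity collapse toward one point, giving the shared latent space used for cross-modal retrieval.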

Cited by 26 publications (13 citation statements)
References 24 publications
“…Cross-modal processing has been recently used in different combinations such as audio-video [15,14,16,17] and speech-text [18]. The common approach in these studies is to map inputs from different modalities into a shared space to achieve cross-modal retrieval.…”
Section: Related Work
confidence: 99%
“…In [16], same-different classification is performed on the cosine scores between face and voice embeddings to train the system. In [17], a novel loss function is proposed to learn the embeddings in a shared space. Their loss function tries preserving neighborhood constraints within and across modalities.…”
Section: Related Work
confidence: 99%