ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054664

Training Spoken Language Understanding Systems with Non-Parallel Speech and Text

Cited by 11 publications (7 citation statements) | References 15 publications

“…Cross-modal processing has been recently used in different combinations such as audio-video [15,14,16,17] and speech-text [18]. The common approach in these studies is to map inputs from different modalities into a shared space to achieve cross-modal retrieval.…”
Section: Related Work (mentioning)
confidence: 99%
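
The "shared space" approach summarized in this excerpt can be made concrete with a short sketch. The following is a minimal, hypothetical PyTorch example, not the method of this paper or of any specific cited work: two small projection networks map speech and text features into one embedding space, and a symmetric contrastive loss pulls matched (speech, text) pairs together so that nearest-neighbor search in that space performs cross-modal retrieval. All dimensions and architectures here are illustrative assumptions.

# Minimal sketch of a shared speech-text embedding space (assumed setup).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceEncoders(nn.Module):
    """Projects speech and text features into one shared embedding space."""
    def __init__(self, speech_dim=80, text_dim=300, shared_dim=256):
        super().__init__()
        self.speech_proj = nn.Sequential(
            nn.Linear(speech_dim, shared_dim), nn.ReLU(),
            nn.Linear(shared_dim, shared_dim),
        )
        self.text_proj = nn.Sequential(
            nn.Linear(text_dim, shared_dim), nn.ReLU(),
            nn.Linear(shared_dim, shared_dim),
        )

    def forward(self, speech_feats, text_feats):
        # L2-normalize so dot products are cosine similarities.
        s = F.normalize(self.speech_proj(speech_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return s, t

def contrastive_loss(s, t, temperature=0.07):
    """Symmetric InfoNCE: each matched (speech, text) pair must score
    higher than all mismatched pairs in the batch."""
    logits = s @ t.T / temperature        # (B, B) similarity matrix
    targets = torch.arange(s.size(0))     # diagonal entries are positives
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2

# Toy usage with pooled utterance-level features (random stand-ins).
model = SharedSpaceEncoders()
speech = torch.randn(8, 80)    # e.g., mean-pooled filterbank features
text = torch.randn(8, 300)     # e.g., averaged word embeddings
s, t = model(speech, text)
contrastive_loss(s, t).backward()

Once trained, retrieval reduces to encoding a query from one modality and ranking items from the other modality by cosine similarity in the shared space.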
“…For most of its history, SLU was developed in a pipelined fashion, with ASR feeding text to a natural language understanding system; e.g., to the best of our knowledge, the only published use of SLU with knowledge graphs that fits this description is (Woods, 1975). Recent research in end-to-end multimodal SLU bypasses the need for ASR by leveraging a parallel modality such as image (Harwath et al, 2016; Kamper et al, 2019) or video (Sanabria et al, 2018), or a non-parallel corpus of text (Sarı et al, 2020), to guide learning speech embeddings such that the speech input can be used in a downstream task.…”
Section: Related Work: Multimodal SLU (mentioning)
confidence: 99%
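
The pipelined-versus-end-to-end distinction drawn in this excerpt can also be sketched in code. This is a hypothetical Python/PyTorch illustration; run_asr, nlu_classify, the encoder, and the intent head are placeholder names under assumed shapes, not components of any cited system.

import torch
import torch.nn as nn

# Pipelined SLU: speech -> ASR transcript -> text-based NLU.
def pipelined_slu(waveform, run_asr, nlu_classify):
    transcript = run_asr(waveform)    # hypothetical ASR component
    return nlu_classify(transcript)   # hypothetical text-NLU component

# End-to-end SLU: a speech encoder (e.g., one whose embeddings were guided
# by another modality or by non-parallel text) feeds the task head directly,
# with no intermediate transcript.
class EndToEndSLU(nn.Module):
    def __init__(self, speech_encoder, embed_dim=256, num_intents=10):
        super().__init__()
        self.encoder = speech_encoder
        self.intent_head = nn.Linear(embed_dim, num_intents)

    def forward(self, speech_feats):
        return self.intent_head(self.encoder(speech_feats))

# Toy usage: any encoder mapping (B, feat_dim) -> (B, embed_dim) fits.
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU())
logits = EndToEndSLU(encoder)(torch.randn(4, 80))  # (4, num_intents)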