Abstract: Most prior studies in the spatial Direction of Arrival (DoA) domain focus on a single modality. However, humans use both auditory and visual senses to detect the presence of sound sources. With this motivation, we propose to use neural networks with audio and visual signals for multi-speaker localization. The use of heterogeneous sensors can provide complementary information to overcome uni-modal challenges, such as noise, reverberation, illumination variations, and occlusions. We attempt to address these is…
“…7) most of the time, although the improvements are not as high as under conditions with missing visual frames (Sys. 1–6).…”
Section: Ablation Studies
“…Speech is the most natural way of communication between humans. Therefore, the study and development of human-machine interaction systems, such as active speaker detection [1], speaker localization [2], speech recognition [3], and emotion recognition [4], constitutes an important part of today's research. However, these algorithms are adversely affected by the presence of interfering speakers and acoustic noise.…”
The speaker extraction technique seeks to single out the voice of a target speaker from the interfering voices in a speech mixture. Typically, an auxiliary reference of the target speaker is used to form voluntary attention; either a pre-recorded utterance or synchronized lip movement in a video clip can serve as this reference. The use of a visual cue is not only feasible but also effective due to its robustness to acoustic noise, and it is becoming popular. However, it is difficult to guarantee that such a parallel visual cue is always available in real-world applications, where visual occlusion or intermittent communication can occur. In this paper, we study audio-visual speaker extraction algorithms with an intermittent visual cue. We propose a joint speaker extraction and visual embedding inpainting framework to exploit the mutual benefits of the two tasks. To encourage interaction between the two tasks, they are performed alternately with an interlacing structure and optimized jointly. We also propose two types of visual inpainting losses and study our method with two widely used types of visual embeddings. The experimental results show that we outperform the baseline in terms of signal quality, perceptual quality, and intelligibility.
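To make the interlacing structure concrete, here is a minimal PyTorch sketch of how speaker extraction and visual-embedding inpainting could alternate within one network and be optimized jointly. The module names, feature dimensions, and number of stages are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class InterlacedBlock(nn.Module):
    """One extraction step followed by one visual-inpainting step."""
    def __init__(self, a_dim=256, v_dim=512):
        super().__init__()
        # Extraction: refine audio features conditioned on the current visual cue.
        self.extract = nn.Sequential(nn.Linear(a_dim + v_dim, a_dim), nn.ReLU())
        # Inpainting: refine visual embeddings (fill missing frames) conditioned on audio.
        self.inpaint = nn.Sequential(nn.Linear(a_dim + v_dim, v_dim), nn.ReLU())

    def forward(self, a, v):
        # a: (B, T, a_dim) audio features; v: (B, T, v_dim) visual embeddings,
        # zeroed at frames where the visual cue is missing (assumed time-aligned).
        a = self.extract(torch.cat([a, v], dim=-1))
        v = self.inpaint(torch.cat([a, v], dim=-1))
        return a, v

class InterlacedExtractor(nn.Module):
    """Alternates the two tasks stage by stage, mirroring the interlacing structure."""
    def __init__(self, n_stages=4):
        super().__init__()
        self.stages = nn.ModuleList(InterlacedBlock() for _ in range(n_stages))

    def forward(self, a, v):
        for stage in self.stages:
            a, v = stage(a, v)
        return a, v  # both outputs receive a loss, so the tasks are optimized jointly
```

Under this sketch, a waveform reconstruction loss on the extraction output and an inpainting loss (e.g., a distance between inpainted and ground-truth visual embeddings at the missing frames) would be summed for joint training; the paper's two proposed inpainting losses would plug in at that point.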
“…However, the performance of computer processing of speech, such as automatic speech recognition [1], speaker localization [2], active speaker detection [3], and speech emotion recognition [4], degrades dramatically in the presence of interfering speakers. This prompts us to study ways to extract speech in a manner similar to how humans perceive it.…”
Speaker extraction seeks to extract the clean speech of a target speaker from multi-talker mixture speech. Prior studies have used a pre-recorded speech sample or a face image of the target speaker as the speaker cue. In human communication, co-speech gestures that are naturally timed with speech also contribute to speech perception. In this work, we explore the use of co-speech gesture sequences, e.g., hand and body movements, as the speaker cue for speaker extraction; such cues can be obtained from low-resolution video recordings and are thus more readily available than face recordings. We propose two networks that use the co-speech gesture cue to perform attentive listening to the target speaker: one implicitly fuses the co-speech gesture cue into the speaker extraction process, while the other performs speech separation first and then explicitly uses the co-speech gesture cue to associate a separated speech stream with the target speaker. The experimental results show that the co-speech gesture cue is informative for associating with the target speaker.
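As a rough illustration of the second design (separate first, then associate), the sketch below separates the mixture into two candidate streams and picks the one whose embedding best matches the gesture embedding. The separator and the two encoders are assumed to be given; all names and shapes are hypothetical, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GestureAssociator(nn.Module):
    """Separate-then-associate: pick the separated stream matching the gesture cue."""
    def __init__(self, separator, gesture_encoder, speech_encoder):
        super().__init__()
        self.separator = separator              # mixture (B, T) -> candidates (B, 2, T)
        self.gesture_encoder = gesture_encoder  # gesture keypoints -> (B, D) embedding
        self.speech_encoder = speech_encoder    # waveform (B, T) -> (B, D) embedding

    def forward(self, mixture, gestures):
        candidates = self.separator(mixture)          # (B, 2, T) candidate speech
        g = self.gesture_encoder(gestures)            # (B, D) gesture embedding
        # Score each separated stream against the co-speech gesture embedding.
        scores = torch.stack(
            [F.cosine_similarity(self.speech_encoder(candidates[:, i]), g, dim=-1)
             for i in range(candidates.size(1))],
            dim=1)                                    # (B, 2)
        idx = scores.argmax(dim=1)                    # best-matching stream per example
        return candidates[torch.arange(candidates.size(0)), idx]  # (B, T)
```

A contrastive or cross-entropy objective over `scores` would be a natural way to train the association step, though the exact training objective here is an assumption.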
“…The acoustic environment during real-world human-robot interaction can be described as a cocktail party [1], where the speech from a speaker of interest, i.e., the target speaker, is often corrupted by interfering speakers and background noise. In such a scenario, speech separation or speaker extraction algorithms are usually needed to extract the clean speech signal of the target speaker [2,3], which is a crucial step for downstream applications such as hearing aid development [4], automatic speech recognition [5], and source localization [6].…”
A speaker extraction algorithm extracts the target speech from mixture speech containing interfering speech and background noise. The extraction process sometimes over-suppresses the extracted target speech, which not only creates audible artifacts but also harms the performance of downstream automatic speech recognition algorithms. We propose a hybrid continuity loss function for time-domain speaker extraction algorithms to address the over-suppression problem. On top of the waveform-level loss used for superior signal quality, i.e., SI-SDR, we introduce a multi-resolution delta spectrum loss in the frequency domain to ensure the continuity of the extracted speech signal, thus alleviating the over-suppression. We examine the hybrid continuity loss function using a time-domain audio-visual speaker extraction algorithm on the YouTube LRS2-BBC dataset. Experimental results show that the proposed loss function reduces over-suppression and improves the word error rate of speech recognition on both clean and noisy two-speaker mixtures, without harming the reconstructed speech quality.
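As a hedged sketch of the hybrid loss described above, the following combines a negative SI-SDR term on the waveform with a multi-resolution delta-spectrum term that penalizes mismatched frame-to-frame spectral changes. The FFT sizes, the first-order delta, and the weight `alpha` are illustrative assumptions rather than the paper's exact configuration.

```python
import torch

def si_sdr_loss(est, ref, eps=1e-8):
    # est, ref: (B, T) waveforms; returns negative SI-SDR, to be minimized.
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (torch.sum(est * ref, -1, keepdim=True)
            / (torch.sum(ref ** 2, -1, keepdim=True) + eps)) * ref
    noise = est - proj
    ratio = torch.sum(proj ** 2, -1) / (torch.sum(noise ** 2, -1) + eps)
    return -10 * torch.log10(ratio + eps).mean()

def delta_spectrum_loss(est, ref, fft_sizes=(512, 1024, 2048), eps=1e-8):
    # Penalize differences in frame-to-frame log-magnitude change across several
    # STFT resolutions, encouraging temporal continuity of the extracted speech.
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=est.device)
        def log_mag(x):
            spec = torch.stft(x, n_fft, hop_length=n_fft // 4,
                              window=window, return_complex=True)
            return torch.log(spec.abs() + eps)
        d_est = log_mag(est).diff(dim=-1)  # delta along the time-frame axis
        d_ref = log_mag(ref).diff(dim=-1)
        loss = loss + (d_est - d_ref).abs().mean()
    return loss / len(fft_sizes)

def hybrid_loss(est, ref, alpha=0.5):
    # Waveform-level SI-SDR term plus the frequency-domain continuity term.
    return si_sdr_loss(est, ref) + alpha * delta_spectrum_loss(est, ref)
```

The intuition is that over-suppressed segments show abrupt spectral drops relative to the reference; matching the delta (frame-to-frame change) of the log spectrum at several resolutions penalizes such discontinuities without constraining absolute energy the way a plain spectral loss would.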