Abstract: Most prior studies in the spatial Direction of Arrival (DoA) domain focus on a single modality. However, humans use both auditory and visual senses to detect the presence of sound sources. With this motivation, we propose to use neural networks with audio and visual signals for multi-speaker localization. The use of heterogeneous sensors can provide complementary information to overcome uni-modal challenges, such as noise, reverberation, illumination variations, and occlusions. We attempt to address these is…
“…7) most of the time, although the improvements are not as high as under conditions with missing visual frames (Sys. 1–6).…”
Section: Ablation Studies
“…Speech is the most natural way of communication between humans. Therefore, the study and development of human-machine interaction systems, such as active speaker detection [1], speaker localization [2], speech recognition [3], and emotion recognition [4], constitutes an important part of today's research. However, these algorithms are adversely affected by the presence of interfering speakers and acoustic noise.…”
The speaker extraction technique seeks to single out the voice of a target speaker from the interfering voices in a speech mixture. Typically, an auxiliary reference of the target speaker is used to form voluntary attention; either a pre-recorded utterance or synchronized lip movement in a video clip can serve as this reference. The use of a visual cue is not only feasible but also effective due to its robustness to acoustic noise, and it is becoming popular. However, it is difficult to guarantee that such a parallel visual cue is always available in real-world applications, where visual occlusion or intermittent communication can occur. In this paper, we study audio-visual speaker extraction algorithms with an intermittent visual cue. We propose a joint speaker extraction and visual embedding inpainting framework to exploit the mutual benefits of the two tasks. To encourage interaction between the two tasks, they are performed alternately with an interlacing structure and optimized jointly. We also propose two types of visual inpainting losses and study our method with two widely used types of visual embeddings. The experimental results show that we outperform the baseline in terms of signal quality, perceptual quality, and intelligibility.
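To make the interlacing structure concrete, here is a minimal PyTorch sketch of how speaker extraction and visual-embedding inpainting could alternate within one network and be optimized jointly. The module names, feature dimensions, and number of stages are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class InterlacedBlock(nn.Module):
    """One extraction step followed by one visual-inpainting step."""
    def __init__(self, a_dim=256, v_dim=512):
        super().__init__()
        # Extraction: refine audio features conditioned on the current visual cue.
        self.extract = nn.Sequential(nn.Linear(a_dim + v_dim, a_dim), nn.ReLU())
        # Inpainting: refine visual embeddings (fill missing frames) conditioned on audio.
        self.inpaint = nn.Sequential(nn.Linear(a_dim + v_dim, v_dim), nn.ReLU())

    def forward(self, a, v):
        # a: (B, T, a_dim) audio features; v: (B, T, v_dim) visual embeddings,
        # zeroed at frames where the visual cue is missing (assumed time-aligned).
        a = self.extract(torch.cat([a, v], dim=-1))
        v = self.inpaint(torch.cat([a, v], dim=-1))
        return a, v

class InterlacedExtractor(nn.Module):
    """Alternates the two tasks stage by stage, mirroring the interlacing structure."""
    def __init__(self, n_stages=4):
        super().__init__()
        self.stages = nn.ModuleList(InterlacedBlock() for _ in range(n_stages))

    def forward(self, a, v):
        for stage in self.stages:
            a, v = stage(a, v)
        return a, v  # both outputs receive a loss, so the tasks are optimized jointly
```

Under this sketch, a waveform reconstruction loss on the extraction output and an inpainting loss (e.g., a distance between inpainted and ground-truth visual embeddings at the missing frames) would be summed for joint training; the paper's two proposed inpainting losses would plug in at that point.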
“…However, the performance of computer processing of speech, such as automatic speech recognition [1], speaker localization [2], active speaker detection [3], and speech emotion recognition [4], degrades dramatically in the presence of interfering speakers. This prompts us to study ways to extract speech in a manner similar to how humans perceive it.…”
Speaker extraction seeks to extract the clean speech of a target speaker from multi-talker mixture speech. Prior studies have used a pre-recorded speech sample or a face image of the target speaker as the speaker cue. In human communication, co-speech gestures that are naturally timed with speech also contribute to speech perception. In this work, we explore the use of co-speech gesture sequences, e.g., hand and body movements, as the speaker cue for speaker extraction; such cues can be obtained from low-resolution video recordings and are thus more readily available than face recordings. We propose two networks that use the co-speech gesture cue to perform attentive listening to the target speaker: one implicitly fuses the co-speech gesture cue into the speaker extraction process, while the other performs speech separation first and then explicitly uses the co-speech gesture cue to associate a separated speech stream with the target speaker. The experimental results show that the co-speech gesture cue is informative for associating with the target speaker.
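As a rough illustration of the second design (separate first, then associate), the sketch below separates the mixture into two candidate streams and picks the one whose embedding best matches the gesture embedding. The separator and the two encoders are assumed to be given; all names and shapes are hypothetical, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GestureAssociator(nn.Module):
    """Separate-then-associate: pick the separated stream matching the gesture cue."""
    def __init__(self, separator, gesture_encoder, speech_encoder):
        super().__init__()
        self.separator = separator              # mixture (B, T) -> candidates (B, 2, T)
        self.gesture_encoder = gesture_encoder  # gesture keypoints -> (B, D) embedding
        self.speech_encoder = speech_encoder    # waveform (B, T) -> (B, D) embedding

    def forward(self, mixture, gestures):
        candidates = self.separator(mixture)          # (B, 2, T) candidate speech
        g = self.gesture_encoder(gestures)            # (B, D) gesture embedding
        # Score each separated stream against the co-speech gesture embedding.
        scores = torch.stack(
            [F.cosine_similarity(self.speech_encoder(candidates[:, i]), g, dim=-1)
             for i in range(candidates.size(1))],
            dim=1)                                    # (B, 2)
        idx = scores.argmax(dim=1)                    # best-matching stream per example
        return candidates[torch.arange(candidates.size(0)), idx]  # (B, T)
```

A contrastive or cross-entropy objective over `scores` would be a natural way to train the association step, though the exact training objective here is an assumption.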
“…The acoustic environment during real-world human-robot interaction can be described as a cocktail party [1], where the speech from a speaker of interest, i.e., the target speaker, is often corrupted by interfering speakers and background noise. In such a scenario, speech separation or speaker extraction algorithms are usually needed to extract the clean speech signal of the target speaker [2,3], which is a crucial step for downstream applications such as hearing aid development [4], automatic speech recognition [5], and source localization [6].…”
A speaker extraction algorithm extracts the target speech from mixture speech containing interfering speech and background noise. The extraction process sometimes over-suppresses the extracted target speech, which not only creates audible artifacts but also harms the performance of downstream automatic speech recognition algorithms. We propose a hybrid continuity loss function for time-domain speaker extraction algorithms to address the over-suppression problem. On top of the waveform-level loss used for superior signal quality, i.e., SI-SDR, we introduce a multi-resolution delta spectrum loss in the frequency domain to ensure the continuity of the extracted speech signal, thus alleviating the over-suppression. We examine the hybrid continuity loss function using a time-domain audio-visual speaker extraction algorithm on the YouTube LRS2-BBC dataset. Experimental results show that the proposed loss function reduces over-suppression and improves the word error rate of speech recognition on both clean and noisy two-speaker mixtures, without harming the reconstructed speech quality.
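As a hedged sketch of the hybrid loss described above, the following combines a negative SI-SDR term on the waveform with a multi-resolution delta-spectrum term that penalizes mismatched frame-to-frame spectral changes. The FFT sizes, the first-order delta, and the weight `alpha` are illustrative assumptions rather than the paper's exact configuration.

```python
import torch

def si_sdr_loss(est, ref, eps=1e-8):
    # est, ref: (B, T) waveforms; returns negative SI-SDR, to be minimized.
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (torch.sum(est * ref, -1, keepdim=True)
            / (torch.sum(ref ** 2, -1, keepdim=True) + eps)) * ref
    noise = est - proj
    ratio = torch.sum(proj ** 2, -1) / (torch.sum(noise ** 2, -1) + eps)
    return -10 * torch.log10(ratio + eps).mean()

def delta_spectrum_loss(est, ref, fft_sizes=(512, 1024, 2048), eps=1e-8):
    # Penalize differences in frame-to-frame log-magnitude change across several
    # STFT resolutions, encouraging temporal continuity of the extracted speech.
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=est.device)
        def log_mag(x):
            spec = torch.stft(x, n_fft, hop_length=n_fft // 4,
                              window=window, return_complex=True)
            return torch.log(spec.abs() + eps)
        d_est = log_mag(est).diff(dim=-1)  # delta along the time-frame axis
        d_ref = log_mag(ref).diff(dim=-1)
        loss = loss + (d_est - d_ref).abs().mean()
    return loss / len(fft_sizes)

def hybrid_loss(est, ref, alpha=0.5):
    # Waveform-level SI-SDR term plus the frequency-domain continuity term.
    return si_sdr_loss(est, ref) + alpha * delta_spectrum_loss(est, ref)
```

The intuition is that over-suppressed segments show abrupt spectral drops relative to the reference; matching the delta (frame-to-frame change) of the log spectrum at several resolutions penalizes such discontinuities without constraining absolute energy the way a plain spectral loss would.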