2022
DOI: 10.1109/lsp.2022.3175130

Speaker Extraction With Co-Speech Gestures Cue

Abstract: Speaker extraction seeks to extract the clean speech of a target speaker from a multi-talker speech mixture. There have been studies that use a pre-recorded speech sample or a face image of the target speaker as the speaker cue. In human communication, co-speech gestures that are naturally timed with speech also contribute to speech perception. In this work, we explore the use of a co-speech gesture sequence, e.g. hand and body movements, as the speaker cue for speaker extraction, which could be easily obtained fro…
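The abstract describes conditioning speaker extraction on a co-speech gesture sequence instead of a pre-recorded voice or face cue. Below is a minimal, illustrative PyTorch sketch of that idea; the architecture, layer sizes, and the assumption that keypoints are pre-aligned to the spectrogram frame rate are ours, not the paper's.

# A minimal sketch (not the paper's architecture) of gesture-cued speaker
# extraction. All module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class GestureCuedExtractor(nn.Module):
    """Extract a target speaker's speech from a mixture, conditioned on a
    co-speech gesture sequence (e.g., body-keypoint frames)."""

    def __init__(self, n_freq=257, n_keypoints=50, hidden=256):
        super().__init__()
        # Encode the mixture magnitude spectrogram, frame by frame.
        self.speech_enc = nn.LSTM(n_freq, hidden, batch_first=True)
        # Encode the 2-D gesture keypoint sequence into a frame-level cue.
        self.gesture_enc = nn.LSTM(n_keypoints * 2, hidden, batch_first=True)
        # Predict a soft time-frequency mask from the fused representation.
        self.mask_head = nn.Sequential(nn.Linear(hidden * 2, n_freq), nn.Sigmoid())

    def forward(self, mix_spec, gestures):
        # mix_spec: (batch, frames, n_freq) mixture magnitude spectrogram
        # gestures: (batch, frames, n_keypoints * 2) keypoint coordinates,
        #           assumed pre-aligned to the spectrogram frame rate
        s, _ = self.speech_enc(mix_spec)
        g, _ = self.gesture_enc(gestures)
        fused = torch.cat([s, g], dim=-1)   # frame-level fusion of both streams
        mask = self.mask_head(fused)        # soft mask in [0, 1]
        return mask * mix_spec              # estimated target magnitude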

Cited by 13 publications (6 citation statements)
References 46 publications
“…Fig. 3: Structured taxonomy of the existing BL research, which includes three genres. [The quoted span is the figure's flattened method-label list, from Text2sign [21] through VHTHG [50].]…”
Section: Sign Language Recognition (mentioning; confidence: 99%)
“…CoS signals not only play a crucial role in enhancing the clarity, expressiveness and emotional content of verbal communication, but also capture the rich communicative context and reveal the speaker's social identity and cultural affiliation [48]. Therefore, there is a growing trend toward exploring multi-modal approaches that take into account both the visual information from gestures and the accompanying speech signals, allowing more comprehensive and accurate analysis in areas such as emotion recognition and dialogue understanding [75].…”
Section: Co-speech (mentioning; confidence: 99%)
“…Over the past decade, there has been remarkable progress in equipping machines with this ability, paving the way for their integration into hearing aids. These algorithms usually rely on a reference cue for the to-be-extracted target speech signal, commonly given as a speech signal [7]-[12], or based on a different modality such as face [13]-[15], text [16], [17], or even gesture [18] information. However, in real-world conversational situations, such cues may be hard to acquire, and their quality may vary due to occlusion, changes in lighting, body movement, or interfering signals.…”
Section: Introduction (mentioning; confidence: 99%)
“…Despite this great success, there is inherent ambiguity in the speaker labeling of the separated signals. An auxiliary reference, such as a pre-recorded speech signal [13]-[16] or a video frame sequence [17]-[20], can be used to resolve the speaker ambiguity. The speaker extraction algorithm uses such an auxiliary reference to form top-down attention on the target speaker and extract their speech.…”
Section: Introduction (mentioning; confidence: 99%)
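The statement above describes an auxiliary reference forming top-down attention on the target speaker. A minimal sketch of one common way to realize such conditioning, FiLM-style feature modulation, assuming PyTorch; the class name and dimensions are illustrative and not taken from the cited works.

# A minimal sketch of cue-conditioned ("top-down") feature gating. The
# reference embedding could come from a speaker encoder, a face network,
# or a gesture encoder as in the paper above; this fusion choice is ours.
import torch
import torch.nn as nn

class ReferenceGate(nn.Module):
    def __init__(self, feat_dim=256, emb_dim=128):
        super().__init__()
        self.scale = nn.Linear(emb_dim, feat_dim)  # per-channel gain
        self.shift = nn.Linear(emb_dim, feat_dim)  # per-channel bias

    def forward(self, mix_feats, ref_emb):
        # mix_feats: (batch, frames, feat_dim) encoded mixture features
        # ref_emb:   (batch, emb_dim) embedding of the auxiliary reference
        gamma = self.scale(ref_emb).unsqueeze(1)   # broadcast over frames
        beta = self.shift(ref_emb).unsqueeze(1)
        return gamma * mix_feats + beta            # cue-conditioned features

The modulated features then feed a mask or signal estimator for the target speaker only, which is what removes the labeling ambiguity that blind separation leaves unresolved.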