2022
DOI: 10.1109/lsp.2022.3175130
Speaker Extraction With Co-Speech Gestures Cue

Abstract: Speaker extraction seeks to extract the clean speech of a target speaker from multi-talker mixture speech. Prior studies have used a pre-recorded speech sample or a face image of the target speaker as the speaker cue. In human communication, co-speech gestures that are naturally timed with speech also contribute to speech perception. In this work, we explore the use of the co-speech gesture sequence, e.g. hand and body movements, as the speaker cue for speaker extraction, which could be easily obtained fro…
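The cue-conditioned extraction idea in the abstract (a mixture signal plus an auxiliary speaker cue driving a mask over the mixture) can be sketched as below. The frame features, the mean-pooled gesture embedding, and the multiplicative mask fusion are illustrative assumptions for this sketch, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def cue_embedding(gesture_seq):
    # Hypothetical cue encoder: average-pool the gesture keypoint
    # sequence (T, D) into a single speaker-cue vector (D,).
    return gesture_seq.mean(axis=0)

def extract(mixture_frames, gesture_seq, W):
    # mixture_frames: (T, D) frame features of the mixture speech.
    # The cue vector modulates a sigmoid mask over the mixture frames
    # (multiplicative fusion) -- a common conditioning pattern,
    # assumed here for illustration.
    cue = cue_embedding(gesture_seq)      # (D,)
    logits = mixture_frames * (W @ cue)   # broadcast fusion over frames
    mask = 1.0 / (1.0 + np.exp(-logits))  # per-element mask in (0, 1)
    return mask * mixture_frames          # masked (extracted) frames

T, D = 50, 16
mix = rng.standard_normal((T, D))
gestures = rng.standard_normal((T, D))
W = rng.standard_normal((D, D)) * 0.1
est = extract(mix, gestures, W)
print(est.shape)  # (50, 16)
```

Because the mask lies in (0, 1), the extracted frames never exceed the mixture in magnitude; a real system would learn `W` (and far richer encoders) from paired mixture/clean data.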

Cited by 8 publications (4 citation statements)
References 46 publications (38 reference statements)
“…Fig. 3: Structured taxonomy of the existing BL research, which includes three genres.…”
Section: Sign Language Recognition
confidence: 99%
“…CoS signals not only play a crucial role in enhancing the clarity, expressiveness, and emotional content of verbal communication, but also capture the rich communicative context and reveal the speaker's social identity and cultural affiliation [48]. There is therefore a growing trend towards multi-modal approaches that take into account both the visual information from gestures and the accompanying speech signals, allowing more comprehensive and accurate analysis in areas such as emotion recognition and dialogue understanding [75].…”
Section: Co-speech
confidence: 99%
“…Although highly successful, such methods leave inherent ambiguity in the speaker labeling of the separated signals. An auxiliary reference, such as a pre-recorded speech signal [13][14][15][16] or a video frame sequence [17][18][19][20], can be used to resolve this speaker ambiguity. The speaker extraction algorithm employs such an auxiliary reference to form top-down attention on the target speaker and extract its speech.…”
Section: Introduction
confidence: 99%
“…A recent work [23] uses audio features to select relevant visual features, in combination with an attention mechanism and a data augmentation strategy, in order to address the low-resolution, lip-occlusion, and out-of-sync problems altogether. The SEG network [24] avoids the need for lip recordings, since it utilizes the co-speech gesture cue from an upper-body video recording as a reference, which is less prone to occlusions.…”
Section: Introduction
confidence: 99%
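The "audio features select relevant visual features" mechanism cited above can be sketched as scaled dot-product cross-attention, with audio frames as queries over visual frames. The shapes, the shared feature dimension, and the single-head formulation are assumptions of this sketch, not details from [23].

```python
import numpy as np

def cross_attention(audio_q, visual_kv):
    # Audio-queried attention over visual features: each audio frame
    # (row of audio_q, shape (Ta, D)) attends over all visual frames
    # (visual_kv, shape (Tv, D)) via scaled dot-product attention.
    D = audio_q.shape[-1]
    scores = audio_q @ visual_kv.T / np.sqrt(D)    # (Ta, Tv) similarities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over visual frames
    return weights @ visual_kv                     # (Ta, D) selected features

rng = np.random.default_rng(1)
audio = rng.standard_normal((20, 8))   # 20 audio frames, dim 8
visual = rng.standard_normal((30, 8))  # 30 visual frames, dim 8
out = cross_attention(audio, visual)
print(out.shape)  # (20, 8)
```

Each output row is a convex combination of the visual frames, so an occluded or out-of-sync visual frame can simply receive low attention weight.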