ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414023

Muse: Multi-Modal Target Speaker Extraction with Visual Cues

Abstract: A speaker extraction algorithm relies on a speech sample from the target speaker as the reference point to focus its attention. Such a reference speech is typically pre-recorded. On the other hand, the temporal synchronization between speech and lip movement also serves as an informative cue. Motivated by this idea, we study a novel technique that uses speech-lip visual cues to extract the reference target speech directly from the mixture speech at inference time, without the need for pre-recorded reference speech. We…

Cited by 21 publications (9 citation statements)
References 25 publications (49 reference statements)
“…2) Visual encoder: The visual encoder seeks to encode the target's sequence of lip images into a sequence of visual embeddings V(t) ∈ R^{N×T}, representing the target speaker's visemes and in sync with the target's speech. We design the visual encoder with a structure similar to the visual encoder in MuSE [67], which consists of a 3-dimensional (3D) convolution (conv3D), an 18-layer residual convolutional neural network (ResNet-18), five repeated visual temporal convolutional networks (V-TCN), and an up-sampling layer.…”
Section: Speech Encoder
confidence: 99%
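The pipeline named in this excerpt (conv3D → ResNet-18 → 5× V-TCN → up-sampling) can be pictured with a minimal PyTorch sketch. Everything below is an illustrative assumption: the channel widths, the V-TCN internals (approximated here by plain temporal convolution blocks), and the up-sampling factor of 4 (roughly 25 fps video to 100 Hz speech frames) are placeholders, not the exact MuSE configuration.

```python
import torch
import torch.nn as nn
import torchvision


class VisualEncoder(nn.Module):
    """Sketch of a MuSE-style visual encoder:
    conv3D -> ResNet-18 -> 5x V-TCN -> up-sampling.
    All dimensions are assumptions for illustration."""

    def __init__(self, embed_dim=256, n_vtcn=5, upsample=4):
        super().__init__()
        # 3D convolution over (time, height, width) of grayscale lip crops
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2),
                      padding=(2, 3, 3), bias=False),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2),
                         padding=(0, 1, 1)),
        )
        # 2D ResNet-18 trunk applied frame by frame (classification head removed)
        resnet = torchvision.models.resnet18(weights=None)
        # first conv adapted to take the 64-channel conv3D output
        resnet.conv1 = nn.Conv2d(64, 64, kernel_size=7, stride=2,
                                 padding=3, bias=False)
        self.resnet18 = nn.Sequential(*list(resnet.children())[:-1])  # -> 512-d
        # 5 repeated temporal blocks standing in for the V-TCN
        self.vtcn = nn.Sequential(*[
            nn.Sequential(
                nn.Conv1d(512 if i == 0 else embed_dim, embed_dim,
                          kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.BatchNorm1d(embed_dim),
            ) for i in range(n_vtcn)
        ])
        # up-sample video-rate embeddings to the speech frame rate
        self.upsample = nn.Upsample(scale_factor=upsample, mode='nearest')

    def forward(self, lips):               # lips: (B, 1, T_v, H, W)
        x = self.conv3d(lips)              # (B, 64, T_v, H', W')
        B, C, T, H, W = x.shape
        x = x.transpose(1, 2).reshape(B * T, C, H, W)
        x = self.resnet18(x).reshape(B, T, -1).transpose(1, 2)  # (B, 512, T_v)
        v = self.vtcn(x)                   # (B, N, T_v)
        return self.upsample(v)            # V(t): (B, N, T), speech-aligned
```

The final up-sampling step is what makes the visual embeddings frame-synchronous with the speech embeddings, which the next excerpt relies on for fusion.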
“…The inputs to the speaker extractor are the speech embeddings X(t) and the visual embeddings V(t). The studies on the reentry model [5], TDSE [66], and MuSE [67] suggest that, by concatenating the time-aligned visual embeddings with their corresponding speech embeddings, the speaker extractor is able to effectively estimate the mask M(t). We adopt the concatenation approach at the start of the speaker extractor.…”
Section: Speech Encoder
confidence: 99%
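The concatenation fusion described above is simple to sketch. The following PyTorch fragment is a hedged illustration assuming the speech embeddings X(t) and visual embeddings V(t) already share the same frame count T; the mask network is a placeholder stack of 1-D convolutions, not the actual extractor of the reentry model, TDSE, or MuSE.

```python
import torch
import torch.nn as nn


class FusionMaskEstimator(nn.Module):
    """Sketch of the audio-visual fusion step: concatenate time-aligned
    speech embeddings X(t) and visual embeddings V(t) along the feature
    axis, then estimate a mask M(t). Layer sizes are assumptions."""

    def __init__(self, speech_dim=256, visual_dim=256, hidden=512):
        super().__init__()
        self.proj = nn.Conv1d(speech_dim + visual_dim, hidden, kernel_size=1)
        self.mask_net = nn.Sequential(
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, speech_dim, kernel_size=1),
            nn.Sigmoid(),                  # mask values in [0, 1]
        )

    def forward(self, x, v):               # x: (B, D_a, T), v: (B, D_v, T)
        fused = torch.cat([x, v], dim=1)   # feature-wise concat per frame
        m = self.mask_net(self.proj(fused))  # M(t): (B, D_a, T)
        return m * x                       # masked target-speaker embeddings
```

Multiplying the mask element-wise with the speech embeddings retains the target speaker's component; a decoder would then reconstruct the waveform from the masked embeddings.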
“…A number of audio-visual speaker extraction works explore the visual and contextual information by using viseme-phoneme mapping cues [28][29][30]. They encode the lip images into visemes using a visual encoder pre-trained on the lip reading task, in which each viseme maps to multiple phonemes.…”
Section: Introduction
confidence: 99%
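The one-to-many viseme-phoneme relationship mentioned here can be made concrete with a toy mapping. The groupings below are illustrative assumptions (bilabial closure is the classic case: /p/, /b/, and /m/ look alike on the lips), not the inventory used in the cited works.

```python
# Toy viseme-to-phoneme table: one mouth shape (viseme) corresponds to
# several acoustically distinct phonemes, so lip reading alone is ambiguous.
# Groupings are illustrative, not taken from [28]-[30].
VISEME_TO_PHONEMES = {
    "bilabial": ["p", "b", "m"],      # lips pressed together
    "labiodental": ["f", "v"],        # lower lip against upper teeth
    "rounded": ["w", "uw", "ao"],     # rounded, protruded lips
}


def candidate_phonemes(viseme: str) -> list[str]:
    """A viseme constrains, but does not uniquely determine, the phoneme."""
    return VISEME_TO_PHONEMES.get(viseme, [])


print(candidate_phonemes("bilabial"))  # ['p', 'b', 'm']
```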
“…The auxiliary reference can be a pre-recorded reference speech signal, in which case the algorithm extracts a speech signal whose voice signature is similar to that of the reference [13][14][15][16][17][18]. A video recording of the target speaker also serves as such a reference, in which case the algorithm extracts a speech signal that is temporally synchronized with the speaker's motion in the video [19][20][21][22][23][24][25].…”
Section: Introduction
confidence: 99%