ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414133

Audio-Visual Speech Enhancement Method Conditioned in the Lip Motion and Speaker-Discriminative Embeddings

Abstract: We propose an audio-visual speech enhancement (AVSE) method conditioned both on the speaker's lip motion and on speaker-discriminative embeddings. In particular, we explore a method of extracting the embeddings directly from noisy audio in the AVSE setting, without an enrollment procedure. We aim to improve speech-enhancement performance by conditioning the model on these embeddings. To achieve this goal, we devise an audio-visual voice activity detection (AV-VAD) module and a speaker identification module for the AVSE model. …
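For intuition, here is a minimal, hypothetical PyTorch sketch of this kind of conditioning: per-frame lip-motion features and an utterance-level speaker embedding are fused into a mask-estimation enhancer. The layer sizes, the additive fusion, and the assumption that lip features are resampled to the spectrogram frame rate are illustrative choices, not details taken from the paper.

```python
# Minimal sketch (not the authors' code): a mask-based enhancer whose
# bottleneck is conditioned on lip-motion features and a speaker embedding.
import torch
import torch.nn as nn


class ConditionedEnhancer(nn.Module):
    def __init__(self, n_freq=257, lip_dim=128, spk_dim=256, hidden=512):
        super().__init__()
        self.audio_enc = nn.Linear(n_freq, hidden)   # noisy magnitude frames
        self.lip_proj = nn.Linear(lip_dim, hidden)   # per-frame lip-motion features
        self.spk_proj = nn.Linear(spk_dim, hidden)   # utterance-level speaker embedding
        self.rnn = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.mask_out = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_mag, lip_feat, spk_emb):
        # noisy_mag: (B, T, n_freq), lip_feat: (B, T, lip_dim), spk_emb: (B, spk_dim)
        x = self.audio_enc(noisy_mag)
        x = x + self.lip_proj(lip_feat)               # frame-wise visual conditioning
        x = x + self.spk_proj(spk_emb).unsqueeze(1)   # speaker conditioning broadcast over time
        h, _ = self.rnn(x)
        mask = self.mask_out(h)                       # time-frequency mask in [0, 1]
        return mask * noisy_mag                       # enhanced magnitude estimate
```

Additive fusion is only one option; concatenation or FiLM-style modulation would slot into the same place in the sketch.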

Cited by 8 publications (3 citation statements) | References 23 publications (35 reference statements)
“…Our method improved performance by adding residual connections to the encoder to retain detailed information and by extracting more essential face features with an attention mechanism. The STOI and SDR of the proposed model are better than those of the lip-only methods [26,29], which may result from the different visual features. Consequently, the separation performance of the model varies slightly when different visual cues are introduced.…”
Section: Results
confidence: 90%
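As a rough illustration of the two ideas in this statement, the sketch below pairs a residual 1-D convolutional encoder block with attention pooling over per-frame face features. All dimensions and the specific pooling form are assumptions for illustration, not the cited paper's implementation.

```python
# Hypothetical sketch: residual encoder block (detail preservation) and
# attention pooling over per-frame face features.
import torch
import torch.nn as nn


class ResidualEncoderBlock(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
        )

    def forward(self, x):                    # x: (B, C, T)
        return torch.relu(x + self.body(x))  # skip connection keeps fine detail


class FaceAttentionPool(nn.Module):
    """Weights per-frame face features by learned attention scores."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, face_feat):            # face_feat: (B, T, feat_dim)
        w = torch.softmax(self.score(face_feat), dim=1)  # (B, T, 1) attention weights
        return (w * face_feat).sum(dim=1)                # (B, feat_dim) pooled feature
```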
“…Consequently, Wu et al. proposed a lip embedding extractor pre-trained to extract information from the video stream [24], and Lu et al. proposed a model that learned the correspondence between speech and speech fluctuations [25]. Ito et al. conditioned mainly on lip motion and aimed to extract speaker embeddings [26]. They proposed an audio-visual speech enhancement (AVSE) model that leverages a detection module and an identification module to retrieve reliable speaker embeddings.…”
Section: Introduction
confidence: 99%
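One way a detection module can make the retrieved speaker embedding more reliable, as this statement describes, is to weight frame-level embeddings by voice-activity posteriors so that noise-only frames contribute little. The sketch below is a hypothetical illustration; the function name, shapes, and weighting scheme are assumptions, not the paper's method.

```python
# Hypothetical illustration: pool frame-level speaker embeddings using
# audio-visual VAD posteriors, so the utterance-level embedding is
# dominated by frames where the target speaker is actually talking.
import torch


def vad_weighted_speaker_embedding(frame_emb, vad_prob, eps=1e-8):
    """frame_emb: (T, D) per-frame speaker embeddings from noisy audio.
    vad_prob:  (T,)  AV-VAD posteriors in [0, 1]."""
    w = vad_prob / (vad_prob.sum() + eps)           # normalize VAD weights over time
    return (w.unsqueeze(1) * frame_emb).sum(dim=0)  # (D,) utterance-level embedding


# Example usage with random tensors:
emb = vad_weighted_speaker_embedding(torch.randn(100, 256), torch.rand(100))
```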
“…The auxiliary reference can be a prerecorded reference speech signal, in which case the algorithm extracts a speech signal whose voice signature is similar to that of the reference [13][14][15][16][17][18]. A video recording of the target speaker can also serve as such a reference, in which case the algorithm extracts a speech signal that is temporally synchronized with the speaker's motion in the video [19][20][21][22][23][24][25].…”
Section: Introduction
confidence: 99%