ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414133

Audio-Visual Speech Enhancement Method Conditioned in the Lip Motion and Speaker-Discriminative Embeddings

Abstract: We propose an audio-visual speech enhancement (AVSE) method conditioned both on the speaker's lip motion and on speaker-discriminative embeddings. In particular, we explore a method of extracting the embeddings directly from noisy audio in the AVSE setting, without an enrollment procedure. We aim to improve speech-enhancement performance by conditioning the model on these embeddings. To achieve this goal, we devise an audio-visual voice activity detection (AV-VAD) module and a speaker identification module for the AVSE model. …
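For intuition, here is a minimal, hypothetical PyTorch sketch of this kind of conditioning: per-frame lip-motion features and an utterance-level speaker embedding are fused into a mask-estimation enhancer. The layer sizes, the additive fusion, and the assumption that lip features are resampled to the spectrogram frame rate are illustrative choices, not details taken from the paper.

```python
# Minimal sketch (not the authors' code): a mask-based enhancer whose
# bottleneck is conditioned on lip-motion features and a speaker embedding.
import torch
import torch.nn as nn


class ConditionedEnhancer(nn.Module):
    def __init__(self, n_freq=257, lip_dim=128, spk_dim=256, hidden=512):
        super().__init__()
        self.audio_enc = nn.Linear(n_freq, hidden)   # noisy magnitude frames
        self.lip_proj = nn.Linear(lip_dim, hidden)   # per-frame lip-motion features
        self.spk_proj = nn.Linear(spk_dim, hidden)   # utterance-level speaker embedding
        self.rnn = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.mask_out = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_mag, lip_feat, spk_emb):
        # noisy_mag: (B, T, n_freq), lip_feat: (B, T, lip_dim), spk_emb: (B, spk_dim)
        x = self.audio_enc(noisy_mag)
        x = x + self.lip_proj(lip_feat)               # frame-wise visual conditioning
        x = x + self.spk_proj(spk_emb).unsqueeze(1)   # speaker conditioning broadcast over time
        h, _ = self.rnn(x)
        mask = self.mask_out(h)                       # time-frequency mask in [0, 1]
        return mask * noisy_mag                       # enhanced magnitude estimate
```

Additive fusion is only one option; concatenation or FiLM-style modulation would slot into the same place in the sketch.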

Cited by 8 publications (3 citation statements) | References 23 publications (35 reference statements)
“…Our method improved performance by adding residual connections to the encoder to retain detailed information and by extracting more essential face features with an attention mechanism. The STOI and SDR of the proposed model are better than those of the lip-only methods [26,29], which may result from the different visual features. Consequently, the separation performance of the model varies slightly when different visual cues are introduced.…”
Section: Results
confidence: 90%
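As a rough illustration of the two ideas in this statement, the sketch below pairs a residual 1-D convolutional encoder block with attention pooling over per-frame face features. All dimensions and the specific pooling form are assumptions for illustration, not the cited paper's implementation.

```python
# Hypothetical sketch: residual encoder block (detail preservation) and
# attention pooling over per-frame face features.
import torch
import torch.nn as nn


class ResidualEncoderBlock(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(channels),
        )

    def forward(self, x):                    # x: (B, C, T)
        return torch.relu(x + self.body(x))  # skip connection keeps fine detail


class FaceAttentionPool(nn.Module):
    """Weights per-frame face features by learned attention scores."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, face_feat):            # face_feat: (B, T, feat_dim)
        w = torch.softmax(self.score(face_feat), dim=1)  # (B, T, 1) attention weights
        return (w * face_feat).sum(dim=1)                # (B, feat_dim) pooled feature
```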
“…Consequently, Wu et al. proposed a lip embedding extractor pre-trained to extract information from the video stream [24], and Lu et al. proposed a model that learned the correspondence between speech and speech fluctuations [25]. Ito et al. conditioned mainly on lip motion and aimed to extract speaker embeddings [26]. They proposed an audio-visual speech enhancement (AVSE) model that leverages a detection module and an identification module to retrieve reliable speaker embeddings.…”
Section: Introduction
confidence: 99%
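One way a detection module can make the retrieved speaker embedding more reliable, as this statement describes, is to weight frame-level embeddings by voice-activity posteriors so that noise-only frames contribute little. The sketch below is a hypothetical illustration; the function name, shapes, and weighting scheme are assumptions, not the paper's method.

```python
# Hypothetical illustration: pool frame-level speaker embeddings using
# audio-visual VAD posteriors, so the utterance-level embedding is
# dominated by frames where the target speaker is actually talking.
import torch


def vad_weighted_speaker_embedding(frame_emb, vad_prob, eps=1e-8):
    """frame_emb: (T, D) per-frame speaker embeddings from noisy audio.
    vad_prob:  (T,)  AV-VAD posteriors in [0, 1]."""
    w = vad_prob / (vad_prob.sum() + eps)           # normalize VAD weights over time
    return (w.unsqueeze(1) * frame_emb).sum(dim=0)  # (D,) utterance-level embedding


# Example usage with random tensors:
emb = vad_weighted_speaker_embedding(torch.randn(100, 256), torch.rand(100))
```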
“…The auxiliary reference can be a prerecorded reference speech signal, in which case the algorithm extracts a speech signal whose voice signature is similar to that of the reference [13][14][15][16][17][18]. A video recording of the target speaker can also serve as such a reference, in which case the algorithm extracts a speech signal that is temporally synchronized with the speaker's motion in the video [19][20][21][22][23][24][25].…”
Section: Introduction
confidence: 99%