2021
DOI: 10.1109/taslp.2021.3078883

Audio-Visual Multi-Channel Integration and Recognition of Overlapped Speech

Cited by 20 publications (24 citation statements)
References 62 publications
“…Following the previous researches on audio-visual multi-channel speech separation [35,36], temporal convolutional networks (TCNs) [39] are used in the speech separation module. As shown in the left corner of Figure 1, the log-power spectrum (LPS) features of the reference microphone channel were initially concatenated with the IPDs and AF features computed above before being fed into the TCN based audio block to compute the audio embedding.…”
Section: Audio and Visual Modality Inputs
Citation type: mentioning
confidence: 99%
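The feature concatenation described in the statement above can be illustrated with a minimal NumPy sketch. It is not the cited authors' implementation: the function names (`stft`, `lps_ipd_features`), the FFT size, hop length, and microphone pairing are all assumptions chosen for illustration, and the angle features (AF) are omitted because they depend on a target direction that the quoted text does not specify.

```python
import numpy as np


def stft(x, n_fft=512, hop=256):
    """Naive STFT of a 1-D signal (hypothetical helper).

    Returns a complex array of shape (frames, n_fft // 2 + 1).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop: i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)


def lps_ipd_features(signals, ref=0, pairs=((0, 1),), n_fft=512, hop=256, eps=1e-8):
    """Concatenate LPS of a reference channel with IPDs of microphone pairs.

    signals: array of shape (channels, samples).
    Returns an array of shape (frames, F * (1 + len(pairs))), F = n_fft // 2 + 1.
    All parameter defaults are illustrative assumptions.
    """
    spec = np.stack([stft(ch, n_fft, hop) for ch in signals])  # (C, T, F)

    # Log-power spectrum of the reference microphone channel.
    lps = np.log(np.abs(spec[ref]) ** 2 + eps)

    # Inter-channel phase difference per pair, wrapped to (-pi, pi].
    ipds = [np.angle(spec[i] * np.conj(spec[j])) for i, j in pairs]

    # Stack along the feature axis before feeding an audio embedding network.
    return np.concatenate([lps] + ipds, axis=-1)
```

With a two-channel input of 4096 samples and the defaults above, the result has 15 frames and 514 features per frame (257 LPS bins plus 257 IPD bins). In the pipeline the statement describes, a matrix of this kind, extended with the AF features, is what the TCN-based audio block would consume.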
“…Before fusing the visual features with the audio embedding to improve the estimation, the lip features are firstly fed into the visual block containing 5 TCNs (Figure 1, bottom left in grey) to compute the visual embedding. Audio-visual modality fusion: In this work, a factorised attention-based modality fusion method consistent with our previous work [35] was utilised in the separation module. This attention based fusion block (Figure 1, left middle in dark brown) combines the audio and visual embeddings from the outputs of the audio and visual TCN embedding blocks respectively.…”
Section: Audio and Visual Modality Inputs
Citation type: mentioning
confidence: 99%