ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9747019
Cross-Channel Attention-Based Target Speaker Voice Activity Detection: Experimental Results for the M2MeT Challenge

Cited by 17 publications (6 citation statements)
References 20 publications
“…In addition, it can provide more stable performance under different conditions; e.g., it can still show satisfying performance with a block length of 2 s on the Test set. Actually, compared with the offline TS-VAD [79], the improvement of the multi-channel extension in online VAD is moderate; the reason is that we reduced the encoder size to ensure enough GPU memory for training (3 layers, 4 heads vs. 6 layers, 8 heads).…”
Section: Results
confidence: 99%
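The quoted encoder-size reduction (3 layers, 4 heads vs. 6 layers, 8 heads) can be put in perspective with a back-of-the-envelope parameter count for a standard Transformer encoder. A minimal sketch, assuming a hypothetical model dimension of 256 and feed-forward dimension of 1024 (neither value is stated in the excerpt):

```python
# Rough parameter count for a stack of standard Transformer encoder
# layers, to illustrate the "3 layers vs. 6 layers" size reduction
# mentioned in the excerpt. d_model=256 and d_ff=1024 are hypothetical
# illustrative values, not taken from the cited paper.

def encoder_params(n_layers: int, d_model: int, d_ff: int) -> int:
    """Approximate parameter count of n_layers Transformer encoder layers."""
    # Multi-head self-attention: Q, K, V and output projections with biases.
    # The head count does not affect this total, since heads split d_model.
    attn = 4 * (d_model * d_model + d_model)
    # Position-wise feed-forward network: two linear layers with biases.
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms per layer, each with scale and bias vectors.
    norms = 2 * (2 * d_model)
    return n_layers * (attn + ffn + norms)

small = encoder_params(3, 256, 1024)  # smaller online encoder
large = encoder_params(6, 256, 1024)  # larger offline encoder
print(small, large)
```

Under these assumptions halving the layer count exactly halves the encoder's parameters, while halving the head count changes nothing; the memory saving the excerpt describes comes from the shallower stack (plus the correspondingly smaller activations during training).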
“…IV shows the comparison with other systems on the AliMeeting dataset. For offline systems, we show the official baseline [72] and the winner's system [79]. For online systems, we did not find another online system evaluated on the AliMeeting dataset, but we can directly compare ours with the offline TS-VAD system.…”
Section: B. Comparison With Other Offline and Online Systems
confidence: 99%
“…Furthermore, the TS-VAD framework has been investigated for multi-channel signals [65], vision-guided systems [36], and online inference [66], [67]. Integrating features of both TS-VAD and EEND methods into a single system has also become a popular trend [20], [55], [68], [69].…”
Section: Target-Speaker Voice Activity Detection
confidence: 99%
“…Dinkel et al. identify that traditional VAD algorithms are trained on data devoid of acoustic distortions, and their usage is therefore limited to data without the distortions that are inevitable in the real world, rendering them unable to perform well in real-world settings. Other works on VAD include Wang et al. [44], which uses a cross-channel attention-based model for voice activity detection in the M2MeT challenge, and Braun et al. [5], which specifically addresses the robustness of many state-of-the-art models. It is worth noting that some works developed for other purposes, such as transcription, can be used as voice activity detection models.…”
Section: Voice Activity Detection
confidence: 99%