ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9747019
Cross-Channel Attention-Based Target Speaker Voice Activity Detection: Experimental Results for the M2MeT Challenge

Cited by 17 publications (6 citation statements)
References 20 publications
“…In addition, it can provide more stable performance under different conditions; e.g., it can still show satisfying performance with a block length of 2 s on the Test set. Actually, compared with the offline TS-VAD [79], the improvement of the multi-channel extension in online VAD is moderate; the reason is that we reduced the encoder size to ensure enough GPU memory for training (3 layers, 4 heads vs. 6 layers, 8 heads).…”
Section: Results
confidence: 99%
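The quoted encoder-size reduction (3 layers, 4 heads vs. 6 layers, 8 heads) can be put in perspective with a back-of-the-envelope parameter count for a standard Transformer encoder. A minimal sketch, assuming a hypothetical model dimension of 256 and feed-forward dimension of 1024 (neither value is stated in the excerpt):

```python
# Rough parameter count for a stack of standard Transformer encoder
# layers, to illustrate the "3 layers vs. 6 layers" size reduction
# mentioned in the excerpt. d_model=256 and d_ff=1024 are hypothetical
# illustrative values, not taken from the cited paper.

def encoder_params(n_layers: int, d_model: int, d_ff: int) -> int:
    """Approximate parameter count of n_layers Transformer encoder layers."""
    # Multi-head self-attention: Q, K, V and output projections with biases.
    # The head count does not affect this total, since heads split d_model.
    attn = 4 * (d_model * d_model + d_model)
    # Position-wise feed-forward network: two linear layers with biases.
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two LayerNorms per layer, each with scale and bias vectors.
    norms = 2 * (2 * d_model)
    return n_layers * (attn + ffn + norms)

small = encoder_params(3, 256, 1024)  # smaller online encoder
large = encoder_params(6, 256, 1024)  # larger offline encoder
print(small, large)
```

Under these assumptions halving the layer count exactly halves the encoder's parameters, while halving the head count changes nothing; the memory saving the excerpt describes comes from the shallower stack (plus the correspondingly smaller activations during training).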
“…IV shows the comparison with other systems on the AliMeeting dataset. For offline systems, we show the official baseline [72] and the winner's system [79]. For online systems, we did not find another online system evaluated on the AliMeeting dataset, but we can directly compare ours with the offline TS-VAD system.…”
Section: B. Comparison With Other Offline and Online Systems
confidence: 99%
“…Furthermore, the TS-VAD framework has been investigated for multi-channel signals [65], vision-guided systems [36], and online inference [66], [67]. Integrating features of both TS-VAD and EEND methods into a single system has also become a popular trend [20], [55], [68], [69].…”
Section: Target-Speaker Voice Activity Detection
confidence: 99%
“…Dinkel et al. identify that traditional VAD algorithms are trained on data devoid of acoustic distortions, and their usage is therefore limited to data without the distortions that are inevitable in the real world, rendering them unable to perform well in real-world settings. Other works on VAD include Wang et al. [44], which uses a cross-channel attention-based model for voice activity detection in the M2MeT challenge, and Braun et al. [5], which specifically addresses the robustness of many state-of-the-art models. It is worth noting that some works developed for other purposes, such as transcription, can be used as voice activity detection models.…”
Section: Voice Activity Detection
confidence: 99%