A new species of Cyrtodactylus (Squamata: Gekkonidae) from northwestern Vietnam

Target speech extraction, which extracts a single target source in a mixture given clues about the target speaker, has attracted increasing attention. We have recently proposed SpeakerBeam, which exploits an adaptation utterance of the target speaker to extract his/her voice characteristics that are then used to guide a neural network towards extracting speech of that speaker. SpeakerBeam presents a practical alternative to speech separation as it enables tracking speech of a target speaker across utterances, and achieves promising speech extraction performance. However, it sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures, because it is difficult to discriminate the target speaker from the interfering speakers. In this paper, we investigate strategies for improving the speaker discrimination capability of SpeakerBeam. First, we propose a time-domain implementation of SpeakerBeam similar to that proposed for a time-domain audio separation network (TasNet), which has achieved state-of-the-art performance for speech separation. Besides, we investigate (1) the use of spatial features to better discriminate speakers when microphone array recordings are available, (2) adding an auxiliary speaker identification loss for helping to learn more discriminative voice characteristics. We show experimentally that these strategies greatly improve speech extraction performance, especially for same-gender mixtures, and outperform Tas-Net in terms of target speech extraction.

show abstract

Integrating End-to-End Neural and Clustering-Based Diarization: Getting the Best of Both Worlds

Kinoshita

Delcroix

Tawara

2021

View full text Add to dashboard Cite

Recent diarization technologies can be categorized into two approaches, i.e., clustering and end-to-end neural approaches, which have different pros and cons. The clustering-based approaches assign speaker labels to speech regions by clustering speaker embeddings such as x-vectors. While it can be seen as a current stateof-the-art approach that works for various challenging data with reasonable robustness and accuracy, it has a critical disadvantage that it cannot handle overlapped speech that is inevitable in natural conversational data. In contrast, the end-to-end neural diarization (EEND), which directly predicts diarization labels using a neural network, was devised to handle the overlapped speech. While the EEND, which can easily incorporate emerging deep-learning technologies, has started outperforming the x-vector clustering approach in some realistic database, it is difficult to make it work for long recordings (e.g., recordings longer than 10 minutes) because of, e.g., its huge memory consumption. Block-wise independent processing is also difficult because it poses an inter-block label permutation problem, i.e., an ambiguity of the speaker label assignments between blocks. In this paper, we propose a simple but effective hybrid diarization framework that works with overlapped speech and for long recordings containing an arbitrary number of speakers. It modifies the conventional EEND framework to output global speaker embeddings so that speaker clustering can be performed across blocks based on a constrained clustering algorithm to solve the permutation problem. With experiments based on simulated noisy reverberant 2-speaker meeting-like data, we show that the proposed framework works significantly better than the original EEND especially when the input data is long.

show abstract

Fully Bayesian inference of multi-mixture Gaussian model and its evaluation using speaker clustering

Tawara

Ogawa

Watanabe

et al. 2012

View full text Add to dashboard Cite

Speaker Invariant Feature Extraction for Zero-Resource Languages with Adversarial Learning

Tsuchiya

Tawara

Ogawa

et al. 2018

View full text Add to dashboard Cite

Frame-Level Phoneme-Invariant Speaker Embedding for Text-Independent Speaker Recognition on Extremely Short Utterances

Tawara

Ogawa

Iwata

et al. 2020

View full text Add to dashboard Cite

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

hi@scite.ai

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Naohiro Tawara

Improving Speaker Discrimination of Target Speech Extraction With Time-Domain Speakerbeam

Integrating End-to-End Neural and Clustering-Based Diarization: Getting the Best of Both Worlds

Fully Bayesian inference of multi-mixture Gaussian model and its evaluation using speaker clustering

Speaker Invariant Feature Extraction for Zero-Resource Languages with Adversarial Learning

Frame-Level Phoneme-Invariant Speaker Embedding for Text-Independent Speaker Recognition on Extremely Short Utterances

Contact Info

Product

Resources

About