Rohan Kumar Das scite author profile

Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers. Successful ASD depends on accurate interpretation of short-term and long-term audio and visual information, as well as audio-visual interaction. Unlike prior work, in which systems make decisions instantaneously using short-term features, we propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration. TalkNet consists of audio and visual temporal encoders for feature representation, an audio-visual cross-attention mechanism for inter-modality interaction, and a self-attention mechanism to capture long-term speaking evidence. Experiments demonstrate that TalkNet achieves 3.5% and 2.2% improvement over the state-of-the-art systems on the AVA-ActiveSpeaker dataset and Columbia ASD dataset, respectively. Code has been made available at: https://github.com/TaoRuijie/TalkNet_ASD. CCS Concepts: Information systems → Speech / audio search.
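The cross-attention idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not TalkNet's implementation (see the linked repository for that); it assumes plain scaled dot-product attention with no learned projections, and the function names and toy feature values are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query frame attends over all key/value frames.

    Assumption: no learned projection matrices, which a real model would have.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query frame to every key frame, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value frames.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy per-frame features (hypothetical values): 3 frames, 2 dimensions each.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]

# One modality supplies the queries, the other supplies keys and values.
audio_attended = cross_attention(audio, visual, visual)
visual_attended = cross_attention(visual, audio, audio)
```

Each output frame is a weighted average of the other modality's frames, so it stays in the same feature range as the inputs.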


Cross-lingual Voice Conversion with Bilingual Phonetic Posteriorgram and Average Modeling

Zhou

Tian

et al. 2019


The INTERSPEECH 2020 Far-Field Speaker Verification Challenge

Qin

et al. 2020


The INTERSPEECH 2020 Far-Field Speaker Verification Challenge (FFSVC 2020) addresses three different research problems under well-defined conditions: far-field text-dependent speaker verification from a single microphone array, far-field text-independent speaker verification from a single microphone array, and far-field text-dependent speaker verification from distributed microphone arrays. All three tasks pose a cross-channel challenge to the participants. To simulate a real-life scenario, the enrollment utterances are recorded from a close-talk cellphone, while the test utterances are recorded from the far-field microphone arrays. In this paper, we describe the database, the challenge, and the baseline system, which is based on a ResNet-based deep speaker network with cosine similarity scoring. For a given utterance, the speaker embeddings of different channels are equally averaged as the final embedding. The baseline system achieves minDCFs of 0.62, 0.66, and 0.64 and EERs of 6.27%, 6.55%, and 7.18% for task 1, task 2, and task 3, respectively.
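The scoring step described in the abstract above (equal averaging of per-channel embeddings followed by cosine similarity) can be sketched as follows. This is an illustration only, not the challenge's baseline code: the function names are hypothetical and the embeddings are toy values standing in for the output of a speaker network.

```python
import math

def average_embeddings(channel_embeddings):
    """Equal-weight average of per-channel embeddings (one list of floats per channel)."""
    if not channel_embeddings:
        raise ValueError("need at least one channel embedding")
    n = len(channel_embeddings)
    dim = len(channel_embeddings[0])
    return [sum(e[i] for e in channel_embeddings) / n for i in range(dim)]

def cosine_score(a, b):
    """Cosine similarity between two embeddings; higher suggests the same speaker."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)

# Toy values (assumed): one close-talk enrollment embedding and
# two far-field channel embeddings for a single test utterance.
enroll = [1.0, 0.0, 1.0]
test_channels = [[1.0, 0.2, 0.8], [0.8, -0.2, 1.2]]

test_embedding = average_embeddings(test_channels)
score = cosine_score(enroll, test_embedding)
```

In a verification system the score would then be compared against a threshold to accept or reject the claimed identity; the threshold choice is what metrics such as EER and minDCF summarize.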


Lignin-based benzoxazines: A tunable key-precursor for the design of hydrophobic coatings, fire resistant materials and catalyst-free vitrimers

Adjaoud

Puchot

Federico

et al. 2023

Chemical Engineering Journal


Voice Conversion Challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion

Zhao

Huang

Tian

et al. 2020

Preprint


Significance of Subband Features for Synthetic Speech Detection

Yang

Das

2020

IEEE Trans. Inform. Forensic Secur.


Self-supervised Speaker Recognition with Loss-gated Learning

Tao

Lee

Das

et al. 2021

Preprint





Spoof Detection Using Source, Instantaneous Frequency and Cepstral Features

Is Someone Speaking?
