Abstract: Speaker diarization systems aim to segment an audio signal into homogeneous sections with only one active speaker and answer the question "who spoke when?" We present a novel approach to speaker diarization exploiting spatial information through robust statistical modeling of Time Difference of Arrival (TDOA) estimates obtained using pairs of microphones. The TDOAs are modeled with Gaussian Mixture Models (GMM) trained in a robust manner with the expectation-conditional maximization algorithm and mino…
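To make the notion of a TDOA estimate from a microphone pair concrete, the following is a minimal sketch that picks the lag maximizing a plain time-domain cross-correlation. It is not the estimator used in the paper, which the abstract does not specify. The function name `estimate_tdoa`, the sampling rate, and the synthetic white-noise source are all assumptions made for illustration.

```python
import math
import random


def estimate_tdoa(x, y, max_lag, fs):
    """Estimate the delay of y relative to x, in seconds.

    Hypothetical helper: searches integer lags in [-max_lag, max_lag] and
    returns the one with the largest cross-correlation (positive: y lags x).
    """
    best_lag, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n in range(len(x)):
            m = n + lag
            if 0 <= m < len(y):
                score += x[n] * y[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / fs


# Synthetic example (assumed values): the second microphone receives the
# same white-noise source 5 samples later than the first.
random.seed(0)
fs = 16000
src = [random.gauss(0.0, 1.0) for _ in range(400)]
delay = 5
mic1 = src
mic2 = [0.0] * delay + src[:-delay]

tdoa = estimate_tdoa(mic1, mic2, max_lag=20, fs=fs)
print(f"estimated TDOA: {tdoa * 1e6:.1f} microseconds")
```

In reverberant or noisy rooms a plain correlation peak is often unreliable, which is one reason the estimates are subsequently modeled statistically rather than used directly.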
“…In this paper, we consider the diarization of audio recordings using spatial features alone. Several solutions have been proposed utilizing spatial features, which use the time-difference-of-arrival (TDOA) features [4,5,6,7]. However, the estimation of TDOA is sensitive to reverberation and noise.…”
Diarization of audio recordings from ad-hoc mobile devices using spatial information is considered in this paper. A two-channel synchronous recording is assumed for each mobile device, which is used to compute directional statistics separately at each device in a frame-wise manner. The recordings across the mobile devices are asynchronous, but a coarse synchronization is performed by aligning the signals using acoustic events or a real-time clock. Direction statistics computed for all the devices are then modeled jointly using a Dirichlet mixture model, and the posterior probability over the mixture components is used to derive the diarization information. Experiments on real-life recordings using mobile phones show a diarization error rate of less than 14%.
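The last modeling step, deriving diarization labels from the posterior over mixture components, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each frame is assumed to be a vector on the probability simplex (for example, a normalized direction histogram), and the mixture weights and Dirichlet parameters below are made-up values rather than parameters fitted to data.

```python
import math


def dirichlet_logpdf(x, alpha):
    """Log density of a Dirichlet distribution at a point x on the simplex."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(v) for a, v in zip(alpha, x))


def posteriors(x, weights, alphas):
    """Posterior probability of each mixture component given one frame x."""
    log_joint = [
        math.log(w) + dirichlet_logpdf(x, a) for w, a in zip(weights, alphas)
    ]
    top = max(log_joint)  # subtract the maximum for numerical stability
    unnorm = [math.exp(v - top) for v in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def diarize(frames, weights, alphas):
    """Assign each frame to the component with the highest posterior."""
    labels = []
    for x in frames:
        post = posteriors(x, weights, alphas)
        labels.append(max(range(len(post)), key=post.__getitem__))
    return labels


# Hypothetical two-speaker model over three direction bins (assumed values).
weights = [0.5, 0.5]
alphas = [[8.0, 1.5, 1.5], [1.5, 1.5, 8.0]]

# Hypothetical frame-wise direction statistics, each summing to 1.
frames = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
]

labels = diarize(frames, weights, alphas)
print(labels)
```

In a full system the weights and Dirichlet parameters would be estimated from the recordings, and the frame-wise labels would typically be smoothed over time before speaker segments are reported.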