Speaker Change Detection Using Fundamental Frequency with Application to Multi-talker Segmentation

IEEE/ACM Trans. Audio Speech Lang. Process.

Moore et al. 2021

Self Cite

This paper demonstrates how the harmonic structure of voiced speech can be exploited to segment multiple overlapping speakers in a speaker diarization task. We explore how a change in the speaker can be inferred from a change in pitch. We show that voiced harmonics can be useful in detecting when more than one speaker is talking, such as during overlapping speaker activity. A novel system is proposed to track multiple harmonics simultaneously, allowing for the determination of onsets and end-points of a speaker's utterance in the presence of an additional active speaker. This system is benchmarked against a segmentation system from the literature that employs a bidirectional long short-term memory (BLSTM) network approach and requires training. Experimental results show that the proposed approach outperforms the BLSTM baseline by 12.9% in terms of HIT rate for speaker segmentation. We also show that the estimated pitch tracks of our system can be used as input features to the BLSTM, yielding further improvements of 1.21% in terms of coverage and 2.45% in terms of purity.

Section: B Temporal Variations In Fundamental Frequency (mentioning)

confidence: 99%

“…Speaker change detection is achieved by exploiting the temporal variations in the pitch. To accomplish this the method of [38], that operates using a single track, is utilised here. It was shown in Section III-C that multiple tracks can be generated for the same speaker.…”

Section: A Exp-1: Full Segmentation Using Proposed System (mentioning)

confidence: 99%

Overlapping Speaker Segmentation Using Multiple Hypothesis Tracking of Fundamental Frequency

IEEE/ACM Trans. Audio Speech Lang. Process.

Moore et al. 2021

Self Cite

“…To obtain the full segmentation it is also necessary to detect speaker changes, ct, when the onset of a new speaker happens after the end-point of the previous speaker. To accomplish this the authors' previous method, that operates using a single track, is further developed here [21]. Speaker change detection is achieved by exploiting the temporal variations in the pitch.…”

Section: Speaker Change Detection (mentioning)

confidence: 99%

“…(d) Segments of overlapping speech are identified as the onsets and end-points of multiple, uncorrelated pitch tracks. (e) The complete segmentation is obtained from the union of overlapping speech onsets with the speaker changes detected based on a model of the temporal variation of pitch [21].…”

Section: Introduction (mentioning)

confidence: 99%

Multiple Hypothesis Tracking for Overlapping Speaker Segmentation

2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

Naylor 2019

Self Cite

Speaker segmentation is an essential part of any diarization system. Applications of diarization include tasks such as speaker indexing, improving automatic speech recognition (ASR) performance and making single speaker-based algorithms available for use in multi-speaker environments. This paper proposes a multiple hypothesis tracking (MHT) method that exploits the harmonic structure associated with the pitch in voiced speech in order to segment the onsets and end-points of speech from multiple, overlapping speakers. The proposed method is evaluated against a segmentation system from the literature that uses a spectral representation and is based on bidirectional long short-term memory (BLSTM) networks. The proposed method is shown to achieve comparable performance for segmenting overlapping speakers using only the pitch harmonic information in the MHT framework.

“…A number of approaches were proposed to solve the problem of speaker segmentation. Most of these methods rely on features that fall into three separate categories: acoustic features [9,10], spatial features [11] and linguistic features [12]. More recently data driven, deep learning approaches have become popular [13][14][15].…”

Section: Introduction (mentioning)

confidence: 99%

Multichannel Overlapping Speaker Segmentation Using Multiple Hypothesis Tracking Of Acoustic And Spatial Features

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Naylor 2021

Self Cite

An essential part of any diarization system is the task of speaker segmentation, which is important for many applications including speaker indexing and automatic speech recognition (ASR) in multi-speaker environments. Segmentation of overlapping speech has recently been a key focus of this work. In this paper we explore the use of a new multimodal approach for overlapping speaker segmentation that tracks both the fundamental frequency (F0) of the speaker and the speaker's direction of arrival (DOA) simultaneously. Our proposed multiple hypothesis tracking system, which simultaneously tracks both features, shows an improvement in segmentation performance when compared to tracking these features separately. An illustrative example of overlapping speech demonstrates the effectiveness of our proposed system. We also undertake a statistical analysis on 12 meetings from the AMI corpus and show an improvement in the HIT rate of 14.1% on average against a commonly used deep learning bidirectional long short-term memory (BLSTM) network approach.