Abstract: This paper shows that time varying pitch properties can be used advantageously within the segmentation step of a multi-talker diarization system. First a study is conducted to verify that changes in pitch are strong indicators of changes in the speaker. It is then highlighted that an individual's pitch is smoothly varying and therefore can be predicted by means of a Kalman filter. Subsequently it is shown that if the pitch is not predictable then this is most likely due to a change in the speaker. Finally, a n…
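The idea in the abstract (a smoothly varying pitch is predictable, and a failed prediction suggests a new speaker) can be illustrated with a minimal sketch. This is not the authors' implementation: the scalar random-walk state model, the function name `detect_pitch_jumps`, and the values of `process_var`, `meas_var` and `gate` are all assumptions chosen for illustration.

```python
def detect_pitch_jumps(pitch_hz, process_var=4.0, meas_var=25.0, gate=3.0):
    """Flag frames whose pitch a scalar Kalman filter fails to predict.

    Assumed model (illustrative only): pitch follows a random walk with
    variance `process_var` and is observed with noise variance `meas_var`.
    A frame is flagged when the innovation exceeds `gate` standard deviations.
    """
    if not pitch_hz:
        return []
    estimate = pitch_hz[0]   # current pitch estimate in Hz
    variance = meas_var      # uncertainty of that estimate
    changes = []
    for t in range(1, len(pitch_hz)):
        # Predict: a random walk keeps the mean and inflates the variance.
        predicted_var = variance + process_var
        innovation = pitch_hz[t] - estimate
        innovation_var = predicted_var + meas_var
        if innovation ** 2 > gate ** 2 * innovation_var:
            # Pitch was not predictable: treat as a candidate speaker change
            # and restart the track from the new observation.
            changes.append(t)
            estimate = pitch_hz[t]
            variance = meas_var
            continue
        # Update: blend prediction and observation with the Kalman gain.
        gain = predicted_var / innovation_var
        estimate += gain * innovation
        variance = (1.0 - gain) * predicted_var
    return changes


# Hypothetical pitch track: about 120 Hz, then an abrupt move to about 210 Hz.
track = [120, 121, 119, 120, 122, 121, 210, 211, 209, 210]
jumps = detect_pitch_jumps(track)
```

On this made-up track the only flagged frame is the one where the pitch jumps, while a slowly drifting track produces no flags. A real system would also need to handle unvoiced frames and pitch-halving or doubling errors, which this sketch ignores.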
“…To explore this question on the AMI corpus [36], pitch estimates were calculated using the method of [21] applied to the IHM mixed-down stream of 16 AMI meetings. A Kalman filter [37] was used to track the pitch of the IHM mixed-down single channel stream as proposed in [38]. The Kalman track relies on the smooth variation of a speaker's pitch due to physiological constraints [39].…”
Section: B Temporal Variations In Fundamental Frequency (mentioning)
confidence: 99%
“…Speaker change detection is achieved by exploiting the temporal variations in the pitch. To accomplish this the method of [38], that operates using a single track, is utilised here. It was shown in Section III-C that multiple tracks can be generated for the same speaker.…”
Section: A Exp-1: Full Segmentation Using Proposed System (mentioning)
This paper demonstrates how the harmonic structure of voiced speech can be exploited to segment multiple overlapping speakers in a speaker diarization task. We explore how a change in the speaker can be inferred from a change in pitch. We show that voiced harmonics can be useful in detecting when more than one speaker is talking, such as during overlapping speaker activity. A novel system is proposed to track multiple harmonics simultaneously, allowing for the determination of onsets and end-points of a speaker's utterance in the presence of an additional active speaker. This system is bench-marked against a segmentation system from the literature that employs a bidirectional long short term memory network (BLSTM) approach and requires training. Experimental results highlight that the proposed approach outperforms the BLSTM baseline approach by 12.9% in terms of HIT rate for speaker segmentation. We also show that the estimated pitch tracks of our system can be used as features to the BLSTM to achieve further improvements of 1.21% in terms of coverage and 2.45% in terms of purity.
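The abstract reports coverage and purity but this page does not define them. The sketch below uses one common frame-level definition, which is an assumption and may differ from the authors' evaluation: purity is the fraction of frames that fall in the dominant reference segment of their hypothesis segment, and coverage is the same quantity with the roles swapped. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def _dominant_overlap(primary, secondary):
    """Fraction of frames lying in the best-matching `secondary` segment
    of each `primary` segment (both given as per-frame segment ids)."""
    if len(primary) != len(secondary):
        raise ValueError("label sequences must have the same length")
    pair_counts = Counter(zip(primary, secondary))
    best = {}
    for (p_id, _s_id), n in pair_counts.items():
        best[p_id] = max(best.get(p_id, 0), n)
    return sum(best.values()) / len(primary)


def purity(reference, hypothesis):
    # How homogeneous each hypothesis segment is with respect to the reference.
    return _dominant_overlap(hypothesis, reference)


def coverage(reference, hypothesis):
    # How completely each reference segment is covered by one hypothesis segment.
    return _dominant_overlap(reference, hypothesis)


# Toy example: 8 frames, the hypothesis boundary is one frame early.
reference = ["A"] * 4 + ["B"] * 4
hypothesis = [0] * 3 + [1] * 5
scores = (purity(reference, hypothesis), coverage(reference, hypothesis))
```

Each id must identify a segment rather than a speaker, so a speaker who talks twice gets two ids. In the toy example both scores are 7/8, since one frame of the first reference segment lands in the second hypothesis segment.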
“…To obtain the full segmentation it is also necessary to detect speaker changes, ct, when the onset of a new speaker happens after the end-point of the previous speaker. To accomplish this the authors' previous method, that operates using a single track, is further developed here [21]. Speaker change detection is achieved by exploiting the temporal variations in the pitch.…”
Section: Speaker Change Detection (mentioning)
confidence: 99%
“…(d) Segments of overlapping speech are identified as the onsets and end-points of multiple, uncorrelated pitch tracks. (e) The complete segmentation is obtained from the union of overlapping speech onsets with the speaker changes detected based on a model of the temporal variation of pitch [21].…”
Speaker segmentation is an essential part of any diarization system. Applications of diarization include tasks such as speaker indexing, improving automatic speech recognition (ASR) performance and making single speaker-based algorithms available for use in multi-speaker environments. This paper proposes a multiple hypothesis tracking (MHT) method that exploits the harmonic structure associated with the pitch in voiced speech in order to segment the onsets and end-points of speech from multiple, overlapping speakers. The proposed method is evaluated against a segmentation system from the literature that uses a spectral representation and is based on employing bidirectional long short term memory networks (BLSTM). The proposed method is shown to achieve comparable performance for segmenting overlapping speakers only using the pitch harmonic information in the MHT framework.
“…A number of approaches were proposed to solve the problem of speaker segmentation. Most of these methods rely on features that fall into three separate categories: acoustic features [9,10], spatial features [11] and linguistic features [12]. More recently data driven, deep learning approaches have become popular [13][14][15].…”
An essential part of any diarization system is the task of speaker segmentation which is important for many applications including speaker indexing and automatic speech recognition (ASR) in multi-speaker environments. Segmentation of overlapping speech has recently been a key focus of this work. In this paper we explore the use of a new multimodal approach for overlapping speaker segmentation that tracks both the fundamental frequency (F0) of the speaker and the speaker's direction of arrival (DOA) simultaneously. Our proposed multiple hypothesis tracking system, which simultaneously tracks both features, shows an improvement in segmentation performance when compared to tracking these features separately. An illustrative example of overlapping speech demonstrates the effectiveness of our proposed system. We also undertake a statistical analysis on 12 meetings from the AMI corpus and show an improvement in the HIT rate of 14.1% on average against a commonly used deep learning bidirectional long short term memory network (BLSTM) approach.