We propose Cross-Channel Spectral Subtraction (CCSS), a source separation method for recognizing meeting speech where one microphone is prepared for each speaker. The method quickly adapts to changes in transfer functions and uses spectral subtraction to suppress the speech of other speakers. Compared with conventional source separation methods based on independent component analysis (ICA) or that use binary masks, it requires less computational costs and the resulting speech signals have less distortion. In a recognition task of computer-simulated, partially-overlapped speech, CCSS improved the word accuracy from 66.5% to 77.7%. It also significantly improved the recognition accuracy of speech data in actual meetings.
We propose an overlapped speech detection method for speech recognition and speaker diarization of meetings, where each speaker wears a lapel microphone. Two novel features are utilized as inputs for a GMM-based detector. One is speech power after cross-channel spectral subtraction which reduces the power from the other speakers. The other is an amplitude spectral cosine correlation coefficient which effectively extracts the correlation of spectral components in a rather quiet condition. We evaluated our method using a meeting speech corpus of four speakers. The accuracy of our proposed method, 75.7%, was significantly better than that of the conventional method, 66.8%, which uses raw speech power and power spectral Pearson's correlation coefficient.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.