Attending to one particular speaker while many people talk simultaneously is known as the cocktail party problem. It remains a difficult, unsolved task, especially for single-channel speech separation. Inspired by the physiological phenomenon that humans tend to pick out the sounds that attract their attention from mixed signals, we propose the multi-head self-attention deep clustering network (ADCNet) for this problem. We combine the widely used deep clustering network with a multi-head self-attention mechanism and explore how the number of attention heads affects separation performance. We also adopt the density-based canopy K-means algorithm to further improve performance. We trained and evaluated our system on two- and three-talker mixtures from the Wall Street Journal dataset (WSJ0). Experimental results show that the new approach achieves better performance than many advanced models.

INDEX TERMS Single-channel speech separation, deep clustering, multi-head self-attention, density-based canopy K-means.

YAN WANG was born in Anhui, China, in 1996. She is currently pursuing the M.S. degree in communication and information systems with the School of Communication and Information Engineering, Shanghai University. Her current research interests include indoor localization and ultra-wideband location.