Deep learning has greatly improved the performance of automatic speech recognition. Recognition accuracy still degrades under severe environmental noise, however, because falsely detected words and speech segments become more frequent. To address this problem, many speech enhancement methods have been proposed, which suppress the noise and emphasize only the target speech. Most of these methods require assumptions about the sound source. In addition, conventional speech enhancement methods use a single model or network, and so do not fully exploit the key features in the input signal. In this paper, we report a speech enhancement method based on beamforming with an ensemble time-frequency mask. The ensemble time-frequency mask is generated by estimating multiple time-frequency masks with multiple speech enhancement methods and integrating them. Using masks estimated by multiple methods is expected to make the enhancement more robust. We evaluated the proposed method on the CHiME-3 dataset using PESQ and STOI, two metrics that correlate with human auditory perception. On both metrics, the proposed method outperforms the same method without the ensemble, indicating its effectiveness. We also conducted a validation experiment on the ensemble method used in the proposed method.
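The pipeline summarized above (several masks estimated, integrated into one, then used to drive a beamformer) can be illustrated with a minimal NumPy sketch. The abstract does not state how the masks are integrated or which beamformer is used, so the choices here are assumptions made only for illustration: integration is a weighted average, and the beamformer is a mask-based MVDR in the reference-channel formulation. All function names and parameters (`ensemble_mask`, `masked_covariance`, `mvdr_weights`, `beamform`, `diag_load`) are hypothetical and are not taken from the paper.

```python
import numpy as np


def ensemble_mask(masks, weights=None):
    """Integrate per-method T-F masks, each of shape (F, T) with values in [0, 1].

    Assumption: integration is a weighted average; the paper may use another rule.
    """
    stacked = np.stack(masks, axis=0)                    # (M, F, T)
    if weights is None:
        weights = np.ones(len(masks))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()                    # normalize to sum to 1
    combined = np.tensordot(weights, stacked, axes=1)    # (F, T)
    return np.clip(combined, 0.0, 1.0)


def masked_covariance(stft, mask, eps=1e-10):
    """Mask-weighted spatial covariance per frequency bin.

    stft: complex array (C, F, T); mask: real array (F, T). Returns (F, C, C).
    """
    weighted = stft * mask[None, :, :]
    cov = np.einsum("cft,dft->fcd", weighted, stft.conj())
    return cov / (mask.sum(axis=1)[:, None, None] + eps)


def mvdr_weights(speech_cov, noise_cov, ref_channel=0, diag_load=1e-6):
    """MVDR filter per bin: w = (Phi_n^-1 Phi_s) u / trace(Phi_n^-1 Phi_s)."""
    n_ch = speech_cov.shape[-1]
    # Diagonal loading keeps the noise covariance invertible.
    noise_cov = noise_cov + diag_load * np.eye(n_ch)
    numerator = np.linalg.solve(noise_cov, speech_cov)   # (F, C, C)
    trace = np.trace(numerator, axis1=-2, axis2=-1)[:, None]
    return numerator[:, :, ref_channel] / (trace + 1e-10)  # (F, C)


def beamform(stft, weights):
    """Apply the filter: y[f, t] = w[f]^H x[f, t]. Returns (F, T)."""
    return np.einsum("fc,cft->ft", weights.conj(), stft)
```

In use, the integrated mask `m` would give the speech covariance via `masked_covariance(stft, m)` and the noise covariance via `masked_covariance(stft, 1 - m)`, so a mask that is wrong in one estimator can be partly offset by the others before the spatial statistics are computed.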