“…This idea is depicted in Figure 1, where we learn a model to recover the filter bank (FBANK) features from the mixed FBANK features and then feed each stream of the recovered FBANK features to a conventional LVCSR system for recognition. In the simplest architecture, denoted Arch#1 and illustrated in Figure 1(a), feature separation can be treated as a multi-class regression problem, similar to many previous works [29], [30], [31], [32], [33], [34]. In this architecture, Y, the features of the mixed speech, are used as the input to a deep learning model, such as a deep neural network (DNN), a convolutional neural network (CNN), or a long short-term memory (LSTM) recurrent neural network (RNN), to estimate the feature representation of each individual talker.…”
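To make the multi-output regression view of Arch#1 concrete, the sketch below trains a tiny one-hidden-layer network with two output heads, one per talker, to map mixed features back to each talker's features. Everything here is an illustrative assumption — the synthetic data, the FBANK dimension, the toy additive mixing, and the minimal NumPy MLP — not the paper's actual DNN/CNN/LSTM models or training setup.

```python
import numpy as np

# Hypothetical sketch of Arch#1: speech separation as multi-output regression
# from mixed FBANK features Y to each talker's FBANK features S1, S2.
rng = np.random.default_rng(0)
D, H, N = 40, 64, 256          # FBANK dim, hidden units, frames (all assumed)

# Synthetic stand-ins for the two talkers and their mixture (toy additive model).
S1 = rng.normal(size=(N, D))
S2 = rng.normal(size=(N, D))
Y = S1 + S2                    # mixed-speech features

# Shared hidden layer, two linear output heads (one per talker).
W1 = rng.normal(scale=0.1, size=(D, H)); b1 = np.zeros(H)
Wa = rng.normal(scale=0.1, size=(H, D)); ba = np.zeros(D)
Wb = rng.normal(scale=0.1, size=(H, D)); bb = np.zeros(D)

lr, losses = 0.01, []
for step in range(200):
    # Forward pass: shared representation, then one prediction per talker.
    h = np.tanh(Y @ W1 + b1)
    P1, P2 = h @ Wa + ba, h @ Wb + bb
    E1, E2 = P1 - S1, P2 - S2
    losses.append(np.mean(E1**2) + np.mean(E2**2))   # summed per-head MSE

    # Backprop through both heads and the shared hidden layer.
    gP1, gP2 = 2 * E1 / E1.size, 2 * E2 / E2.size
    gh = gP1 @ Wa.T + gP2 @ Wb.T
    gz = gh * (1 - h**2)                              # tanh derivative
    Wa -= lr * (h.T @ gP1); ba -= lr * gP1.sum(0)
    Wb -= lr * (h.T @ gP2); bb -= lr * gP2.sum(0)
    W1 -= lr * (Y.T @ gz);  b1 -= lr * gz.sum(0)

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")   # training loss decreases
```

Note one design choice baked into this sketch: the assignment of output heads to talkers is fixed in advance (head A is always trained against S1), which is exactly the multi-class regression formulation the excerpt describes.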