Despite considerable progress in speech separation, speech enhancement, and automatic speech recognition, realistic meeting recognition remains largely unsolved. Most research on speech separation either focuses on spectral cues to address single-channel recordings or on spatial cues to separate multi-channel recordings, and relies exclusively on either neural networks or probabilistic graphical models. Integrating a spatial clustering approach with a deep learning approach based on spectral cues in a single framework can significantly improve automatic speech recognition performance and generalizability: the neural network profits from vast amounts of training data, while its probabilistic counterpart adapts to the current acoustic scene. This thesis therefore concentrates on the integration of two largely disjoint research streams, namely single-channel deep learning-based source separation and multi-channel probabilistic model-based source separation. It provides a general framework for integrating spatial and spectral cues in which neural networks and probabilistic graphical models complement each other to achieve state-of-the-art blind source separation performance on noisy, reverberant data. The efficacy of the proposed approaches is evaluated on simulated artificial mixtures as well as on real recordings of simultaneously active speakers. The key findings are that (1) a cascade integration, in which a neural network initializes a probabilistic graphical model, yields substantial improvements; (2) spatial cues can be used to train neural networks without supervision; and (3) a tight integration, in which both models are driven to a joint agreement, leads to the lowest word error rates and the best generalization to unseen real mixtures.
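
The cascade integration named in finding (1) can be illustrated with a minimal sketch: time-frequency masks, as a neural network might produce them, initialize the posteriors of a spatial mixture model, which is then refined with a few EM iterations. This is not the thesis's implementation; all names (`em_refine`, the zero-mean complex Gaussian mixture in place of a full cACGMM, the toy steering vectors) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def em_refine(Y, masks, iterations=5, eps=1e-8):
    """Refine speaker masks with a zero-mean complex Gaussian spatial
    mixture (one spatial covariance per source) for one frequency bin.

    Y:     (T, D) complex observation vectors (T frames, D microphones)
    masks: (T, K) initial class posteriors, e.g. from a neural network
    """
    T, D = Y.shape
    K = masks.shape[1]
    gamma = masks.copy()
    for _ in range(iterations):
        log_lik = np.empty((T, K))
        for k in range(K):
            # M-step: weighted spatial covariance of class k
            w = gamma[:, k][:, None]
            Sigma = (w * Y).T @ Y.conj() / (w.sum() + eps)
            Sigma += eps * np.eye(D)
            inv = np.linalg.inv(Sigma)
            _, logdet = np.linalg.slogdet(Sigma)
            # log N(y; 0, Sigma) up to an additive constant
            quad = np.einsum('td,de,te->t', Y.conj(), inv, Y).real
            log_lik[:, k] = -quad - logdet
        # E-step: posterior class affiliations = refined masks
        log_lik -= log_lik.max(axis=1, keepdims=True)
        gamma = np.exp(log_lik)
        gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma

# Toy scene: two sources with distinct (hypothetical) steering vectors.
T, D = 400, 3
a1 = np.array([1.0, 1.0, 1.0]) / np.sqrt(3)
a2 = np.array([1.0, -1.0, 1.0]) / np.sqrt(3)
s = rng.integers(0, 2, T)                      # dominant source per frame
amp = rng.normal(size=T) + 1j * rng.normal(size=T)
Y = np.where(s[:, None] == 0, a1, a2) * amp[:, None]
Y += 0.05 * (rng.normal(size=(T, D)) + 1j * rng.normal(size=(T, D)))

# Imperfect "network" masks: confident and correct on ~80% of frames,
# uninformative (0.5/0.5) on the rest.
init = np.full((T, 2), 0.5)
good = rng.random(T) < 0.8
init[good, s[good]] = 0.9
init[good, 1 - s[good]] = 0.1

masks = em_refine(Y, init)
```

The design point the sketch mirrors is the division of labor from the abstract: the learned masks carry spectral knowledge from training data, while the EM refinement adapts the spatial model to the scene at hand.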