The temporal and spatial neural processing of faces has been investigated rigorously, but few studies have unified these dimensions to reveal the spatio-temporal dynamics postulated by models of face processing. We used support vector machine decoding and representational similarity analysis to combine information from different locations (fMRI), time windows (EEG), and theoretical models. By correlating information matrices derived from pairwise classifications of neural responses to different facial expressions (neutral, happy, fearful, angry), we found early EEG time windows (starting around 130 ms) to match fMRI data from early visual cortex (EVC), and later time windows (starting around 190 ms) to match data from occipital and fusiform face areas (OFA/FFA) and posterior superior temporal sulcus (pSTS). According to model comparisons, the EEG classifications were based more on low-level visual features than on expression intensities or categories. In fMRI, the model comparisons revealed a change along the processing hierarchy, from low-level visual feature coding in EVC to coding of expression intensity in the right pSTS. The results highlight the importance of a multimodal approach for understanding the functional roles of different brain regions in face processing.
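To make the fusion logic concrete, the sketch below illustrates the general approach of correlating decoding-based information matrices across modalities: pairwise linear SVM classification accuracies for the four expression conditions are assembled into a condition-by-condition matrix per EEG time window and per fMRI region, and the off-diagonal cells are then rank-correlated. This is a minimal sketch on synthetic data; the array shapes, variable names, and the use of scikit-learn/SciPy are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of decoding-based EEG-fMRI fusion (synthetic data).
# All shapes and names here are hypothetical stand-ins, not the paper's pipeline.
import numpy as np
from itertools import combinations
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
conditions = ["neutral", "happy", "fearful", "angry"]
n_trials, n_eeg_feats, n_voxels, n_times = 40, 64, 200, 10  # assumed sizes

def pairwise_decoding_matrix(data, labels):
    """Fill a condition-by-condition matrix with cross-validated pairwise
    SVM accuracies; higher values mean more discriminable conditions."""
    n = len(conditions)
    m = np.zeros((n, n))
    for i, j in combinations(range(n), 2):
        mask = np.isin(labels, [i, j])
        acc = cross_val_score(SVC(kernel="linear"),
                              data[mask], labels[mask], cv=5).mean()
        m[i, j] = m[j, i] = acc
    return m

labels = np.repeat(np.arange(len(conditions)), n_trials)
# Synthetic stand-ins for single-trial EEG (one matrix per time window)
# and fMRI response patterns from one region of interest.
eeg = rng.normal(size=(n_times, len(conditions) * n_trials, n_eeg_feats))
fmri = rng.normal(size=(len(conditions) * n_trials, n_voxels))

fmri_matrix = pairwise_decoding_matrix(fmri, labels)
iu = np.triu_indices(len(conditions), k=1)  # off-diagonal cells only
for t in range(n_times):
    eeg_matrix = pairwise_decoding_matrix(eeg[t], labels)
    rho, _ = spearmanr(eeg_matrix[iu], fmri_matrix[iu])
    print(f"time window {t}: EEG-fMRI matrix correlation rho = {rho:.2f}")
```

In an analysis of this kind, time windows whose EEG matrices correlate with a given region's fMRI matrix are taken to indicate when that region's representational content emerges, which is how the early (EVC) versus later (OFA/FFA, pSTS) correspondence described above would be read off.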