With the emergence of various text-to-speech (TTS) systems, developers must provide a superior user experience in order to remain competitive. To this end, quality-of-experience (QoE) perception modelling and measurement have become a key priority. QoE models rely on three types of influence factors: technological, contextual and human. Existing solutions have typically relied on individual physiological modalities, such as electroencephalography (EEG), to model human influence factors (HIFs). In this paper, we show that the fusion of physiological modalities, such as EEG, functional near-infrared spectroscopy (fNIRS) and heart rate, provides gains of up to 18.4% relative to using only technological factors and 4% relative to using the best-performing individual physiological modality.