Automatic recognition and prediction of in-vehicle human activities has a significant impact on the next generation of driver assistance and intelligent autonomous vehicles. In this paper, we present a novel single image driver action recognition algorithm inspired by human perception that often focuses selectively on parts of the images to acquire information at specific places which are distinct to a given task. Unlike existing approaches, we argue that human activity is a combination of pose and semantic contextual cues. In detail, we model this by considering the configuration of body joints, their interaction with objects being represented as a pairwise relation to capture the structural information. Our body-pose and bodyobject interaction representation is built to be semantically rich and meaningful, and is highly discriminative even though it is coupled with a basic linear SVM classifier. We also propose a Multi-stream Deep Fusion Network (MDFN) for combining highlevel semantics with CNN features. Our experimental results demonstrate that the proposed approach significantly improves the drivers' action recognition accuracy on two exacting datasets. Index Terms-Transfer learning, intelligent vehicles, in-vehicle activity monitoring, deep learning, body pose and contextual descriptor, neural network-based fusion.