Anomaly detection of driving actions based on in-cab surveillance video has become the mainstream of current driving action research. However, spatio-temporal action features extracted with only a 3D encoder contain substantial spatio-temporal redundancy, which weakens the distinguishability between normal and anomalous videos and produces more hard-to-distinguish samples during anomaly detection. To alleviate this problem, we propose a dual-stream model that combines spatio-temporal and appearance features for anomalous driving action detection. First, the model uses a 3D ResNet and a 2D ResNet to extract spatio-temporal action information and appearance information from the video. Next, to fully combine the respective advantages of the cross-dimensional information and obtain fused features that are easier to distinguish, a contrastive-learning-based alignment approach is applied before fusion. The cross-dimensional features are then fused by introducing a Transformer cross-attention block. Moreover, a supervised contrastive learning objective is employed to guide the model in distinguishing normal from anomalous driving actions. Finally, the anomaly score of a target sample is obtained by computing the feature similarity between the sample and the normal memory center. In the experiments, the proposed model, DSTANet, achieves 96.81% AUC on the publicly available DAD dataset, higher than popular models from recent years. Furthermore, visualization experiments show that feature distinguishability is significantly improved. The experiment code and data are publicly available at: https://github.com/CreatedTRYNA/DCC\_Driving\_Anomaly\_Detection-.
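The final scoring step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the fused features are fixed-length vectors, takes the normal memory center to be the mean of normal-sample features, and uses cosine similarity as the feature similarity; the function names and shapes are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Normalize feature vectors to unit length so the dot product
    # below equals cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def anomaly_score(sample_feat, normal_feats):
    """Score a target sample against the normal memory center.

    sample_feat:  (d,) fused feature of the target clip (assumed shape).
    normal_feats: (n, d) fused features of normal training clips.
    Returns a score where higher means more anomalous.
    """
    # Normal memory center: mean of normal features, renormalized.
    center = l2_normalize(normal_feats.mean(axis=0))
    # Cosine similarity between the sample and the center.
    sim = float(l2_normalize(sample_feat) @ center)
    # Low similarity to the normal center -> high anomaly score.
    return 1.0 - sim
```

A sample whose fused feature lies near the normal cluster yields a low score, while one pointing away from the cluster yields a high score, which matches the intuition that anomalies are dissimilar to the normal memory center.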