Motivation. To improve the accuracy of dense crowd counting and target detection in day-and-night video surveillance, this paper designs a crowd counting and crowd detection network for both daytime and nighttime use, based on multimodal image fusion. Methods. Two sub-models, RGBD-Net and RGBT-Net, are designed. Depth image features and thermal imaging features are fused with the features of visible-light images, which makes the model more robust to the noise caused by sudden drops in illumination at night. Both models use a density-map-regression-guided detection method to perform crowd counting and detection. Results. For daytime scenes, the model was trained and tested on the MICC dataset, where it achieved a mean absolute error of 1.025, a mean squared error of 1.521, and a target detection recall of 97.11%. For night scenes, training and testing were carried out on the RGBT-CC dataset, where the network achieved a mean absolute error of 18.16, a mean squared error of 32.14, and a target detection recall of 97.65%. Experiments on the effectiveness of the multimodal mid-level fusion network show that it outperforms the current state-of-the-art bimodal fusion methods. Conclusion. The experimental results show that the proposed multimodal fusion network can address the counting and detection problem in video surveillance during both day and night. Ablation experiments further support the effectiveness of the parameter settings of the two models.
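For readers unfamiliar with the reported metrics, the following is a minimal sketch of how counting error and detection recall are commonly computed in density-map-based crowd counting. It is not the authors' evaluation code. The function names (`count_from_density`, `mean_absolute_error`, `mean_squared_error`, `detection_recall`) and the toy inputs are hypothetical. Note also that some crowd counting papers report the square root of the mean squared error under the name "MSE"; the abstract does not state which convention is used here, so the sketch computes the plain mean of squared errors.

```python
from typing import Sequence


def count_from_density(density_map: Sequence[Sequence[float]]) -> float:
    """Estimated crowd count: the sum of all values in a 2-D density map."""
    return sum(sum(row) for row in density_map)


def mean_absolute_error(pred: Sequence[float], true: Sequence[float]) -> float:
    """Average of |predicted count - true count| over all test images."""
    if len(pred) != len(true) or len(pred) == 0:
        raise ValueError("pred and true must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)


def mean_squared_error(pred: Sequence[float], true: Sequence[float]) -> float:
    """Average of (predicted count - true count)^2 over all test images.

    Assumption: plain mean of squared errors, without a square root.
    """
    if len(pred) != len(true) or len(pred) == 0:
        raise ValueError("pred and true must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)


def detection_recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of ground-truth targets that were detected."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no ground-truth targets")
    return true_positives / total


if __name__ == "__main__":
    # Toy data, not taken from MICC or RGBT-CC.
    toy_density = [[0.5, 0.5], [1.0, 0.0]]
    predicted_counts = [count_from_density(toy_density), 4.0]
    true_counts = [3.0, 6.0]
    print(mean_absolute_error(predicted_counts, true_counts))
    print(mean_squared_error(predicted_counts, true_counts))
    print(detection_recall(true_positives=97, false_negatives=3))
```

Lower values are better for both counting errors, and higher is better for recall. The squared-error metric penalizes large per-image miscounts more heavily than the absolute-error metric does.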