Approximately 50% of the Earth’s surface is covered by clouds. Because visible-light electromagnetic waves cannot readily penetrate clouds, optical remote sensing satellites struggle to capture clear images of the ground. Cloud detection is therefore an essential, basic step in the processing and application of optical satellite remote sensing images. Traditional threshold-based cloud detection methods set thresholds from the spectral information of the images, which makes them sensitive to imaging conditions and poorly adaptable when those conditions vary. To improve cloud detection performance, this paper proposes a deep learning method that combines a residual network module with a pyramid structure of dilated convolutions. The method uses an encoder-decoder structure in which residual convolution modules replace conventional convolutions, reducing the number of parameters while strengthening feature representation. In addition, a dilated convolution pyramid is introduced between the encoder and the decoder to capture more global information and reduce the misclassification of cloud pixels. Ablation experiments are conducted to verify the reliability of the proposed network model, and GaoFen-1 remote sensing imagery is used to evaluate the effectiveness of the method. The results indicate that, compared with traditional methods, the proposed approach achieves satisfactory cloud detection across various surface types, including barren land, forests, grasslands/crops, and wetlands, while having a smaller model size and fewer parameters.
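To illustrate why a pyramid of dilated convolutions widens the context seen by each pixel, the following is a minimal pure-Python sketch, not the network proposed here. The function names, the dilation rates, and the choice to fuse the parallel branches by summation are all assumptions made for illustration; the abstract does not specify them. The only general fact relied on is that a kernel of size k with dilation d spans k + (k - 1)(d - 1) pixels.

```python
from typing import List

Grid = List[List[float]]


def effective_kernel_size(kernel: int, dilation: int) -> int:
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)


def dilated_conv2d(image: Grid, kernel: Grid, dilation: int) -> Grid:
    """Same-size 2D dilated convolution (cross-correlation), zero padded."""
    h, w = len(image), len(image[0])
    k = len(kernel)
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    # Kernel taps are spaced `dilation` pixels apart.
                    yy = y + (i - half) * dilation
                    xx = x + (j - half) * dilation
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def dilated_pyramid(image: Grid, kernel: Grid, rates: List[int]) -> Grid:
    """Run one dilated branch per rate in parallel and sum the outputs.

    Summation is an assumed fusion rule; concatenation is another option.
    """
    h, w = len(image), len(image[0])
    fused = [[0.0] * w for _ in range(h)]
    for rate in rates:
        branch = dilated_conv2d(image, kernel, rate)
        for y in range(h):
            for x in range(w):
                fused[y][x] += branch[y][x]
    return fused


# Hypothetical usage: a single bright pixel in a 9x9 image.
image = [[0.0] * 9 for _ in range(9)]
image[4][4] = 1.0
ones3 = [[1.0] * 3 for _ in range(3)]
fused = dilated_pyramid(image, ones3, rates=[1, 2])
```

With the same nine weights per branch, the rate-2 branch responds to the bright pixel from two pixels away, whereas the rate-1 branch only reaches its immediate neighbours. This is the sense in which the pyramid gathers wider context without adding parameters per branch.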