Unmanned aerial vehicle (UAV) photography has become a crucial tool for detecting outfalls that discharge into rivers and oceans. However, outfalls are currently located in aerial images largely through visual interpretation by skilled experts, which is time-consuming and inefficient. To address this issue, we propose a lightweight deep learning model for detecting outfall objects in aerial images. The backbone of the proposed model is a Lightweight Convolutional Vision Transformer network (LCVT) built from two novel blocks: Separated Down-sampled Self-Attention (SDSA) and Convolutional Feed-Forward Network with Shortcut (CFNS). These blocks are designed to capture information at different granularities in the feature map and to build both local and global representations. The model uses a Path Aggregation Feature Pyramid Network (PAFPN) as the neck and a lightweight decoupled network as the head. Experiments demonstrate that our model achieves the highest accuracy (81.5%) while using only 2.47 M parameters and 3.95 GFLOPs. Visualization analysis shows that our model pays more attention to true outfall objects. Additionally, we have developed an intelligent outfall detection system based on the proposed model, and experimental results show that it performs well on the outfall detection task. The model and code are available at https://github.com/ISCLab-Bistu/LCVT.
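
To give a rough sense of why down-sampling inside self-attention can reduce computation, the following sketch estimates the multiply-accumulate count of a single-head attention layer. It is an illustration under stated assumptions, not the SDSA block itself: the abstract does not specify how SDSA down-samples, so the sketch assumes that keys and values are spatially reduced by a stride, a common pattern in lightweight vision transformers. The function `attention_cost`, the parameter `kv_stride`, and the feature-map sizes are hypothetical.

```python
def attention_cost(height, width, dim, kv_stride=1):
    """Estimate multiply-accumulates (MACs) for one single-head self-attention
    layer over a height x width feature map with `dim` channels.

    ASSUMPTION (not from the paper): keys and values are spatially
    down-sampled by `kv_stride` along each axis; queries keep full resolution.
    Linear projections and softmax are ignored for simplicity.
    """
    if height % kv_stride or width % kv_stride:
        raise ValueError("kv_stride must divide both height and width")

    n_queries = height * width
    n_keys = (height // kv_stride) * (width // kv_stride)

    score_macs = n_queries * n_keys * dim     # Q @ K^T
    weighted_macs = n_queries * n_keys * dim  # softmax(scores) @ V
    return {"queries": n_queries, "keys": n_keys,
            "macs": score_macs + weighted_macs}


# Hypothetical 40 x 40 feature map with 64 channels.
full = attention_cost(40, 40, 64)
reduced = attention_cost(40, 40, 64, kv_stride=4)
ratio = full["macs"] // reduced["macs"]
print(full["macs"], reduced["macs"], ratio)
```

Under these assumptions, the attention cost falls with the square of the stride, because the number of keys and values shrinks by that factor while the number of queries is unchanged.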