Conventional image-processing and machine-learning methods based on handcrafted features struggle to meet the real-time and high-accuracy requirements of industrial defect detection in complex, sensitive, and dynamic environments. To address this issue, this paper proposes AENet, a novel real-time defect detection network based on an encoder-decoder model that achieves high detection accuracy and efficiency while demonstrating good convergence and generalization. First, a spatial channel attention (SCA) module is designed in the encoding network to integrate spatial attention and channel attention through a multi-head 3D self-attention mechanism, improving parallelism and detection efficiency. Second, the decoding network of AENet incorporates a Cross-Level Attention Fusion (CLAF) module that fuses input features from different layers; combined with multi-level upsampling, this enhances the representation of defect features. Furthermore, we insert a simplified aggregator into the encoder-decoder network of AENet to extract feature information at different scales at low computational cost; by incorporating contextual information, this aggregation aids training and inference on industrial defect datasets. Extensive experiments demonstrate that AENet outperforms other segmentation models at defect recognition and segmentation in challenging optical environments, converges faster than competing networks, and strikes a balance between accuracy and speed. On an NVIDIA Tesla V100 GPU, it achieves a recognition precision of over 96% for almost all defect types in a real industrial environment.
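The abstract does not specify the internal design of the SCA module, so the following is only a minimal, hypothetical sketch of the general idea it names: computing a per-channel attention weight and a per-position spatial attention weight over a C×H×W feature map and applying both to the features. All function names and the pooling/gating choices here are illustrative assumptions, not the paper's actual method.

```python
import math

# Hypothetical sketch of combining channel and spatial attention on a
# C x H x W feature map (nested Python lists stand in for tensors).
# The real SCA module uses a multi-head 3D self-attention mechanism,
# which is not reproduced here.

def channel_weights(feat):
    # Global-average-pool each channel, then softmax over channels
    # so the weights sum to 1 (an illustrative gating choice).
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
              for ch in feat]
    m = max(pooled)
    exps = [math.exp(p - m) for p in pooled]
    s = sum(exps)
    return [e / s for e in exps]

def spatial_weights(feat):
    # Average across channels at each position, squashed with a
    # sigmoid to give one weight per pixel.
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    return [[1.0 / (1.0 + math.exp(-sum(feat[c][i][j] for c in range(C)) / C))
             for j in range(W)] for i in range(H)]

def spatial_channel_attention(feat):
    # Scale every activation by its channel weight and its pixel weight.
    cw = channel_weights(feat)
    sw = spatial_weights(feat)
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    return [[[feat[c][i][j] * cw[c] * sw[i][j] for j in range(W)]
             for i in range(H)] for c in range(C)]
```

In an actual network these weights would be produced by learned projections and attention heads; the fixed pooling used above merely shows how the two attention maps jointly re-weight the encoder features.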