Image segmentation has played an essential role in computer vision. The target detection model represented by YOLOv5 is widely used in image segmentation. However, YOLOv5 has performance bottlenecks such as object scale variation, object occlusion, computational volume, and speed when processing complex images. To solve these problems, an enhanced algorithm based on YOLOv5 is proposed. MobileViT is used as the backbone network of the YOLOv5 algorithm, and feature fusion and dilated convolution are added to the model. This method is validated on the COCO and PASCAL-VOC datasets. Experimental results show that it significantly reduces the processing time and achieves high segmentation quality with an accuracy of 95.32% on COCO and 96.02% on PASCAL-VOC. The improved model is 116 M, 52 M, and 76 M, smaller than U-Net, SegNet, and Mask R-CNN, respectively. This paper provides a new idea and method with which to solve the problems in the field of image segmentation, and the method has strong practicality and generalization value.