Storing the constant transformation matrices of the classic Winograd algorithm on FPGAs consumes a large amount of on-chip resources, which reduces the model's throughput and resource efficiency. To address these issues, this paper designs a hardware accelerator that combines the Winograd algorithm with GEMM. Within this accelerator, a Winograd-GEMM-Shared Processing Element (WGS PE) is designed to share on-chip computing resources, switch flexibly between convolution kernels of different sizes, and reuse the data in the buffer. Meanwhile, this paper applies dynamic 16-bit fixed-point quantization, multi-level line-cache pipelining, and im2col+GEMM optimization strategies to the WGS PE, which significantly improves the model's throughput and resource efficiency. In the experiments, the YOLOv2 algorithm was implemented on the FPGA and tested on the COCO dataset. The detection latency of a single image was reduced to 168.68 ms at a detection accuracy of 80.94\%; the DSP resource efficiency reached 0.61, 1.4x that of comparable designs; the throughput was 65.98 GOP/S, and the total power consumption was 3.26 W. The experimental results demonstrate that the designed hardware accelerator fully utilizes on-chip computing resources and shows great potential for deployment on FPGAs to accelerate object detection.
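The im2col+GEMM strategy mentioned in the abstract lowers convolution to a single matrix multiply by unfolding input patches into columns. The following is a minimal, single-channel sketch of the idea (not the paper's hardware implementation; all function names here are illustrative):

```python
import numpy as np

def im2col(x, k):
    """Unfold every k x k patch of a 2-D image into one column of a matrix."""
    h, w = x.shape
    oh, ow = h - k + 1, w - k + 1          # output spatial size (stride 1, no padding)
    cols = np.empty((k * k, oh * ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[i:i + k, j:j + k].ravel()
    return cols

def conv2d_gemm(x, kern):
    """Convolution (as cross-correlation) expressed as one GEMM call."""
    k = kern.shape[0]
    cols = im2col(x, k)
    out = kern.ravel() @ cols              # (1 x k*k) times (k*k x oh*ow)
    return out.reshape(x.shape[0] - k + 1, x.shape[1] - k + 1)
```

In a full accelerator, multiple input channels and output filters extend this to a large matrix product, which is why the same multiply-accumulate array can serve both the GEMM path and the Winograd path.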