OpenMP is a mainstream heterogeneous programming model, so research on its offload performance is of considerable practical significance. The Songshan supercomputer installed at the Zhengzhou Supercomputing Center is a new-generation E-class high-performance computing cluster developed independently in China, and the DCU chips it uses are also domestically developed. To improve OpenMP offload performance on this platform and make full use of hardware resources such as registers, a redundant-loop optimization for thread iteration on the DCU is proposed. It allows a thread to release its register resources promptly after completing its computation, which relieves back-end register allocation pressure and improves program performance. In addition, building on the loop unrolling optimization in LLVM and taking into account the hardware and instruction set characteristics of the domestic platform, an improved algorithm for computing the loop unrolling factor is proposed to increase the benefit of loop unrolling. In tests with SPEC ACCEL and Polybench, the thread iteration optimization reduced the overall register count by 33.7% on average, and the loop unrolling optimization improved performance by 37% on average.