Quantum computers pose a severe threat to public-key cryptography based on traditional number theory, and post-quantum cryptography (PQC) has therefore received substantial attention. However, because their polynomial operations are time-consuming, most post-quantum schemes are difficult to deploy in practice, especially in high-concurrency scenarios. In this paper, we focus on the post-quantum signature algorithm DILITHIUM and present an optimized, highly parallel implementation on graphics processing units (GPUs). We give two optimized versions, named single mode and batch mode. In single mode, we show efficient implementations of the number theoretic transform (NTT) and other polynomial operations adapted to the scheme. In batch mode, we improve the rejection sampling algorithm so that multiple sets of data can be processed simultaneously. Finally, we implement our schemes on GPUs with different architectures, such as Pascal, Volta, and Turing. The results show that the speedup ratio increases with the number of data groups processed, reaching up to 11.18× for 15360 groups of data in our experimental environment. In single mode, we achieve a speedup of more than 4× for complex operations such as the NTT and its inverse, and 2× to 4× for simpler operations such as reduce, caddq, and decompose. Compared with the original DILITHIUM scheme tested on the high-performance CPU 6133, our scheme achieves a 1.22× speedup even after taking the communication overhead between CPU and GPU into account.
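For orientation, the simpler coefficient-wise operations named above can be sketched in plain C as follows. This is a minimal illustration modeled on the public DILITHIUM reference code, not the GPU implementation presented in this paper. The modulus Q = 8380417 and the polynomial length N = 256 are the standard DILITHIUM parameters; the helper name `poly_caddq` and the remark about mapping loop iterations to GPU threads are our own assumptions for illustration.

```c
#include <stdint.h>

#define Q 8380417   /* DILITHIUM modulus, 2^23 - 2^13 + 1 */
#define N 256       /* coefficients per polynomial */

/* reduce: map a to a representative r with r = a (mod Q) and
 * -6283009 <= r <= 6283007, assuming a <= 2^31 - 2^22 - 1.
 * Relies on arithmetic right shift of negative values, which common
 * compilers provide but the C standard leaves implementation-defined. */
int32_t reduce32(int32_t a) {
    int32_t t = (a + (1 << 22)) >> 23;  /* rounded estimate of a / 2^23 */
    return a - t * Q;
}

/* caddq: add Q if a is negative, without branching.
 * (a >> 31) is all ones for negative a and zero otherwise. */
int32_t caddq(int32_t a) {
    a += (a >> 31) & Q;
    return a;
}

/* Hypothetical helper: apply caddq to every coefficient of a polynomial.
 * Each iteration is independent, so in a GPU setting one could assign
 * one coefficient to one thread (an assumption, not the paper's kernel). */
void poly_caddq(int32_t coeffs[N]) {
    for (int i = 0; i < N; ++i) {
        coeffs[i] = caddq(coeffs[i]);
    }
}
```

Because neither function branches on the data and each coefficient is handled independently, operations of this kind are natural candidates for data-parallel execution.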