In this paper, a parallel decoder and a word group prediction module are proposed to speed up decoding and improve the effect of captions. The features of the image extracted by the encoder are linearly projected to different word groups, and then a unique relaxed mask matrix is designed to improve the decoding speed and the caption effect. First, since image captioning is composed of many words, sentences can also be broken down into word groups or words according to their syntactic structure, and we achieve this function through constituency parsing. Second, we make full use of the extracted features to predict the size of word groups. Then, a new embedding representing the information of the word is proposed based on word embedding. Finally, with the help of word groups, we design a mask matrix to modify the decoding process so that each step of the model can produce one or more words in parallel. Experiments on public datasets demonstrate that our method can reduce the time complexity while maintaining competitive performance.