At present, encoder-decoder-based video captioning models mainly rely on a single video input source. The content of the generated captions is limited because few studies employ external corpus information to guide caption generation, which hinders accurate description and understanding of video content. To address this issue, this work proposes a novel video captioning method guided by a sentence retrieval generation network (ED-SRG). First, we construct an encoder-decoder by integrating a ResNeXt network model, an efficient convolutional network for online video understanding (ECO) model, and a long short-term memory (LSTM) network model, which are used to extract 2D features, 3D features, and object features of the video data, respectively. These features are decoded into textual sentences that conform to the video content and serve as queries for sentence retrieval. Then, a sentence-transformer network model is employed to retrieve sentences from an external corpus that are semantically similar to the generated sentences, and candidate sentences are selected through similarity measurement. Finally, a novel network model is constructed based on the GPT-2 architecture. The model introduces a designed random selector that randomly selects predicted words with a high probability of appearance in the corpus, guiding the generation of textual sentences that better match natural human language expressions. Experiments on the widely used MSVD and MSR-VTT datasets, in comparison with existing works, demonstrate that the proposed method generates sentences with richer semantics and outperforms several state-of-the-art approaches.
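To make the retrieval-and-screening step concrete, the following is a minimal sketch using the sentence-transformers library: the caption produced by the encoder-decoder is embedded, semantically similar sentences are retrieved from an external corpus, and candidates are kept by a cosine-similarity cutoff. The checkpoint name, `top_k`, and `threshold` values here are illustrative assumptions, not the paper's settings.

```python
# Sketch of sentence retrieval and screening: embed the generated caption,
# retrieve semantically similar corpus sentences, and keep candidates whose
# cosine similarity exceeds a threshold.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder checkpoint (assumption)

def retrieve_candidates(query_sentence, corpus, top_k=5, threshold=0.6):
    """Return (sentence, score) pairs from the corpus ranked by cosine similarity."""
    query_emb = model.encode(query_sentence, convert_to_tensor=True)
    corpus_emb = model.encode(corpus, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=top_k)[0]
    return [(corpus[h["corpus_id"]], h["score"]) for h in hits if h["score"] >= threshold]

corpus = [
    "a man is slicing vegetables in a kitchen",
    "a chef chops onions on a cutting board",
    "a dog is running across a field",
]
print(retrieve_candidates("a person is cutting vegetables", corpus))
```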
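The "random selector" described above reads, in our interpretation, as sampling among the highest-probability next-word predictions rather than always taking the single most likely word. Below is a minimal top-k sampling sketch over the standard Hugging Face GPT-2 model; the paper constructs its own GPT-2-based model, so this is only an illustrative stand-in, and the value of `k` is an assumption.

```python
# Sketch of a "random selector" over GPT-2 next-word predictions: restrict to
# the k most probable tokens and sample among them instead of taking argmax.
# This is a top-k reading of the selector; the paper's exact rule may differ.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def random_select_next(prompt, k=10):
    """Sample the next word from the k highest-probability GPT-2 predictions."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1, :]  # scores for the next token
    top_probs, top_ids = torch.topk(torch.softmax(logits, dim=-1), k)
    choice = top_ids[torch.multinomial(top_probs / top_probs.sum(), 1)]
    return tokenizer.decode(choice)

print(random_select_next("A man is slicing"))
```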