“…To address this issue, the authors of [1,10] proposed a context-driven medical report generation method for retinal images, in which the context takes the form of a keyword sequence. Because the context-driven method has multi-modal inputs, i.e., the keywords and the image, the authors of [1] fuse the multi-modal information by averaging. However, average-based fusion in this case probably cannot effectively capture the interactive information between the context and the image [7,6,11,12,13,14,15,16,17,18,19,20,21,22,23,1].…”
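
To make the concern about average-based fusion concrete, the following is a minimal sketch, not the implementation of [1]. It assumes that the image and each keyword are already encoded as feature vectors of the same dimension, that the keyword embeddings are first averaged into a single context vector, and that this context vector is then averaged with the image feature. The names `mean_vectors` and `average_fusion`, as well as the toy two-dimensional vectors, are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vectors(vectors: Sequence[Sequence[float]]) -> Vector:
    """Return the element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must share one dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def average_fusion(image_feature: Sequence[float],
                   keyword_embeddings: Sequence[Sequence[float]]) -> Vector:
    """Fuse two modalities by plain averaging (assumed reading of 'average method')."""
    # Assumption: the keyword sequence is pooled into one context vector first.
    context = mean_vectors(keyword_embeddings)
    # Assumption: image and context are weighted equally.
    return mean_vectors([image_feature, context])


# Two different (image, context) pairs, with the modalities' contents swapped.
fused_a = average_fusion([1.0, 0.0], [[0.0, 1.0]])
fused_b = average_fusion([0.0, 1.0], [[1.0, 0.0]])

# Averaging is symmetric in its inputs, so both pairs yield the same vector.
print(fused_a == fused_b)
```

In this toy setting the two pairs are indistinguishable after fusion, because averaging treats each feature dimension independently and does not model which modality contributed which value. This is one simple way to see why such a fusion may fail to capture interactions between the context and the image.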