Deep learning-based ECG interpretation has recently emerged as a promising approach for automating report generation, reducing time and expense in the clinic. However, ECG descriptions generated by previous methods fall considerably short of cardiologists' diagnoses. This research proposes a deep network that extracts features by combining two branches: one processing the 1D input signal and one processing its 2D spectrogram. Fusing the 1D and 2D features guides the Transformer-based captioning network to focus on intricate ECG components such as the P and T waves. In addition, we propose a novel prior-knowledge module that employs a co-attention Siamese network to improve accuracy and better capture distinctive ECG characteristics. We also introduce an efficient augmentation technique to mitigate data imbalance, and adopt the FNet module to streamline the attention blocks in the Transformer architecture. Performance is validated on the PhysioNet Challenge 2021 database of approximately 88,243 twelve-lead ECG recordings. We train the model on free-text ECG annotations and obtain BLEU-1, BLEU-2, BLEU-3, BLEU-4, METEOR, ROUGE, and CIDEr scores of 0.88, 0.87, 0.81, 0.79, 0.62, 0.88, and 8.35, respectively, when comparing the generated captions against clinical-standard reports. On the public China Physiological Signal Challenge dataset, our method outperforms earlier efforts on classifying sinus rhythm, first-degree atrioventricular block, and premature ventricular contraction (PVC). This study shows that a deep neural network appropriately trained on unstructured free-text physician annotations can generate ECG reports that assist clinicians in evaluating ECGs and reduce cardiologists' clinical workload in interpreting cardiovascular disease.