Emerging fields such as Augmented Reality (AR) and Virtual Reality (VR), e-commerce platforms such as Amazon and eBay, and social media sites make extensive use of images containing text to create immersive experiences, sell products, and share posts. Generating captions from these images has broad potential applications, including descriptions of medical images, text-based image retrieval, information access for visually impaired users, and human-robot interaction. Existing image captioning systems fail to recognize the text present in an image, losing contextual information such as location details, poster contents, and product specifications. The objective of our proposed system is to generate a brief, semantically meaningful description for an image containing text, using CLIP-GPT2 for caption generation and OCR techniques for text extraction, together with the sentence fusion model BART to combine the two. By capturing the text within images during caption generation, our system outperforms existing models, achieving a ROUGE score of 81.66. It thus communicates the contextual information of an image to a visually impaired user as an informative textual description.
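The three-stage pipeline described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: every function body is a hypothetical stand-in for the real component (a CLIP-GPT2 captioner, an OCR engine, and BART-based sentence fusion), and the example strings are invented for demonstration.

```python
# Sketch of the proposed pipeline: visual captioning + OCR + sentence fusion.
# All bodies are hypothetical stand-ins, not the actual models.

def generate_caption(image):
    # Stand-in for a CLIP-GPT2 captioner: CLIP encodes the image,
    # GPT-2 decodes a natural-language caption from the embedding.
    return "a poster on a wall"

def extract_text(image):
    # Stand-in for an OCR engine reading the scene text in the image.
    return "Concert tonight at 8 PM"

def fuse_sentences(caption, ocr_text):
    # Stand-in for BART-based sentence fusion, which merges the visual
    # caption and the recognized text into one fluent description.
    return f"{caption} that reads '{ocr_text}'"

def describe_image(image):
    caption = generate_caption(image)
    ocr_text = extract_text(image)
    return fuse_sentences(caption, ocr_text)

print(describe_image(None))
# e.g. "a poster on a wall that reads 'Concert tonight at 8 PM'"
```

The key design point is that the OCR output is not appended verbatim; the fusion model rewrites the caption and the recognized text into a single coherent sentence, which is what preserves the contextual information that plain captioners lose.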