Indian Sign Language (ISL) serves as a vital means of communication for the hearing-impaired community in India, and accurate recognition of ISL gestures through computer vision is of paramount importance for enhancing accessibility and inclusivity. This research therefore focuses on translating sign language gestures used by the hearing-impaired community into formats understandable by the general population, in order to bridge the communication gap between the two communities. This requires a continuous sign language recognition module, which is challenging to design because the grammar of sign language differs from that of spoken language: continuous ISL must first be converted into glosses, and these glosses are then used to generate the spoken language. The literature shows that sign language translators have been built for American, Chinese, and Argentinian Sign Languages, but very little work has been done on ISL, and many existing ISL translators are built either on static data or on a very small number of video gestures [20]. In this work, we propose a combinational network that converts ISL directly to speech over 76 video gestures. The proposed network uses pre-trained architectures such as ResNet18, ResNet50, GoogLeNet, and InceptionV3 to efficiently extract spatial features from video frames; these features are then processed by a two-layer bidirectional Long Short-Term Memory (BiLSTM) network to model the temporal dependencies between the frames of a gesture. BiLSTM models are used because, compared to conventional RNNs, they better capture the longer temporal dependencies across frames. To validate the proposed approach, a standard balanced database of 76 gestures covering letters, words, and phrases, with each gesture enacted five times by each of 10 individuals, was created in the Anechoic Chamber lab at JNTUHCEH using a Sony HXR-NX100 camera sponsored under a UGC-MRP grant. We explored various combinations of pre-trained networks and BiLSTM layers on this database to strike a balance between computational resources and accuracy, achieving high gesture-classification accuracy while minimizing training time and memory usage. GoogLeNet combined with the BiLSTM gave the best results, with an average test accuracy of 94.21%, compared to the other combinational networks.
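To make the combinational architecture concrete, the following is a minimal PyTorch sketch, not the authors' released implementation: a pre-trained GoogLeNet backbone extracts a 1024-dimensional feature vector per frame, and a two-layer BiLSTM models the temporal dependencies across the frames of a clip. The hidden size, frame count, and input resolution are illustrative assumptions; only the 76-class output follows from the gesture vocabulary described above.

```python
import torch
import torch.nn as nn
from torchvision import models

class CNNBiLSTM(nn.Module):
    """Pre-trained CNN per frame + two-layer BiLSTM over the frame sequence."""

    def __init__(self, num_classes=76, hidden_size=256):  # hidden_size is an assumption
        super().__init__()
        # Pre-trained GoogLeNet; replacing the classifier head with Identity
        # makes the backbone emit its 1024-d pooled feature for each frame.
        backbone = models.googlenet(weights=models.GoogLeNet_Weights.DEFAULT)
        backbone.fc = nn.Identity()
        self.backbone = backbone
        # Two stacked bidirectional LSTM layers over the per-frame features.
        self.bilstm = nn.LSTM(
            input_size=1024,
            hidden_size=hidden_size,
            num_layers=2,
            batch_first=True,
            bidirectional=True,
        )
        # 2 * hidden_size: forward and backward states are concatenated.
        self.classifier = nn.Linear(2 * hidden_size, num_classes)

    def forward(self, clips):
        # clips: (batch, frames, channels, height, width)
        b, t, c, h, w = clips.shape
        feats = self.backbone(clips.view(b * t, c, h, w))  # (b*t, 1024)
        feats = feats.view(b, t, -1)                       # (b, t, 1024)
        out, _ = self.bilstm(feats)                        # (b, t, 2*hidden)
        # Classify from the final time step's concatenated features.
        return self.classifier(out[:, -1, :])

model = CNNBiLSTM()
dummy = torch.randn(2, 16, 3, 224, 224)  # 2 clips of 16 RGB frames each
logits = model(dummy)                    # shape: (2, 76)
```

In practice the backbone would typically be frozen for the first epochs and fine-tuned afterwards, and mean-pooling the BiLSTM outputs over time is a common alternative to taking only the last step; either choice fits the combinational design described above.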