Precise segmentation of surgical instruments is a fundamental component of computer-aided surgery systems, helping surgeons navigate the patient's body and thereby enhancing surgical precision and patient safety. Although real-time tracking of surgical instruments is critically important in invasive computer-assisted surgery, achieving a highly sensitive and accurate system in complex surgical environments remains challenging. Recently, the Synthetic data for Instrument Segmentation in Surgery (Syn-ISS) challenge was organized to develop high-performance instrument segmentation methods using synthetic datasets. In this work, we present an encoder-decoder network whose feature extraction stage uses hybrid parallel cross-window attention Transformer blocks, each combining multi-scale channel attention, convolutional layers, and Transformer layers into a unified block. The Syn-ISS challenge comprises two tasks: Task 1 requires a deep learning-based method for binary instrument segmentation, and Task 2 requires multi-class instrument segmentation. Experiments on the Syn-ISS dataset achieved an F-score of 0.993 for Task 1, and F-scores of 0.993, 0.975, and 0.951 for shaft, wrist, and jaw segmentation, respectively, in Task 2.
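To make the idea of a hybrid parallel block concrete, the PyTorch sketch below runs a convolutional branch and a window self-attention branch in parallel and fuses them with a multi-scale channel attention gate. This is a minimal illustration under stated assumptions, not the exact implementation used in our network; the module names (HybridParallelBlock, MultiScaleChannelAttention, WindowAttention), the window size, and the fusion rule are illustrative placeholders.

    # Minimal sketch of a hybrid parallel conv + window-attention block
    # fused by multi-scale channel attention (illustrative, not the paper's exact code).
    import torch
    import torch.nn as nn


    class MultiScaleChannelAttention(nn.Module):
        """Channel attention from statistics at two scales (pixel-wise and global)."""

        def __init__(self, channels: int, reduction: int = 4):
            super().__init__()
            self.local = nn.Sequential(          # point-wise (no pooling) statistics
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
            )
            self.glob = nn.Sequential(           # globally pooled statistics
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return torch.sigmoid(self.local(x) + self.glob(x))   # per-channel gate


    class WindowAttention(nn.Module):
        """Multi-head self-attention applied inside non-overlapping windows."""

        def __init__(self, channels: int, window_size: int = 8, heads: int = 4):
            super().__init__()
            self.ws = window_size
            self.norm = nn.LayerNorm(channels)
            self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, c, h, w = x.shape
            ws = self.ws
            # partition into (ws x ws) windows -> tokens (num_windows*b, ws*ws, c)
            t = x.view(b, c, h // ws, ws, w // ws, ws)
            t = t.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, c)
            t = self.norm(t)
            t, _ = self.attn(t, t, t)
            # reverse the window partition back to (b, c, h, w)
            t = t.reshape(b, h // ws, w // ws, ws, ws, c).permute(0, 5, 1, 3, 2, 4)
            return t.reshape(b, c, h, w)


    class HybridParallelBlock(nn.Module):
        """Parallel conv and window-attention branches fused by channel attention."""

        def __init__(self, channels: int, window_size: int = 8):
            super().__init__()
            self.conv_branch = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            self.attn_branch = WindowAttention(channels, window_size)
            self.fuse = MultiScaleChannelAttention(channels)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            c_feat = self.conv_branch(x)        # local features
            a_feat = self.attn_branch(x)        # windowed global context
            gate = self.fuse(c_feat + a_feat)   # channel-wise fusion weights
            return x + gate * c_feat + (1 - gate) * a_feat   # residual fusion


    if __name__ == "__main__":
        block = HybridParallelBlock(channels=32, window_size=8)
        out = block(torch.randn(1, 32, 64, 64))   # H and W must be divisible by window_size
        print(out.shape)                          # torch.Size([1, 32, 64, 64])

In this sketch the convolutional branch captures local texture while the windowed attention branch captures longer-range context at moderate cost, and the channel attention gate decides, per channel, how to blend the two; any such block can be stacked inside both the encoder and the decoder.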