Kidney stones represent a considerable burden for public health-care systems, with the total health-care expenditure for kidney stones exceeding US $2 billion annually in the USA alone. Ureteroscopy with laser lithotripsy has evolved as the most commonly used technique for the treatment of kidney stones. Automated segmentation of kidney stones and the laser fiber is an important initial step towards any automated quantitative analysis of the stones, particularly stone-size estimation, which the surgeon can use to decide whether a stone requires further fragmentation. Factors such as turbid fluid inside the cavity, specularities, motion blur due to kidney movements and camera motion, bleeding, and stone debris degrade the quality of vision within the kidney and lead to extended operative times. To the best of our knowledge, this is the first attempt at multi-class segmentation in ureteroscopy and laser lithotripsy data. We propose an end-to-end convolutional neural network (CNN) based learning framework for the segmentation of stones and the laser fiber. The proposed approach utilizes two sub-networks: I) HybResUNet, a hybrid version of residual U-Net that uses residual connections in the encoder path of U-Net to improve semantic predictions, and II) DVFNet, which generates deformation vector field (DVF) predictions by leveraging motion differences between adjacent video frames; these predictions are then used to prune the prediction maps. We also present ablation studies that combine different dilated convolutions, recurrent and residual connections, atrous spatial pyramid pooling, and an attention gate model. Further, we propose a compound loss function that significantly boosts the segmentation performance on our data. We also provide an ablation study to determine the optimal data augmentation strategy for our dataset.
Our qualitative and quantitative results illustrate that our proposed method outperforms state-of-the-art methods such as U-Net and DeepLabv3+, showing an improvement of 5.2% and 15.93%, respectively, for the combined mean of DSC and JI on our in vivo test dataset. We also show that our proposed model generalizes better to a new clinical dataset, showing a mean improvement of 25.4%, 20%, and 11% over U-Net, HybResUNet, and DeepLabv3+, respectively, for the same metric.
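For readers unfamiliar with the reported metrics, the following is a minimal sketch of how the Dice similarity coefficient (DSC) and Jaccard index (JI) can be computed per class from a predicted and a reference label map. It is an illustration under stated assumptions, not the evaluation code used in this work: the label encoding (0 = background, 1 = stone, 2 = laser fiber), the function names `dsc_and_ji` and `combined_mean`, and the reading of "combined mean" as the plain average of DSC and JI are all assumptions made here.

```python
def dsc_and_ji(pred, target, label):
    """Return (DSC, JI) for one class over two equally sized 2D label maps.

    pred and target are nested lists of integer class labels.
    """
    tp = fp = fn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == label and t == label:
                tp += 1          # pixel correctly assigned to the class
            elif p == label:
                fp += 1          # predicted as the class, but not in reference
            elif t == label:
                fn += 1          # in reference, but missed by the prediction
    if tp + fp + fn == 0:
        # Class absent from both maps: treat as perfect agreement (a convention).
        return 1.0, 1.0
    dsc = 2 * tp / (2 * tp + fp + fn)
    ji = tp / (tp + fp + fn)
    return dsc, ji


def combined_mean(pred, target, label):
    """Average of DSC and JI for one class (assumed meaning of 'combined mean')."""
    dsc, ji = dsc_and_ji(pred, target, label)
    return (dsc + ji) / 2


# Hypothetical 2x3 label maps: 0 = background, 1 = stone, 2 = laser fiber.
pred = [[0, 1, 1],
        [0, 2, 0]]
target = [[0, 1, 0],
          [0, 2, 2]]

stone_dsc, stone_ji = dsc_and_ji(pred, target, label=1)
fiber_dsc, fiber_ji = dsc_and_ji(pred, target, label=2)
```

Because DSC = 2·JI / (1 + JI), the two metrics always rank a pair of segmentations in the same order; reporting both mainly helps comparison with prior work that uses one or the other.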