“…Therefore, as in other computer vision tasks [81], deeper CNN architectures usually achieve better performance. A number of CNN architectures originally proposed for typical computer vision tasks have also proven successful in gaze estimation, e.g., LeNet [26], AlexNet [43], VGG [42], ResNet18 [36], and ResNet50 [82]. In addition, some well-designed modules help improve estimation accuracy [46], [49], [83], [84]; e.g., Chen et al. propose using dilated convolution to extract features from eye images [46], and Cheng et al. propose an attention module for fusing the features of the two eyes [49].…”
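
To make the dilated convolution mentioned above concrete, the following is a minimal sketch of the generic operation in plain Python. It is not the architecture of Chen et al. [46]; the function names `dilated_conv2d` and `effective_kernel_size`, the nested-list image representation, and the no-padding, stride-1 setting are assumptions made purely for illustration. The sketch shows that a k x k kernel with dilation d covers a window of d(k - 1) + 1 pixels per side while keeping the same number of weights.

```python
def effective_kernel_size(k, dilation):
    """Side length of the window covered by a k-tap kernel with the given dilation."""
    return dilation * (k - 1) + 1


def dilated_conv2d(image, kernel, dilation=1):
    """Valid-mode 2D cross-correlation with a dilated kernel.

    Illustrative assumptions: `image` and `kernel` are nested lists of numbers,
    there is no padding, and the stride is 1.
    """
    kh, kw = len(kernel), len(kernel[0])
    # Window covered by the kernel once (dilation - 1) gaps are inserted between taps.
    eff_h = effective_kernel_size(kh, dilation)
    eff_w = effective_kernel_size(kw, dilation)
    out_h = len(image) - eff_h + 1
    out_w = len(image[0]) - eff_w + 1
    if out_h <= 0 or out_w <= 0:
        raise ValueError("image is smaller than the dilated kernel")

    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            acc = 0.0
            for u in range(kh):
                for v in range(kw):
                    # Kernel taps are sampled `dilation` pixels apart in the input.
                    acc += kernel[u][v] * image[i + u * dilation][j + v * dilation]
            out[i][j] = acc
    return out


if __name__ == "__main__":
    # Toy 5x5 "image" whose pixel value is its row-major index.
    image = [[r * 5 + c for c in range(5)] for r in range(5)]
    kernel = [[1, 1, 1] for _ in range(3)]
    print(dilated_conv2d(image, kernel, dilation=1))  # 3x3 output
    print(dilated_conv2d(image, kernel, dilation=2))  # 1x1 output, 5x5 window
```

With dilation 2, the same nine weights span the whole 5 x 5 toy input, which is the property that makes dilation useful for enlarging the receptive field on small eye images without adding parameters or reducing resolution through pooling.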