“…2) Image Encoder: The image encoder computes an image embedding from the input condition face image. The architecture follows the original implementation without any modification [21]. It contains six layers of 2-D convolutional layers with the following number of filters, kernel sizes, and down-sampling factors: (64, 3, 2), (128, 3, 2), (256, 3, 2), (512, 3, 2), (512, 3, 2), (512, 4, 1), respectively.…”