“…Inspired by the CNN proposed for frame-level melody F0 estimation [3], the frame-level CNN of the acoustic model (Fig. 5) was designed to have six convolution layers with 128, 64, 64, 64, 8, and 1 output channels and kernel sizes of (5, 5), (5, 5), (3, 3), (3, 3), (70, 3), and (1, 1), respectively, where instance normalization [31] and the Mish function [32] are used. The output dimension of the tatum-level BLSTM was set to D = 130 × 2.…”
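The layer configuration quoted above can be written out as a small table in code, which makes the channel chaining and the relative size of each layer easier to check. The following is only a minimal sketch, not the authors' implementation: the names `ConvSpec`, `FRAME_CNN`, `conv_param_counts`, and `ASSUMED_IN_CHANNELS` are hypothetical, the single input channel is an assumption (the excerpt does not state it), and the counts cover convolution weights and biases only, since the excerpt does not say whether the instance normalization layers carry learnable parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvSpec:
    """One convolution layer as listed in the excerpt."""
    out_channels: int
    kernel_size: tuple  # the two kernel dimensions, in the order given in the text


# The six layers of the frame-level CNN, in the order given in the excerpt.
FRAME_CNN = [
    ConvSpec(128, (5, 5)),
    ConvSpec(64, (5, 5)),
    ConvSpec(64, (3, 3)),
    ConvSpec(64, (3, 3)),
    ConvSpec(8, (70, 3)),
    ConvSpec(1, (1, 1)),
]

# Assumption: the excerpt does not state the number of input channels.
ASSUMED_IN_CHANNELS = 1

# D = 130 x 2 from the excerpt. Reading this as 130 units per direction
# of the bidirectional LSTM is an interpretation, not stated in the text.
BLSTM_OUTPUT_DIM = 130 * 2


def conv_param_counts(layers, in_channels):
    """Return weights-plus-biases counts per layer, chaining channels layer to layer."""
    counts = []
    for spec in layers:
        k0, k1 = spec.kernel_size
        weights = in_channels * spec.out_channels * k0 * k1
        biases = spec.out_channels
        counts.append(weights + biases)
        in_channels = spec.out_channels  # this layer's output feeds the next layer
    return counts


if __name__ == "__main__":
    counts = conv_param_counts(FRAME_CNN, ASSUMED_IN_CHANNELS)
    for spec, n in zip(FRAME_CNN, counts):
        print(f"out={spec.out_channels:3d} kernel={spec.kernel_size} params={n}")
    print("total convolution parameters:", sum(counts))
    print("BLSTM output dimension D:", BLSTM_OUTPUT_DIM)
```

Under these assumptions, the second (5, 5) layer and the (70, 3) layer account for most of the convolution parameters, while the final (1, 1) layer only mixes the 8 remaining channels into a single output map.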