“…We start from the 4 kernels with widths [5,11,25,51] similar to that of the sentence parsing CNN architecture of Trang et al [31], which roughly cover sub-phone, phone, syllable and word, and possibly some context. Given the fixed narrow kernel width of 3 frames used in [20,22], we add this to our candidates for testing. From the different combinations presented in Table 3, we observe that the syllable and word width kernel sizes (25,51) helps the performance while including other widths does not change it.…”