“…Also, a vast majority of our sub-systems use an energy-based voice activity detector (VAD) in view of its simplicity and effectiveness. Other VAD options that have been adopted are (i) VQ-VAD [21] in Sys1 and Sys14, (ii) speech/non-speech probabilities inferred from the DNN senone posteriors in Sys9, and (iii) two-channel VAD [22]. […] [14,15,27], there are a handful of our sub-systems (six out of seventeen in Table 3) that have successfully incorporated deep learning in one form or another: (i) deep bottleneck features (DBF) in Sys9, (ii) stacked bottleneck features in Sys11, (iii) DNN posteriors in Sys2, Sys9, Sys10, and Sys16, (iv) splice time-delay DNN (TDNN) [16] in Sys2, and (v) a denoising autoencoder in Sys14. For the bottleneck features in Sys9, we used a DNN with seven hidden layers, each having 1024 hidden units except for the third layer, which has only 80 units.…”
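The energy-based VAD mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the frame length, hop size, and the threshold relative to peak log-energy are all assumed values chosen for the example.

```python
import numpy as np

def energy_vad(signal, sample_rate, frame_ms=25, hop_ms=10, threshold_db=-35.0):
    """Flag frames whose log-energy exceeds a threshold relative to the peak
    frame energy. frame_ms/hop_ms/threshold_db are illustrative defaults."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop_len : i * hop_len + frame_len]
        # Small floor avoids log(0) on all-zero frames.
        energies[i] = np.sum(frame.astype(np.float64) ** 2) + 1e-12
    log_e = 10.0 * np.log10(energies)
    # A frame is "speech" if it is within threshold_db of the loudest frame.
    return log_e > (log_e.max() + threshold_db)
```

Its appeal for the sub-systems above is exactly what the text states: no model to train, one pass over the frame energies, and a single relative threshold to tune.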
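The bottleneck topology described for Sys9 (seven hidden layers of 1024 units, with the third narrowed to 80 units) can be sketched as a forward pass in which the DBF is read off at the narrow layer. This is a structural sketch only: the input dimension, output senone count, sigmoid nonlinearity, and random weights are all assumptions, not details given in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, n_senones = 440, 3000  # hypothetical dimensions, for illustration only

# Seven hidden layers of 1024 units, except the third (bottleneck) with 80 units.
hidden = [1024, 1024, 80, 1024, 1024, 1024, 1024]
dims = [input_dim] + hidden + [n_senones]
weights = [rng.standard_normal((d_in, d_out)) * 0.01
           for d_in, d_out in zip(dims[:-1], dims[1:])]

def forward(x, bottleneck_layer=3):
    """Run the stack; return senone posteriors and the 80-dim bottleneck feature."""
    h, bottleneck = x, None
    for i, w in enumerate(weights, start=1):
        h = h @ w
        if i < len(weights):                    # hidden layers: sigmoid (assumed)
            h = 1.0 / (1.0 + np.exp(-h))
        if i == bottleneck_layer:               # tap the narrow third layer
            bottleneck = h
    e = np.exp(h - h.max())                     # softmax over senones
    return e / e.sum(), bottleneck

probs, bnf = forward(rng.standard_normal(input_dim))
```

Once trained for senone classification, only the tap at the third layer matters for speaker recognition: the 80-dimensional activations replace (or augment) acoustic features in the downstream system.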