“…Different types of transfer learning architectures have been explored in speech-based emotion recognition, including statistical methods (Deng et al., 2013, 2014a; Abdelwahab and Busso, 2015; Song et al., 2015; Sagha et al., 2016; Zong et al., 2016; Song, 2017), adversarial or generative networks (Chang and Scherer, 2017; Abdelwahab and Busso, 2018; Gideon et al., 2019; Latif et al., 2019), and other neural network structures (Mao et al., 2016; Deng et al., 2017; Gideon et al., 2017; Li and Chaspari, 2019; Neumann and Vu, 2019; Zhou and Chen, 2019). A commonly used input to these approaches is the feature set proposed by the INTERSPEECH emotion challenge and INTERSPEECH paralinguistic challenges (Schuller et al., 2009b), which typically contains the first 12 Mel-frequency cepstral coefficients, root-mean-square energy, zero-crossing rate, voicing probability, and fundamental frequency (Deng et al., 2014b, 2017; Mao et al., 2016; Sagha et al., 2016; Zhang et al., 2016; Zong et al., 2016; Song, 2017; Abdelwahab and Busso, 2018; Li and Chaspari, 2019; Zhao et al., 2019).…”
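Two of the low-level descriptors named above, root-mean-square energy and zero-crossing rate, can be computed directly from framed audio. The following is a minimal NumPy sketch under assumed framing parameters (25 ms frames, 10 ms hop at 16 kHz); the function names and parameter values are illustrative and not taken from the cited challenge toolkits, which additionally extract MFCCs, voicing probability, and fundamental frequency.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    # Slice the waveform into overlapping frames
    # (assumed 25 ms frames with a 10 ms hop at 16 kHz).
    n = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def rms_energy(frames):
    # Root-mean-square energy per frame.
    return np.sqrt(np.mean(frames ** 2, axis=1))

def zero_crossing_rate(frames):
    # Fraction of consecutive sample pairs whose sign changes, per frame.
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

# Illustrative input: 1 s of a 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)
frames = frame_signal(x)
energy = rms_energy(frames)
zcr = zero_crossing_rate(frames)
```

For a pure 440 Hz sine, each 25 ms frame contains about 22 zero crossings, so the per-frame rate is roughly 0.055, and the RMS energy is close to 1/√2.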