“…The theoretical background is provided by articulatory-to-acoustic mapping (AAM), in which articulatory data are recorded while the subject is speaking and machine learning methods, typically deep neural networks (DNNs), are applied to predict the speech signal from the articulatory input. Articulatory acquisition devices include ultrasound tongue imaging (UTI) [4,5,6,7,8], magnetic resonance imaging (MRI) [9], electromagnetic articulography (EMA) [10,11,12], permanent magnetic articulography (PMA) [13,14,15], surface electromyography (sEMG) [16,17,18], electro-optical stomatography (EOS) [19], lip videos [20,21], and multimodal combinations of the above [22].…”
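
To make the mapping idea concrete, the following is a minimal sketch of AAM as a frame-wise regression problem, not the method of any cited work. It assumes that each articulatory frame is a small feature vector and that the target is an acoustic feature vector (for example, spectral parameters from which a vocoder could synthesise speech). The data are synthetic, and all names and sizes (`n_artic`, `n_acoustic`, `init_mlp`, `train_step`) are hypothetical. Published systems use recorded articulatory data and typically deeper architectures.

```python
import numpy as np

def init_mlp(n_in, n_hidden, n_out, rng):
    """Initialise a one-hidden-layer regression network (sizes are hypothetical)."""
    return {
        "W1": rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, np.sqrt(1.0 / n_hidden), (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def forward(params, x):
    """Map articulatory frames x (N, n_in) to acoustic frames (N, n_out)."""
    h = np.maximum(0.0, x @ params["W1"] + params["b1"])  # ReLU hidden layer
    return h @ params["W2"] + params["b2"], h

def train_step(params, x, y, lr):
    """One full-batch gradient-descent step on mean squared error.

    Returns the loss measured before the parameter update.
    """
    y_hat, h = forward(params, x)
    err = y_hat - y
    loss = float(np.mean(err ** 2))
    g_out = 2.0 * err / err.size                # gradient of the loss w.r.t. y_hat
    g_h = (g_out @ params["W2"].T) * (h > 0)    # backpropagate through the ReLU
    params["W2"] -= lr * (h.T @ g_out)
    params["b2"] -= lr * g_out.sum(axis=0)
    params["W1"] -= lr * (x.T @ g_h)
    params["b1"] -= lr * g_h.sum(axis=0)
    return loss

# Synthetic stand-in for parallel articulatory/acoustic recordings (assumption).
rng = np.random.default_rng(0)
n_frames, n_artic, n_acoustic = 512, 12, 25
artic = rng.normal(size=(n_frames, n_artic))
true_map = rng.normal(size=(n_artic, n_acoustic))
acoustic = np.tanh(artic @ true_map / np.sqrt(n_artic))

params = init_mlp(n_artic, 64, n_acoustic, rng)
losses = [train_step(params, artic, acoustic, lr=0.1) for _ in range(200)]
predicted, _ = forward(params, artic)
```

In this sketch the network sees one frame at a time. Image-like inputs such as UTI or lip video would more plausibly call for convolutional layers, and temporal context across frames is ignored here for brevity.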