Abstract. The use of visual information as part of audio speech processing has attracted significant recent interest. This paper presents a data-driven approach that estimates audio speech acoustics from temporal visual information alone, without relying on linguistic features such as phonemes and visemes. Audio (log filterbank) and visual (2D-DCT) features are extracted, and various MLP configurations and datasets are evaluated to identify the optimal setup, showing that, given a sequence of prior visual frames, a reasonably accurate estimate of the corresponding audio frame can be produced.
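As a minimal sketch of the pipeline the abstract describes, the snippet below maps a window of prior visual frames (2D-DCT features of the mouth region) to one audio frame (log filterbank energies) with a multilayer perceptron. The frame counts, window length, number of retained DCT coefficients, filterbank size, hidden-layer shape, and the placeholder data are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.fft import dctn
from sklearn.neural_network import MLPRegressor

N_DCT = 50    # low-order 2D-DCT coefficients kept per mouth image (assumed)
N_FBANK = 23  # log filterbank channels per audio frame (assumed)
WINDOW = 5    # prior visual frames used to estimate one audio frame (assumed)

def visual_features(mouth_image: np.ndarray) -> np.ndarray:
    """2D-DCT of a grayscale mouth region; keep the low-frequency block."""
    coeffs = dctn(mouth_image, norm="ortho")
    k = int(np.ceil(np.sqrt(N_DCT)))
    return coeffs[:k, :k].flatten()[:N_DCT]

def make_pairs(visual_seq: np.ndarray, audio_seq: np.ndarray):
    """Stack WINDOW consecutive visual feature frames as the input for the next audio frame."""
    X, y = [], []
    for t in range(WINDOW, len(audio_seq)):
        X.append(visual_seq[t - WINDOW:t].flatten())
        y.append(audio_seq[t])
    return np.array(X), np.array(y)

# visual_seq: (T, N_DCT) per-frame visual features; audio_seq: (T, N_FBANK) log filterbank frames.
# Random placeholder data stands in for real extracted features.
rng = np.random.default_rng(0)
visual_seq = rng.standard_normal((200, N_DCT))
audio_seq = rng.standard_normal((200, N_FBANK))

X, y = make_pairs(visual_seq, audio_seq)
mlp = MLPRegressor(hidden_layer_sizes=(256,), max_iter=500).fit(X, y)
estimated_frame = mlp.predict(X[:1])  # estimated log filterbank energies for one audio frame
```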
Today's state-of-the-art hearing instruments (HIs) adapt their sound processing only according to the user's acoustic surroundings. Acoustic ambiguities limit the set of daily-life situations in which HIs can support the user adequately. State-of-the-art HIs feature body area networking capabilities, so body-worn sensors could be used to recognize complex user contexts and enhance next-generation HIs. In this work, we identify, in a rich real-world data set, the mapping between the user's context (which can be recognized from body-worn sensors) and the user's current hearing wish. This is the foundation for implementing recognition systems for the specific cues in next-generation HIs based on on-body sensor data. We discuss how the identified mapping allows selecting a priori distributions for hearing wishes and HI parameters such as the switching sensitivity. We conclude by deducing the sensory requirements for realizing the next generation of networked HIs.
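As a hedged illustration of the idea (not from the paper), one way a recognized user context could select an a priori distribution over hearing wishes together with a switching-sensitivity parameter is a simple lookup table. The context labels, hearing-wish categories, probabilities, and sensitivity values below are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class HearingPrior:
    wish_prior: Dict[str, float]   # prior probability of each hearing wish (hypothetical categories)
    switching_sensitivity: float   # how readily the HI switches programs, 0..1 (hypothetical)

# Hypothetical mapping from recognized contexts to priors and HI parameters.
CONTEXT_PRIORS = {
    "conversation_in_noise": HearingPrior({"speech_focus": 0.7, "ambient": 0.2, "music": 0.1}, 0.3),
    "walking_outdoors":      HearingPrior({"ambient": 0.6, "speech_focus": 0.3, "music": 0.1}, 0.6),
    "listening_to_music":    HearingPrior({"music": 0.8, "ambient": 0.15, "speech_focus": 0.05}, 0.2),
}

def configure_hi(recognized_context: str) -> HearingPrior:
    """Look up the a priori hearing-wish distribution and switching sensitivity for a context."""
    return CONTEXT_PRIORS.get(recognized_context, HearingPrior({"ambient": 1.0}, 0.5))

print(configure_hi("conversation_in_noise"))
```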