The continuous monitoring and recording of food intake without user intervention is highly useful for preventing obesity and metabolic diseases. We adopted a technique that automatically recognizes food intake amount by combining image-based identification of food types with acoustic recognition of chewing events. However, the accuracy of audio-based detection of eating activity degrades severely in noisy environments. To alleviate this problem, contact sensing methods have conventionally been adopted, in which sensors are attached to the face or neck to reduce external noise; such methods, however, cause dermatological discomfort and a sense of cosmetic unnaturalness for most users. In this study, a noise-robust, non-contact sensing method was employed in which ultrasonic Doppler shifts were used to detect chewing events. Experiments on 30 food items showed that the mean absolute percentage error (MAPE) of the ultrasonic method was comparable with that of the audio-based method (15.3% vs. 14.6%). Food intake amounts were also estimated for eight subjects in several noisy environments (cafeterias, restaurants, and home dining rooms); for all subjects, the estimation accuracy of the ultrasonic method was not degraded (average MAPE of 15.02%) even under these conditions. These results indicate that the proposed method has the potential to replace manual food logging.
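As a rough illustration of the processing and evaluation described in this abstract, the following Python sketch counts chewing events from a demodulated ultrasonic Doppler signal and computes the MAPE between actual and estimated intake amounts. This is a minimal sketch, not the authors' implementation: the chewing-rate band (0.5-3 Hz), the peak-detection thresholds, and all function names are assumptions made here for illustration.

import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, find_peaks

def count_chewing_events(baseband, fs):
    """Count chew-like peaks in a demodulated ultrasonic Doppler signal.

    baseband: Doppler signal already mixed down to baseband (assumption).
    fs: sampling rate in Hz.
    """
    # Keep only the jaw-motion band; 0.5-3 Hz is an assumed chewing-rate range.
    sos = butter(4, [0.5, 3.0], btype="bandpass", fs=fs, output="sos")
    motion = sosfiltfilt(sos, baseband)
    # Envelope of the band-passed motion signal.
    env = np.abs(hilbert(motion))
    # One peak per chew; these thresholds are illustrative, not tuned values.
    peaks, _ = find_peaks(env, height=env.mean() + env.std(),
                          distance=int(0.3 * fs))
    return len(peaks)

def mape(actual, estimated):
    """Mean absolute percentage error between actual and estimated intake amounts."""
    actual = np.asarray(actual, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return 100.0 * np.mean(np.abs(actual - estimated) / actual)

For example, mape([120, 80, 200], [105, 90, 185]) returns roughly 10.8, i.e., an error on the same percentage scale as the MAPEs reported above.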
Moderate performance in terms of intelligibility and naturalness can be obtained using previously established silent speech interface (SSI) methods. Nevertheless, a common problem with SSIs is deficient estimation of spectral detail, which yields synthesized speech that sounds rough, harsh, and unclear. In this study, harmonic enhancement (HE) was applied during postprocessing to alleviate this problem by emphasizing the spectral fine structure of the speech signal. To improve the subjective quality of the synthesized speech, the difference between synthesized and actual speech was measured as a distance in perceptual domains rather than by the conventional mean square error (MSE). Two deep neural networks (DNNs), connected in cascade, were employed to separately estimate the speech spectra and the HE filter coefficients, and were trained to incrementally and iteratively minimize both the MSE and the perceptual distance (PD). In a feasibility test, the perceptual evaluation of speech quality (PESQ) and the short-time objective intelligibility (STOI) scores improved by 17.8% and 2.9%, respectively, over previous methods. Subjective listening tests confirmed that the proposed method was perceptually preferred over the conventional MSE-based method.
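The combined training objective described above can be illustrated with a short Python sketch that mixes a spectral MSE term with a mel-domain log-spectral distance standing in for the perceptual distance. The abstract does not specify the actual PD or its weighting, so the mel-filterbank proxy, the weight alpha, and all names below are assumptions for illustration.

import numpy as np

def combined_loss(pred_spec, target_spec, mel_fb, alpha=0.5, eps=1e-8):
    """Weighted sum of spectral MSE and a mel-domain log-spectral distance.

    pred_spec, target_spec: magnitude spectra, shape (n_freq, n_frames).
    mel_fb: mel filterbank matrix, shape (n_mels, n_freq) (assumption).
    alpha: MSE/PD trade-off weight; the value here is illustrative.
    """
    # Conventional spectral MSE term.
    mse = np.mean((pred_spec - target_spec) ** 2)
    # Mel-domain log-spectral distance, a stand-in for the paper's PD.
    pd = np.mean(
        (np.log(mel_fb @ pred_spec + eps) - np.log(mel_fb @ target_spec + eps)) ** 2
    )
    return (1.0 - alpha) * mse + alpha * pd

In the paper's scheme, a criterion of this kind would be minimized incrementally and iteratively by the two cascaded DNNs; here it is shown only as a standalone loss computation.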