Deep learning based on the Convolutional Neural Network (CNN) has shown promising results in various vision-based applications, recently also in camera-based vital signs monitoring. CNN-based Photoplethysmography (PPG) extraction has, so far, focused on performance rather than understanding. In this paper, we try to answer four questions with experiments aimed at improving our understanding of this methodology as it gains popularity. We conclude that the network exploits the blood absorption variation to extract the physiological signals, and that the choice and parameters (phase, spectral content, etc.) of the reference signal may be more critical than anticipated. The availability of multiple convolutional kernels is necessary for the CNN to arrive at a flexible channel combination through the spatial operation, but may not provide the same motion-robustness as a multi-site measurement using knowledge-based PPG extraction. Finally, we conclude that PPG-related prior knowledge is still helpful for CNN-based PPG extraction. Consequently, we recommend further investigation of hybrid CNN-based methods that include prior knowledge in their design.
Introduction

Remote Photoplethysmography (remote-PPG) is a contactless way to measure human cardiovascular activity by measuring the reflection variations of the skin registered by a video camera [1]. Over the last decade, various remote-PPG methods [2-7] have been proposed for PPG-signal extraction. The methods differ in their choice of assumptions [2-5] and use of handcrafted features (e.g. the projected color features of CHROM [4] and POS [5]), and these choices affect their robustness with respect to illumination variations and subject motion.
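To make the notion of handcrafted projected color features concrete, the following is a minimal sketch of a POS-style pulse extraction in the spirit of [5]: mean RGB traces are temporally normalized per sliding window, projected onto two axes orthogonal to the skin tone, alpha-tuned, and overlap-added. The function name, window length, and the synthetic input are our own illustrative choices, not code from the cited work.

```python
import numpy as np

def pos_pulse(rgb, win_len=48):
    """Sketch of a POS-style pulse extraction from a (T, 3) array of
    spatially averaged skin-RGB values (illustrative, not reference code)."""
    n = rgb.shape[0]
    pulse = np.zeros(n)
    for t in range(n - win_len + 1):
        c = rgb[t:t + win_len]                    # one sliding window
        cn = c / c.mean(axis=0)                   # temporal normalization
        s1 = cn[:, 1] - cn[:, 2]                  # projection axis 1: G - B
        s2 = cn[:, 1] + cn[:, 2] - 2 * cn[:, 0]   # projection axis 2: G + B - 2R
        h = s1 + (s1.std() / (s2.std() + 1e-9)) * s2  # alpha tuning
        pulse[t:t + win_len] += h - h.mean()      # overlap-add
    return pulse

# Synthetic check: RGB traces modulated by a 1.2 Hz "pulse"
t = np.arange(300) / 30.0
p = np.sin(2 * np.pi * 1.2 * t)
rgb = np.stack([1 + 0.005 * p, 1 + 0.010 * p, 1 + 0.003 * p], axis=1)
out = pos_pulse(rgb)
```

The recovered signal should correlate strongly with the injected pulse, illustrating how a fixed, knowledge-based projection replaces any learned feature extraction.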
Related work

Recently, the success of deep Convolutional Neural Network (CNN) methods that automatically learn relevant features from images/videos in various applications has inspired researchers to attempt CNN-based remote-PPG extraction [8][9][10][11][12]. Chen and McDuff [8] proposed a convolutional attention network consisting of two parallel models to extract the PPG signal from a video. The first model is a classical "appearance model" [13] that learns to find the skin region-of-interest (RoI), while the second parallel path, fed with DC-normalized frame-differences from the RoI, learns to extract the PPG signal, using a finger oximeter-derived signal as a reference. In [8], the second model is referred to as a "motion model", but we prefer the term "normalized frame difference model", since our work will show that it exploits the blood absorption variation rather than the skin motion as suggested by [8]. SynRhythm [9] is a general-to-specific transfer learning method; the authors directly convert the spatial-temporal features into heart rate based on the pre-trained network [14]. HR-CNN [10] consists of an extractor CNN and an HR-estimator CNN with different loss functions to predict the heart rate, rather than the PPG signal. PhysNet [11] is a 3D CNN which learns the temporal and spatial context features of f...
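The DC-normalized frame-difference input described above can be sketched as follows. This is our illustrative reading of the preprocessing in [8], where consecutive frames C(t) and C(t+1) are combined as (C(t+1) - C(t)) / (C(t+1) + C(t)); the function name, epsilon, and the standardization step are assumptions, not the authors' code.

```python
import numpy as np

def normalized_frame_diff(frames, eps=1e-8):
    """Sketch of DC-normalized frame differences for a (T, H, W, C) clip:
    d(t) = (C(t+1) - C(t)) / (C(t+1) + C(t)), then standardized.
    Illustrative only; details may differ from the cited implementation."""
    f0, f1 = frames[:-1].astype(np.float64), frames[1:].astype(np.float64)
    d = (f1 - f0) / (f1 + f0 + eps)   # removes the DC level per pixel
    return d / (d.std() + eps)        # unit-variance input for the CNN

# Toy clip: a flat gray video with a small temporal intensity modulation
clip = 0.5 + 0.01 * np.sin(np.arange(10))[:, None, None, None] * np.ones((10, 4, 4, 3))
diffs = normalized_frame_diff(clip)
```

Because the division cancels the shared DC reflection level, the result is dominated by relative intensity changes, which is consistent with our argument that such a network responds to blood absorption variation rather than motion per se.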