“…Until recently, most driver observation frameworks comprised a manual feature extraction step followed by a classification module (for a thorough overview, see [21]). The constructed feature vectors are typically derived from hand- and body pose [2], [3], [6], [7], [38], [39], facial expressions and eye-based input [40], [41], and head pose [42], [43], but foot dynamics [44], detected objects [6], and physiological signals [45] have also been considered. The classification approaches are largely the same as those used in standard video classification, with LSTMs [3], [4], SVMs [2], [46], random forests [47], HMMs [4], and graph neural networks [7], [48] being popular choices.…”
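To make the two-stage structure concrete, the following is a minimal, self-contained sketch of such a pipeline: a hand-crafted feature extractor over pose input, followed by a separate classification module. The specific features (mean wrist position and per-axis motion spread) and the nearest-centroid classifier are illustrative stand-ins chosen for brevity, not the methods of any of the cited works.

```python
# Sketch of the classical two-stage pipeline: hand-crafted feature
# extraction followed by a separate classification module.
# Features and classifier are illustrative, not from any cited work.
from statistics import mean, stdev


def extract_features(pose_sequence):
    """Hand-crafted features from a sequence of (x, y) wrist positions:
    mean position plus per-axis motion spread (standard deviation)."""
    xs = [p[0] for p in pose_sequence]
    ys = [p[1] for p in pose_sequence]
    return [mean(xs), mean(ys), stdev(xs), stdev(ys)]


class NearestCentroidClassifier:
    """Toy classification module: predicts the label of the closest
    class centroid in feature space."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, lbl in zip(X, y) if lbl == label]
            # Centroid = per-dimension mean of the training features.
            self.centroids[label] = [mean(col) for col in zip(*rows)]
        return self

    def predict(self, x):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(self.centroids, key=lambda lbl: sq_dist(self.centroids[lbl]))


# Synthetic wrist trajectories: "steering" stays centred on the wheel,
# "phone" drifts towards the top-right of the frame.
steering = [[(0.0 + i * 0.01, 0.5) for i in range(10)] for _ in range(5)]
phone = [[(0.8 + i * 0.01, 0.2) for i in range(10)] for _ in range(5)]
X = [extract_features(seq) for seq in steering + phone]
y = ["steering"] * 5 + ["phone"] * 5

clf = NearestCentroidClassifier().fit(X, y)
print(clf.predict(extract_features([(0.82, 0.21)] * 10)))
```

Replacing the centroid classifier with an SVM, random forest, or recurrent model, and the summary statistics with richer pose, gaze, or physiological descriptors, recovers the variants surveyed above; the defining property of this family is that the two stages are designed and trained separately.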