“…For highlight detection, some researchers propose to represent the emotions in a video as a curve on the arousal-valence plane, using low-level features such as motion, vocal effects, shot length, and audio pitch (Hanjalic & Xu, 2005) or color (Ngo, Ma, & Zhang, 2005), as well as mid-level features such as laughter and subtitles (M. Xu, Luo, Jin, & Park, 2009). Nevertheless, owing to the semantic gap between low-level features and high-level semantics, the accuracy of highlight detection based on video processing alone is limited (K.-S. Lin et al., 2013).…”
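The curve-based idea above can be illustrated with a minimal sketch: per-shot low-level features are normalized, combined into a single arousal score, smoothed over time, and the highest-arousal shots become candidate highlights. This is not the cited authors' actual method; the feature names, weights, and smoothing window here are illustrative assumptions.

```python
# Hypothetical sketch in the spirit of arousal-curve highlight detection.
# Inputs are per-shot feature sequences (e.g. motion magnitude, audio
# energy, pitch); weights and window size are assumed, not from the paper.

def arousal_curve(motion, audio_energy, pitch,
                  weights=(0.5, 0.3, 0.2), window=3):
    """Weighted sum of min-max-normalized features, then a moving-average smooth."""
    def normalize(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

    m = normalize(motion)
    e = normalize(audio_energy)
    p = normalize(pitch)
    raw = [weights[0] * a + weights[1] * b + weights[2] * c
           for a, b, c in zip(m, e, p)]

    # Moving average over shot index approximates a smooth arousal curve.
    half = window // 2
    curve = []
    for i in range(len(raw)):
        seg = raw[max(0, i - half):i + half + 1]
        curve.append(sum(seg) / len(seg))
    return curve

def top_highlights(curve, k=2):
    """Indices of the k highest-arousal shots (candidate highlights)."""
    return sorted(range(len(curve)), key=lambda i: curve[i], reverse=True)[:k]
```

A mid-level cue such as detected laughter could be folded in as an extra weighted term; the semantic-gap criticism in the passage is precisely that such hand-weighted low-level scores often fail to match what viewers perceive as a highlight.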