“…For example, by recognizing user engagement, a system can control turn-taking behaviors [12,13] and dialogue policies [14,15,16], thereby improving the quality of the user experience throughout the dialogue. As input features for engagement recognition, we can exploit non-verbal multimodal behaviors such as eye gaze [17,18,19,20,12,21,15], backchannels (e.g., "yeah") [19,21], laughter [22], head nodding [21], facial movement and direction [17,15], and spatial location and distance [23,24,12], as well as conversational interaction features such as adjacency pairs [19]. In addition, the direct use of low-level signals such as acoustic and image features has been explored [10,25,26,27].…”
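To make the feature-based setup concrete, the following is a minimal sketch, not taken from any of the cited works, of how hand-crafted multimodal behaviors like those listed above might be assembled into a feature vector and scored for engagement. All feature names, weights, and thresholds here are illustrative assumptions; in practice the weights would be learned from labeled interaction data.

```python
import math

def extract_features(window):
    """Map per-window behavior observations (a dict) to a feature vector.
    The keys are hypothetical placeholders for the behaviors surveyed above."""
    return [
        window.get("gaze_on_agent_ratio", 0.0),  # fraction of frames gazing at the agent
        window.get("backchannel_count", 0),      # e.g., "yeah", "uh-huh"
        window.get("laugh_count", 0),
        window.get("nod_count", 0),
        window.get("distance_m", 2.0),           # proxemic distance in meters
    ]

def engagement_score(features, weights, bias=0.0):
    """Weighted sum of features squashed to [0, 1] with a logistic function."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights: closer distance and more gaze/backchannels/nods
# push the score toward "engaged".
WEIGHTS = [2.0, 0.5, 0.8, 0.6, -1.0]

window = {"gaze_on_agent_ratio": 0.9, "backchannel_count": 3,
          "laugh_count": 1, "nod_count": 2, "distance_m": 1.0}
score = engagement_score(extract_features(window), WEIGHTS, bias=-1.5)
engaged = score > 0.5
```

A real system would replace the fixed weights with a trained classifier and could concatenate the low-level acoustic and image features mentioned above into the same vector.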